November 13–15, 2018 - Shanghai, China
Simultaneous translation will be provided for all keynote and breakout sessions.
Wednesday, November 14 • 16:20 - 16:55
Production Cluster Monitoring and Remediation for High Reliability at eBay - Shijun Qian & YingKe Liu, eBay

eBay runs dozens of Kubernetes clusters across global data centers in different regions. Tens of thousands of nodes support eBay core services such as search and big data. The complexity of these large cross-regional production clusters, together with the extremely high cluster stability that our workloads require, makes monitoring and remediation a huge challenge for us. Building on Prometheus federation, component assertions, metric exporters, and our own monitoring tools, we built a series of clear dashboards and then implemented a complete cross-cluster remediation flow, incident management, and monitoring automation. In this talk, we hope to share our experience monitoring large-scale Kubernetes production clusters and our thoughts on the future.


YingKe Liu

MTS1, Senior Software Engineer, eBay
A senior software engineer at eBay, working in the host-runtime SIG, which focuses on the reliability of the OS, kernel, and Docker. More than 10 years' experience in software development. Responsible for DevOps work on the reliability of the OS, kernel, and Docker on eBay's Kubernetes cluster nodes, 10...

Shijun Qian

Software Engineer, eBay
Shijun (Daniel) Qian works on eBay's cloud team. He has a wide range of interests across cloud native computing, with a main focus on monitoring, cluster health management, and networking. He is also an active open source contributor (GitHub: @danielqsj): 1. Sponsor and maintainer...

Wednesday November 14, 2018 16:20 - 16:55
302 B
  • Skill Level Any