Open Sourcing Peloton, Uber’s Unified Resource Scheduler
source link: https://www.tuicool.com/articles/hit/UF3uIrn
First introduced by Uber in November 2018, Peloton, a unified resource scheduler, manages resources across distinct workloads, combining separate compute clusters. Peloton is designed for web-scale companies like Uber, with millions of containers and tens of thousands of nodes. It features advanced resource management capabilities such as elastic resource sharing, hierarchical max-min fairness, resource overcommitment, and workload preemption. As a cloud-agnostic system, Peloton can run in on-premise data centers or in the cloud.
At Uber, Peloton is a critical piece of infrastructure powering our compute clusters. It is currently running many kinds of batch workloads in production, and we are starting to migrate stateless services workloads to it as well.
Today, Uber is excited to announce that we are open sourcing Peloton. By allowing others in the cluster management community to leverage unified schedulers and workload co-location, Peloton will open the door for more efficient resource utilization and management across the community. Moreover, open sourcing Peloton will enable greater industry collaboration and open up the software to feedback and contributions from industry engineers, independent developers, and academics across the world.
Benefits of using Peloton
To our knowledge, no other open source scheduler combines all types of workloads at web scale. Prior to Peloton, each workload at Uber had its own cluster, resulting in many inefficiencies.
As shown by the Google Borg paper, co-locating diverse workloads on shared clusters is key to improving cluster utilization and reducing overall cluster cost. Below, we outline some examples of how co-locating mixed workloads drives utilization in our clusters, as well as helps us more accurately plan cluster provisioning:
- Resource overcommitment and job preemption are key to improving cluster resource utilization. However, it is very expensive to preempt online jobs, such as stateless or stateful services, which are often latency sensitive. To avoid preempting these latency-sensitive jobs, we co-locate low-priority, preemptible batch jobs on the same cluster, enabling us to better utilize overcommitted resources.
- As Uber services move towards an active-active architecture, we will have capacity reserved for disaster recovery (DR) in each data center. That DR capacity can be used for batch jobs until a data center failover occurs. Also, sharing clusters with mixed workloads means we no longer need to buy extra DR capacity for online and batch workloads separately.
- Uber’s online workloads spike during big events like Halloween or New Year’s Eve. We need to plan capacity for these high-traffic events well in advance, requiring us to buy hardware separately for online and batch jobs. During the rest of the year, this extra hardware is underutilized, leading to unnecessary cost. By co-locating both workloads on the same cluster, we can lend capacity from batch workloads to online workloads during those spikes without buying extra hardware.
- Different workloads often have complementary resource profiles. For example, stateful services or batch jobs might be disk IO intensive, while stateless services often use little disk IO. Given these profiles, it makes more sense to co-locate stateless services with stateful services or batch jobs on the same cluster.
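The preemption idea in the first point can be illustrated with a small sketch. This is a hypothetical example, not Peloton's actual preemption logic: when a latency-sensitive job needs capacity, the scheduler frees resources by evicting preemptible batch tasks, lowest priority first. The `Task` type and `tasks_to_preempt` helper are made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    cpus: float
    priority: int      # higher number = more important
    preemptible: bool  # only batch tasks are marked preemptible

def tasks_to_preempt(running, cpus_needed):
    """Pick preemptible tasks, lowest priority first, until at least
    `cpus_needed` CPUs are freed for a latency-sensitive job.
    Returns [] if the demand cannot be met without touching
    non-preemptible (online) tasks."""
    victims, freed = [], 0.0
    for t in sorted((t for t in running if t.preemptible),
                    key=lambda t: t.priority):
        if freed >= cpus_needed:
            break
        victims.append(t)
        freed += t.cpus
    return victims if freed >= cpus_needed else []
```

For example, with an online service and two batch tasks running, a request for 4 CPUs would evict both batch tasks (lowest priority first) while leaving the online service untouched; a request that only batch capacity cannot satisfy returns nothing, signalling that the online workload must wait rather than preempt another online job.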
Realizing that these scenarios would enable us to achieve greater operational efficiency, improve capacity planning, and optimize resource sharing, it was evident that we needed to co-locate different workloads on a single, shared compute platform. A unified resource scheduler lets us manage all kinds of workloads and use our resources as efficiently as possible, both in private data centers and in the cloud.
Peloton will support all of Uber’s workloads on a single, shared platform, balancing resource usage through elastic sharing and helping teams better plan for future capacity needs. Learn more about these benefits by reading our recent article on Peloton.
Features in the current release
Uber has been running Peloton in production for more than a year, and it has scaled and performed well. Below are some of the feature highlights:
- Elastic Resource Sharing: Support hierarchical resource pools to elastically share resources among different teams.
- Resource Overcommit and Task Preemption: Improve cluster utilization by scheduling workloads using slack resources and preempting best-effort workloads.
- Optimized for Big Data Workloads: Support advanced Apache Spark features such as dynamic resource allocation.
- Optimized for Machine Learning: Support GPU and gang scheduling for TensorFlow and Horovod.
- Protobuf/gRPC-based API: Support language bindings such as Golang, Java, Python, and Node.js.
- Co-scheduling Mixed Workloads: Support mixed workloads such as batch, stateless, and stateful jobs in a single cluster.
- High Scalability: Scale to millions of containers and tens of thousands of nodes, as shown by the benchmark tests in our recent KubeCon talk.
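The elastic resource sharing in the first bullet builds on max-min fairness: each resource pool's demand is satisfied as far as possible, and unmet demands split the remaining capacity equally. Below is a minimal sketch of flat (non-hierarchical, unweighted) max-min fair allocation via progressive filling; Peloton's actual implementation applies this hierarchically across nested resource pools and is not shown here.

```python
def max_min_fair(capacity, demands):
    """Progressive filling: repeatedly split the remaining capacity
    equally among unsatisfied demands. No pool receives more than it
    asked for; pools that hit their demand drop out, and the leftover
    is shared among the rest."""
    alloc = {pool: 0.0 for pool in demands}
    remaining = float(capacity)
    unmet = dict(demands)
    while unmet and remaining > 1e-9:
        share = remaining / len(unmet)
        for pool, demand in list(unmet.items()):
            give = min(share, demand - alloc[pool])
            alloc[pool] += give
            remaining -= give
            if alloc[pool] >= demand - 1e-9:
                del unmet[pool]  # demand satisfied; stop filling it
    return alloc
```

For instance, with 10 CPUs and three pools demanding 2, 5, and 8 CPUs, the small pool is fully satisfied with 2, and the other two split the remaining 8 evenly at 4 each; no pool can gain without taking from a pool that has less.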
Uber’s Peloton team is also working on stateless service support, coming soon. Please visit the Next Steps section of our article on Peloton for more details.
Get started
We hope you try out Peloton for yourself! Learn more by reading our recent article on Peloton and our documentation, or join our Slack channel with any questions about the software.