In today's fast-moving world, having access to up-to-date business metrics is key to making data-driven, customer-centric decisions. With over 1,000 daily flights servicing more than 100 cities and 42 million customers per year, JetBlue has a lot of data to crunch, answering questions such as: What is the utilization of a given route? What is the projected load of a flight? How many flights were on time? What is the idle time of each plane model at a given airport? To provide decision-makers with answers to these and other inquiries in a timely fashion, JetBlue partnered with Microsoft to develop a flexible and extensible reporting solution based on Apache Spark and Azure Databricks.

A key data source for JetBlue is a recurring batch file which lists all customer bookings created or changed during the last batch period. For example, a batch file on January 10 may include a newly created future booking for February 2, an upgrade to a reservation for a flight on March 5, or a listing of customers who flew on all flights on January 10. To keep business metrics fresh, each batch file must result in the re-computation of the metrics for each day listed in the file.

This poses an interesting scaling challenge for the Spark job computing the metrics: how do we keep the metrics production code simple and readable while still being able to re-process metrics for hundreds of days in a timely fashion? The remainder of this article walks through various scaling techniques for dealing with scenarios that require large numbers of Spark jobs to be run on Azure Databricks and presents solutions that reduced processing times by over 60% compared to our initial solution.

## Developing a Solution

At the outset of the project, we had two key solution constraints: simplicity and time. First, to aid maintainability and onboarding, all Spark code should be simple and easily understandable, even to novices in the technology. Second, to keep business metrics relevant for JetBlue decision-makers, all re-computations should terminate within a few minutes.

These two constraints were immediately at odds: a natural way to scale jobs in Spark is to leverage partitioning and operate on larger batches of data in one go; however, this complicates code understanding and performance tuning, since developers must be familiar with partitioning, balancing data across partitions, and so on. To keep the code as straightforward as possible, we therefore wanted to implement the business metrics Spark jobs in a direct and easy-to-follow way, and to have a single parameterized Spark job that computes the metrics for a given booking day.
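To make the intended shape of the solution concrete, below is a minimal sketch of what such a single-day, parameterized metrics job could look like. It is an illustration only, not JetBlue's production code: the `bookings` table, its columns, the route-utilization metric, and the output location are all hypothetical placeholders.

```python
# Hypothetical sketch: one Spark job, parameterized by booking day,
# that recomputes a single metric for that day. Table and column
# names are placeholders, not JetBlue's actual schema.
import sys

from pyspark.sql import SparkSession, functions as F


def compute_route_utilization(spark: SparkSession, booking_day: str):
    """Compute a route-utilization metric for one booking day."""
    bookings = (
        spark.table("bookings")                      # placeholder source table
        .where(F.col("booking_day") == booking_day)  # restrict to the requested day
    )
    return (
        bookings
        .groupBy("route")
        .agg((F.sum("seats_taken") / F.sum("seats_available")).alias("utilization"))
    )


if __name__ == "__main__":
    day = sys.argv[1]  # e.g. "2018-01-10", passed in once per job run
    spark = SparkSession.builder.appName(f"metrics-{day}").getOrCreate()
    result = compute_route_utilization(spark, day)
    # Placeholder sink; each run overwrites only that day's metric output.
    result.write.mode("overwrite").parquet(f"/mnt/metrics/route_utilization/{day}")
```

The appeal of this shape is that each run reads and writes only a single booking day, so the code stays easy to follow; the open question, which the rest of the article addresses, is how to execute many such runs quickly.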
## Cluster Size and Spark Job Processing Time

After implementing the business metrics Spark job with JetBlue, we immediately faced a scaling concern. For many Spark jobs, including JetBlue's, there is a ceiling on the speed-ups that can be gained by simply adding more workers to the Spark cluster: past a certain point, adding more workers won't significantly decrease processing times. This is due to added communication overheads, or simply because there is not enough natural partitioning in the data to enable efficient distributed processing.

Figure 1 below demonstrates the aforementioned cluster-size-related Spark scaling limit with the example of a simple word-count job (a simplified sketch of such a job appears at the end of this section). The code for the job can be found in the Resources section below.

*Figure 1: Processing time of the word-count job as a function of cluster size.*

The graph clearly shows that we encounter diminishing returns after adding only 5 machines to the cluster, and that past a cluster size of 15 machines, adding more machines won't speed up the job.

After using cluster size to scale JetBlue's business metrics Spark job, we came to an unfortunate realization.
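For reference, a simple word-count job of the kind measured in Figure 1 typically follows the familiar map/reduce shape. The sketch below conveys the general idea only; it is not the exact code from the Resources section, and the input and output paths are placeholders.

```python
# Hypothetical sketch of a simple Spark word-count job, in the spirit
# of the Figure 1 benchmark. Paths are placeholders.
from operator import add

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("word-count").getOrCreate()

counts = (
    spark.sparkContext.textFile("/mnt/data/corpus/*.txt")  # placeholder input path
    .flatMap(lambda line: line.split())                    # split each line into words
    .map(lambda word: (word, 1))                           # pair every word with a count of 1
    .reduceByKey(add)                                      # sum the counts per word
)

counts.saveAsTextFile("/mnt/data/word-counts")             # placeholder output path
```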