Spark is a distributed computing framework that has skyrocketed in popularity over the last several years for data engineering and analytics use cases. This paper provides a brief overview of Spark’s strengths and weaknesses in the context of data science and machine learning workflows.
While Spark is extremely effective for certain types of workloads on very large datasets, it has drawbacks, including performance overhead for some workloads, onerous setup and management, and competition from newer distributed computing frameworks. Enterprises should understand the pros and cons of Spark so they can implement an analytics technology strategy that incorporates Spark for projects that benefit from it and supports alternative options where its complexity is unnecessary or even detrimental to the business.