Apache Spark is an open-source, distributed computing framework and set of libraries for large-scale data processing, including near-real-time stream processing. Spark was created in 2009 at UC Berkeley to address many of the shortcomings of Apache Hadoop's MapReduce model, and it is often much faster than Hadoop for analytic workloads because it keeps intermediate data in memory (RAM) rather than writing it to disk between steps.
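To make the in-memory point concrete, here is a minimal PySpark sketch (the Parquet file path is hypothetical): after the first action materializes the cache, later actions read the DataFrame from RAM instead of re-reading it from disk.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

df = spark.read.parquet("events.parquet")  # hypothetical dataset
df.cache()                                 # mark the DataFrame for in-memory storage

df.count()   # first action reads from disk and populates the cache
df.count()   # subsequent actions are served from memory

spark.stop()
```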
Spark ships with built-in libraries that implement machine learning algorithms as parallel jobs, making them straightforward to scale across many compute resources. It is also one of the most actively developed open-source frameworks for large-scale data processing.
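As a sketch of what those built-in libraries look like in practice, the snippet below trains a logistic-regression model with MLlib on a tiny inline dataset; the column names ("features", "label") are MLlib's defaults, and the fit itself runs as distributed tasks on the executors.

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

# Tiny inline stand-in for a real training set
train = spark.createDataFrame(
    [(Vectors.dense([0.0, 1.1]), 0.0),
     (Vectors.dense([2.0, 1.0]), 1.0),
     (Vectors.dense([2.0, 1.3]), 1.0)],
    ["features", "label"],
)

# fit() is executed as parallel tasks across the cluster
model = LogisticRegression(maxIter=10).fit(train)
print(model.coefficients)

spark.stop()
```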
Spark applications consist of a driver process and a set of executor processes. The driver process is responsible for three things: maintaining information about the Spark application; responding to the user's program or input; and analyzing, distributing, and scheduling work across the executors.
The executors are responsible for executing the code assigned to them by the driver and for reporting the state of that computation back to the driver node.
[Figure: Spark driver and executor architecture. Source: Apache Spark documentation.]
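A minimal sketch of that split: everything below runs in the driver process, which turns the computation into tasks; the work inside the lambda executes on the executors, and their per-partition results are combined back on the driver.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("driver-executor-demo").getOrCreate()
sc = spark.sparkContext

# 8 partitions -> up to 8 parallel tasks across the executors
rdd = sc.parallelize(range(1_000_000), numSlices=8)

# Each executor squares the numbers in its partitions;
# the driver receives and sums the partial results.
total = rdd.map(lambda x: x * x).sum()
print(total)

spark.stop()
```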
In general, Spark will be most appropriate when your data cannot fit into memory on a single machine, i.e., datasets in the hundreds of gigabytes or more. Some of the most popular use cases for Spark include batch ETL, stream processing with Structured Streaming, machine learning with MLlib, and interactive analytics with Spark SQL.
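As one small example of the interactive-analytics use case, the sketch below registers an illustrative DataFrame as a temporary view and queries it with Spark SQL (the table and column names are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

# Illustrative data; in practice this would come from files or tables
sales = spark.createDataFrame(
    [("US", 100.0), ("US", 250.0), ("DE", 80.0)],
    ["country", "amount"],
)
sales.createOrReplaceTempView("sales")

# Standard SQL, executed as a distributed Spark job
spark.sql(
    "SELECT country, SUM(amount) AS revenue FROM sales GROUP BY country"
).show()

spark.stop()
```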
Spark involves more processing overhead and a more complicated setup than some other data processing options. Alternatives such as Ray and Dask have recently emerged.