In a previous post we introduced Spark, a framework that will play an important role in the Big Data arena. A good starting point to understand what Spark is can be found on this page from Databricks, but let me reproduce an overview in this post.
Spark runs on top of existing Hadoop clusters to provide enhanced and additional functionality. Although Hadoop is effective for storing vast amounts of data cheaply, the computations it enables with MapReduce are highly limited: MapReduce can only execute simple computations and uses a high-latency batch model. Spark provides a more general and powerful alternative to Hadoop's MapReduce, offering rich functionality such as stream processing, machine learning, and graph computations. Spark supports out of the box deployment within an existing Hadoop v1 cluster or a Hadoop v2 YARN cluster. Additionally, Spark has built-in scripts for launching on Amazon EC2, as I will cover in a future post.
The Spark ecosystem is composed of the following components:
Spark Core is the underlying general execution engine for the Spark platform that all other functionality is built on top of. It provides in-memory computing capabilities to deliver speed, a generalized execution model to support a wide variety of applications, and Java, Scala, and Python APIs for ease of development.
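Spark Core's programming model centers on applying transformations such as map, filter, and reduce to distributed collections (RDDs). As a rough illustration of the style, here is a word count written in plain Python over an in-memory list — this is a sketch of the idea, not the actual Spark API, where the list would be an RDD partitioned across the cluster:

```python
from collections import Counter

# Plain-Python sketch of Spark's flatMap / reduceByKey word-count pattern.
# An ordinary list stands in for a distributed RDD of text lines.
lines = [
    "spark runs on hadoop",
    "spark provides stream processing",
]

# "flatMap" step: split each line into words
words = [w for line in lines for w in line.split()]

# "map" + "reduceByKey" step: count occurrences per word
counts = Counter(words)

print(counts["spark"])  # -> 2, "spark" appears once in each line
```

In real Spark the same pipeline looks almost identical, but each stage runs in parallel across the cluster's partitions instead of on a single list.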
Shark is a SQL engine for Hive data that enables unmodified Hadoop Hive queries (interactive SQL queries for exploring data) to run up to 100x faster (according to Databricks) on existing deployments and data. It also provides powerful integration with the rest of the Spark ecosystem (e.g., integrating SQL query processing with machine learning).
Spark Streaming enables powerful interactive and analytical applications over both streaming data (new data arriving in real time) and historical data, while inheriting Spark's ease of use and fault-tolerance characteristics. It readily integrates with a wide variety of popular data sources, including HDFS, Flume, Kafka, and Twitter.
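Under the hood, Spark Streaming chops the live stream into small batches and runs a regular Spark computation over each one. The micro-batch idea can be sketched in plain Python — here a simple generator stands in for a real source like Kafka or Flume, so this is an analogy, not Spark Streaming's API:

```python
from collections import Counter

def micro_batches(stream, batch_size):
    """Group an incoming stream of events into fixed-size batches,
    mimicking Spark Streaming's discretized-stream (DStream) model."""
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush any trailing partial batch
        yield batch

# A generator stands in for a live source such as Kafka or Flume.
events = iter(["click", "view", "click", "view", "click"])

# Running state updated batch by batch, as a stateful streaming job would.
running = Counter()
for batch in micro_batches(events, batch_size=2):
    running.update(batch)

print(running["click"])  # -> 3
```

Because each batch is an ordinary Spark computation, the streaming and batch code paths share the same fault-tolerance and programming model.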
MLlib is a scalable machine learning library that delivers both high-quality algorithms (e.g., multiple iterations to increase accuracy) and blazing speed (up to 100x faster than MapReduce, according to Databricks). The library is usable in Java, Scala, and Python as part of Spark applications, so you can include it in complete workflows. Machine learning has quickly emerged as a critical piece in mining Big Data for actionable insights, and MLlib will play an important role there. It is also worth mentioning that the Apache Mahout community (a machine learning library for Hadoop since 2009) has decided to rework Mahout to support Apache Spark.
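MLlib's speed advantage over MapReduce comes largely from keeping the data set in memory across the many passes that learning algorithms need. A minimal plain-Python illustration of such an iterative algorithm — simple gradient descent fitting a one-parameter line, not MLlib's actual API:

```python
# Fit y = w * x by gradient descent on mean squared error.
# Each iteration re-reads the same data set: exactly the access
# pattern that benefits from Spark's in-memory caching, and that
# forces disk round-trips under plain MapReduce.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (x, y) pairs, true w = 2

w = 0.0
lr = 0.05
for _ in range(100):
    # gradient of mean squared error with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= lr * grad

print(round(w, 3))  # converges to 2.0
```

In MapReduce, each of those 100 iterations would be a separate job reading its input from disk; Spark keeps the data cached, which is where the order-of-magnitude speedups come from.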
There are other ongoing projects in the Spark community (still in alpha): BlinkDB is an approximate query engine for interactive SQL queries in Shark that allows users to trade off query accuracy for response time; it enables interactive queries over massive data by using data samples and presenting results annotated with meaningful error bars. GraphX is a graph computation engine built on top of Spark that enables users to interactively build, transform, and reason about graph-structured data at scale. SparkR is a package for the R statistical language that enables R users to leverage Spark functionality interactively from within the R shell.
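The accuracy-for-latency trade-off behind BlinkDB can be sketched with ordinary random sampling: estimate an aggregate from a small sample and report an error bar instead of scanning everything. A plain-Python illustration (fixed seed, synthetic data — not BlinkDB itself):

```python
import math
import random

random.seed(42)

# A "massive" table of values; BlinkDB would sample rather than scan it all.
population = [random.gauss(100, 15) for _ in range(100_000)]

# Answer the query from a 1% sample instead of the full data set.
sample = random.sample(population, 1_000)
n = len(sample)
mean = sum(sample) / n

# The standard error of the mean gives the "error bar" on the estimate.
variance = sum((x - mean) ** 2 for x in sample) / (n - 1)
stderr = math.sqrt(variance / n)
low, high = mean - 1.96 * stderr, mean + 1.96 * stderr

print(f"estimated mean: {mean:.1f} +/- {1.96 * stderr:.1f}")
```

Scanning 1% of the rows cuts the work by roughly 100x, and the confidence interval tells the user exactly how much accuracy was traded away.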
The open source Apache Spark project can be downloaded from Apache. The site also contains installation instructions, video tutorials, and documentation to get you started.
I hope this explanation has been a helpful introduction to this interesting Big Data framework. In future posts I will go into more detail about Spark's components.
UPDATE: Databricks has announced today a new addition to the Spark ecosystem called Spark SQL. Spark SQL is separate from Shark and does not use Hive under the hood. Thanks to Marc de Palol for the pointer to this information!