Apache Spark is a fast and
general-purpose cluster computing system. It provides high-level APIs in Java,
Scala, Python and R, and an optimized engine that supports general execution
graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data
processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing.
Is Apache Spark going to
replace Hadoop?
Hadoop is a parallel data
processing framework that has traditionally been used to run map/reduce jobs.
These are long-running jobs that take minutes or hours to complete. Spark is
designed to run on top of Hadoop as an alternative to the traditional
batch map/reduce model, one that can be used for real-time stream processing
and fast interactive queries that finish within seconds. So Hadoop supports
both traditional map/reduce and Spark.
We should look at Hadoop
as a general-purpose framework that supports multiple models, and we should look
at Spark as an alternative to Hadoop MapReduce rather than a replacement for
Hadoop.
You can run Spark
without HDFS in clustered mode as well. One of the easiest ways is to
use MapR, where the file system is not HDFS, but you could also use Spark by
reading and writing data only to a system like Kafka.
Spark is independent. By
default there is no storage mechanism in Spark, so to store data you need a fast and
scalable file system. You can use S3, HDFS, or any other file system, but if
you use Hadoop it comes at very low cost.
Hadoop MapReduce vs. Spark: Which One to Choose?
Spark relies on RAM
rather than network and disk I/O, so it is relatively fast compared to Hadoop. But
because it uses a large amount of RAM, it needs dedicated high-end physical machines to
produce effective results.
It all depends, and the
variables on which this decision depends keep changing dynamically over
time.
Difference between Hadoop MapReduce and Apache Spark
Spark stores data
in memory whereas Hadoop stores data on disk. Hadoop uses replication to
achieve fault tolerance, whereas Spark uses a different data storage model,
resilient distributed datasets (RDDs), which achieve fault
tolerance in a clever way that minimizes network I/O.
From the Spark academic
paper: "RDDs achieve fault tolerance through a notion of lineage: if a
partition of an RDD is lost, the RDD has enough information to rebuild just
that partition." This removes the need for replication to achieve fault
tolerance.
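As a rough illustration of lineage in the spark-shell, the snippet below builds an RDD through two transformations and prints its lineage with toDebugString (the file path and the "ERROR" filter are only placeholders). If a partition of upperErrors is lost, Spark can recompute just that partition by replaying these transformations on the corresponding partition of the input.
scala> val logFile = sc.textFile("/usr/local/spark/aa.txt")
scala> val upperErrors = logFile.filter(line => line.contains("ERROR")).map(line => line.toUpperCase)
scala> upperErrors.toDebugString    // prints the chain of parent RDDs, i.e. the lineage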
Do I need to learn Hadoop first to learn Apache Spark?
No, you don't need to
learn Hadoop to learn Spark. Spark started as an independent project. But after YARN
and Hadoop 2.0, Spark became popular because it can run on top of HDFS along
with other Hadoop components. Spark has become another data processing engine
in the Hadoop ecosystem, which is good for businesses and the community as it
adds more capability to the Hadoop stack.
For developers, there is
almost no overlap between the two. Hadoop is a framework in which you write
MapReduce jobs by inheriting from Java classes. Spark is a library that enables
parallel computation via function calls.
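For example, in the spark-shell a parallel computation is just a chain of function calls on an RDD (the numbers here are arbitrary):
scala> val nums = sc.parallelize(1 to 100000)    // distribute a local collection across the cluster
scala> nums.map(n => n * 2).filter(n => n % 3 == 0).count()    // two transformations followed by one action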
For operators running
a cluster, there is an overlap in general skills, such as monitoring,
configuration, and code deployment.
Features of Spark-
Speed
Run programs up to 100x
faster than Hadoop MapReduce in memory, or 10x faster on disk.
Ease of Use
Write applications
quickly in Java, Scala, Python, R.
Generality
Combine SQL, streaming,
and complex analytics.
Runs Everywhere
Spark runs on Hadoop,
Mesos, standalone, or in the cloud. It can access diverse data sources
including HDFS, Cassandra, HBase, and S3.
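For instance, the same sc.textFile call can read from different storage systems just by changing the URI scheme; the host, bucket, and paths below are placeholders, and the S3 example assumes the Hadoop S3 connector and credentials are configured:
scala> val localData = sc.textFile("/usr/local/spark/aa.txt")
scala> val hdfsData = sc.textFile("hdfs://namenode:9000/data/input.txt")
scala> val s3Data = sc.textFile("s3n://my-bucket/data/input.txt")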
Spark’s major use cases over Hadoop
- Iterative Algorithms in Machine Learning
- Interactive Data Mining and Data Processing
- Data warehousing: Spark provides a fully Apache Hive-compatible data
warehousing engine that can run up to 100x faster than Hive.
- Stream processing: log processing and fraud detection
in live streams for alerts, aggregates, and analysis (see the Spark Streaming sketch after this list)
- Sensor data processing: where data is fetched and
joined from multiple sources, in-memory datasets are really helpful because they are
easy and fast to process.
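As a minimal Spark Streaming sketch of the stream-processing use case, the standalone program below counts words arriving on a local socket in 10-second batches; the host, port, and batch interval are assumptions, and you could feed it with something like nc -lk 9999.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    // Local mode with two threads: one to receive the stream, one to process it.
    val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))
    val lines = ssc.socketTextStream("localhost", 9999)    // assumed host and port
    // Classic word count over each 10-second batch of lines.
    val counts = lines.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
    counts.print()    // print each batch's counts to the console
    ssc.start()
    ssc.awaitTermination()
  }
}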
Spark Setup-
http://spark.apache.org/downloads.html
It’s easy to run locally on one machine — all you need is to have java installed on your system PATH, or the JAVA_HOME environment variable pointing to a Java installation.
Extract the downloaded file and add the Spark path (for example, SPARK_HOME and its bin directory) to your .bashrc file.
Then launch the Spark shell:
./bin/spark-shell
Now the Spark shell will start and you can begin programming in Scala.
Read a file from your local machine:
scala>val textFile = sc.textFile("/usr/local/spark/aa.txt")
A few actions:
scala> textFile.count()    // number of lines in the file
scala> textFile.first()    // first line of the file
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))    // lines containing "Spark"
Run a word count job in Spark using Scala-
scala> import java.lang.Math
Find the largest number of words on a single line:
scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
Now count how many times each word occurs:
scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
scala> wordCounts.collect()
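The same word count can also be packaged as a standalone Scala application and run with spark-submit; this is only a sketch, assuming spark-core is on the classpath, the input file exists, and the output directory does not already exist.
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)
    val textFile = sc.textFile("/usr/local/spark/aa.txt")    // input path is an assumption
    val wordCounts = textFile.flatMap(line => line.split(" "))
                             .map(word => (word, 1))
                             .reduceByKey((a, b) => a + b)
    wordCounts.saveAsTextFile("/tmp/wordcount-output")       // Spark will not overwrite an existing directory
    sc.stop()
  }
}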
Spark also provides a Python API. To run Spark interactively in a Python interpreter, use
./bin/pyspark
Spark has also provided an experimental R API since 1.4 (only the DataFrame API is included). To run Spark interactively in an R interpreter, use
./bin/sparkR
Run word count using the Spark examples directory-
hduser@priyanka-VM:/usr/local/spark/bin$ ./run-example org.apache.spark.examples.JavaWordCount /home/priyanka/Desktop/Big-Data/test.txt
You will get output similar to the following-
16/01/11 14:33:33 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:874
16/01/11 14:33:33 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (ShuffledRDD[4] at reduceByKey at JavaWordCount.java:61)
16/01/11 14:33:33 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
16/01/11 14:33:33 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, PROCESS_LOCAL, 1239 bytes)
16/01/11 14:33:33 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
16/01/11 14:33:33 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
16/01/11 14:33:33 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 13 ms
16/01/11 14:33:33 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 1512 bytes result sent to driver
16/01/11 14:33:33 INFO DAGScheduler: ResultStage 1 (collect at JavaWordCount.java:68) finished in 0.159 s
16/01/11 14:33:33 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 170 ms on localhost (1/1)
16/01/11 14:33:33 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
16/01/11 14:33:33 INFO DAGScheduler: Job 0 finished: collect at JavaWordCount.java:68, took 3.167659 s
Professor: 1
Lecturer: 1
Business: 1
Sharma: 1
Shabbir: 1
Noida: 1
Kumar: 1
Bangalore: 1
9256458798: 1
8547987412: 1
Sampath: 1
Reddy: 2
: 170
Hyderabad: 2
Mohan: 1
Gurgaon: 1
Mahesh: 1
Engineer: 2
Manish: 1
Khan: 1
8521548932: 1
8882148796: 1
9314573259: 1
16/01/11 14:33:34 INFO SparkUI: Stopped Spark web UI at http://10.0.2.15:4040
16/01/11 14:33:34 INFO DAGScheduler: Stopping DAGScheduler
16/01/11 14:33:35 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/01/11 14:33:35 INFO Utils: path = /tmp/spark-a7153884-77c3-45e3-a6d8-be7452811e8f/blockmgr-57c09b89-2d01-4c30-a9ee-137b620d42d1, already present as root for deletion.
16/01/11 14:33:35 INFO MemoryStore: MemoryStore cleared
16/01/11 14:33:35 INFO BlockManager: BlockManager stopped
16/01/11 14:33:35 INFO BlockManagerMaster: BlockManagerMaster stopped
16/01/11 14:33:35 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/01/11 14:33:35 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
16/01/11 14:33:35 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
16/01/11 14:33:35 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
16/01/11 14:33:35 INFO SparkContext: Successfully stopped SparkContext
16/01/11 14:33:35 INFO Utils: Shutdown hook called
16/01/11 14:33:35 INFO Utils: Deleting directory /tmp/spark-a7153884-77c3-45e3-a6d8-be7452811e8f