Apache Spark is a fast and
general-purpose cluster computing system. It provides high-level APIs in Java,
Scala, Python and R, and an optimized engine that supports general execution
graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data
processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing.
Is Apache Spark going to
replace Hadoop?
Hadoop is a parallel data
processing framework that has traditionally been used to run map/reduce jobs.
These are long-running jobs that take minutes or hours to complete. Spark is
designed to run on top of Hadoop as an alternative to the traditional
batch map/reduce model, one that can be used for real-time stream processing
and fast interactive queries that finish within seconds. So Hadoop supports
both traditional map/reduce and Spark.
We should look at Hadoop
as a general-purpose framework that supports multiple models, and we should look
at Spark as an alternative to Hadoop MapReduce rather than a replacement for
Hadoop.
You can run Spark
without HDFS in clustered mode as well. One of the easiest ways is to
use MapR, where the file system is not HDFS, but you could also use Spark by
reading and writing data only to a system like Kafka.
Spark is independent. By
default there is no storage mechanism in Spark, so to store data you need a fast and
scalable file system. You can use S3, HDFS, or any other file system, but if
you use Hadoop it comes at very low cost.
Hadoop MapReduce vs. Spark: Which One to Choose?
Spark relies on RAM
rather than network and disk I/O, so it is relatively fast compared to Hadoop. But
because it uses a large amount of RAM, it needs dedicated high-end physical machines to
produce effective results.
It all depends, and the
variables on which this decision depends keep changing dynamically over
time.
Difference between Hadoop MapReduce and Apache Spark
Spark stores data
in memory whereas Hadoop stores data on disk. Hadoop uses replication to
achieve fault tolerance, whereas Spark uses a different data storage model,
resilient distributed datasets (RDDs), which achieve fault
tolerance in a clever way that minimizes network I/O.
From the Spark academic
paper: "RDDs achieve fault tolerance through a notion of lineage: if a
partition of an RDD is lost, the RDD has enough information to rebuild just
that partition." This removes the need for replication to achieve fault
tolerance.
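As a rough illustration of lineage in the spark-shell, the snippet below builds an RDD through two transformations and prints its lineage with toDebugString (the file path and the "ERROR" filter are only placeholders). If a partition of upperErrors is lost, Spark can recompute just that partition by replaying these transformations on the corresponding partition of the input.
scala> val logFile = sc.textFile("/usr/local/spark/aa.txt")
scala> val upperErrors = logFile.filter(line => line.contains("ERROR")).map(line => line.toUpperCase)
scala> upperErrors.toDebugString    // prints the chain of parent RDDs, i.e. the lineage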
Do I need to learn Hadoop first to learn Apache Spark?
No, you don't need to
learn Hadoop to learn Spark. Spark started as an independent project. But after YARN
and Hadoop 2.0, Spark became popular because it can run on top of HDFS along
with other Hadoop components. Spark has become another data processing engine
in the Hadoop ecosystem, which is good for businesses and the community as it
adds more capability to the Hadoop stack.
For developers, there is
almost no overlap between the two. Hadoop is a framework in which you write
MapReduce jobs by inheriting from Java classes. Spark is a library that enables
parallel computation via function calls.
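For example, in the spark-shell a parallel computation is just a chain of function calls on an RDD (the numbers here are arbitrary):
scala> val nums = sc.parallelize(1 to 100000)    // distribute a local collection across the cluster
scala> nums.map(n => n * 2).filter(n => n % 3 == 0).count()    // two transformations followed by one action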
For operators running
a cluster, there is an overlap in general skills, such as monitoring,
configuration, and code deployment.
Features of Spark-
Speed
Run programs up to 100x
faster than Hadoop MapReduce in memory, or 10x faster on disk.
Ease of Use
Write applications
quickly in Java, Scala, Python, R.
Generality
Combine SQL, streaming,
and complex analytics.
Runs Everywhere
Spark runs on Hadoop,
Mesos, standalone, or in the cloud. It can access diverse data sources
including HDFS, Cassandra, HBase, and S3.
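For instance, the same sc.textFile call can read from different storage systems just by changing the URI scheme; the host, bucket, and paths below are placeholders, and the S3 example assumes the Hadoop S3 connector and credentials are configured:
scala> val localData = sc.textFile("/usr/local/spark/aa.txt")
scala> val hdfsData = sc.textFile("hdfs://namenode:9000/data/input.txt")
scala> val s3Data = sc.textFile("s3n://my-bucket/data/input.txt")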
Spark’s major use cases over Hadoop
- Iterative Algorithms in Machine Learning
- Interactive Data Mining and Data Processing
- Data warehousing: Spark provides a fully Apache Hive-compatible data
warehousing engine that can run up to 100x faster than Hive.
- Stream processing: log processing and fraud detection
in live streams for alerts, aggregates, and analysis (see the Spark Streaming sketch after this list)
- Sensor data processing: where data is fetched and
joined from multiple sources, in-memory datasets are really helpful because they are
easy and fast to process.
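As a minimal Spark Streaming sketch of the stream-processing use case, the standalone program below counts words arriving on a local socket in 10-second batches; the host, port, and batch interval are assumptions, and you could feed it with something like nc -lk 9999.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    // Local mode with two threads: one to receive the stream, one to process it.
    val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))
    val lines = ssc.socketTextStream("localhost", 9999)    // assumed host and port
    // Classic word count over each 10-second batch of lines.
    val counts = lines.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
    counts.print()    // print each batch's counts to the console
    ssc.start()
    ssc.awaitTermination()
  }
}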
Spark Setup-
http://spark.apache.org/downloads.html
It’s easy to run locally on one machine — all you need is to have java installed on your system PATH, or the JAVA_HOME environment variable pointing to a Java installation.
Extract the downloaded file and add the Spark path (for example, SPARK_HOME and its bin directory) to your .bashrc file.
Then launch the Spark shell:
./bin/spark-shell
Now the Spark shell will start and you can begin programming in Scala.
Read a file from your local machine:
scala>val textFile = sc.textFile("/usr/local/spark/aa.txt")
A few actions:
scala> textFile.count()    // number of lines in the file
scala> textFile.first()    // first line of the file
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))    // lines containing "Spark"
Run a word count job in Spark using Scala-
scala> import java.lang.Math
Find the largest number of words on a single line:
scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
Now count how many times each word occurs:
scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
scala> wordCounts.collect()
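The same word count can also be packaged as a standalone Scala application and run with spark-submit; this is only a sketch, assuming spark-core is on the classpath, the input file exists, and the output directory does not already exist.
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)
    val textFile = sc.textFile("/usr/local/spark/aa.txt")    // input path is an assumption
    val wordCounts = textFile.flatMap(line => line.split(" "))
                             .map(word => (word, 1))
                             .reduceByKey((a, b) => a + b)
    wordCounts.saveAsTextFile("/tmp/wordcount-output")       // Spark will not overwrite an existing directory
    sc.stop()
  }
}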
Spark also provides a Python API. To run Spark interactively in a Python interpreter, use
./bin/pyspark
Spark has also provided an experimental R API since 1.4 (only the DataFrame API is included). To run Spark interactively in an R interpreter, use
./bin/sparkR
Run word count using the Spark examples directory-
hduser@priyanka-VM:/usr/local/spark/bin$ ./run-example org.apache.spark.examples.JavaWordCount /home/priyanka/Desktop/Big-Data/test.txt
You will get output similar to the following-
16/01/11 14:33:33 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:874
16/01/11 14:33:33 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (ShuffledRDD[4] at reduceByKey at JavaWordCount.java:61)
16/01/11 14:33:33 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
16/01/11 14:33:33 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, PROCESS_LOCAL, 1239 bytes)
16/01/11 14:33:33 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
16/01/11 14:33:33 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
16/01/11 14:33:33 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 13 ms
16/01/11 14:33:33 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 1512 bytes result sent to driver
16/01/11 14:33:33 INFO DAGScheduler: ResultStage 1 (collect at JavaWordCount.java:68) finished in 0.159 s
16/01/11 14:33:33 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 170 ms on localhost (1/1)
16/01/11 14:33:33 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
16/01/11 14:33:33 INFO DAGScheduler: Job 0 finished: collect at JavaWordCount.java:68, took 3.167659 s
Professor: 1
Lecturer: 1
Business: 1
Sharma: 1
Shabbir: 1
Noida: 1
Kumar: 1
Bangalore: 1
9256458798: 1
8547987412: 1
Sampath: 1
Reddy: 2
: 170
Hyderabad: 2
Mohan: 1
Gurgaon: 1
Mahesh: 1
Engineer: 2
Manish: 1
Khan: 1
8521548932: 1
8882148796: 1
9314573259: 1
16/01/11 14:33:34 INFO SparkUI: Stopped Spark web UI at http://10.0.2.15:4040
16/01/11 14:33:34 INFO DAGScheduler: Stopping DAGScheduler
16/01/11 14:33:35 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/01/11 14:33:35 INFO Utils: path = /tmp/spark-a7153884-77c3-45e3-a6d8-be7452811e8f/blockmgr-57c09b89-2d01-4c30-a9ee-137b620d42d1, already present as root for deletion.
16/01/11 14:33:35 INFO MemoryStore: MemoryStore cleared
16/01/11 14:33:35 INFO BlockManager: BlockManager stopped
16/01/11 14:33:35 INFO BlockManagerMaster: BlockManagerMaster stopped
16/01/11 14:33:35 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/01/11 14:33:35 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
16/01/11 14:33:35 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
16/01/11 14:33:35 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
16/01/11 14:33:35 INFO SparkContext: Successfully stopped SparkContext
16/01/11 14:33:35 INFO Utils: Shutdown hook called
16/01/11 14:33:35 INFO Utils: Deleting directory /tmp/spark-a7153884-77c3-45e3-a6d8-be7452811e8f