What is the difference between Spark, R, Python, and Hadoop in Data Science?

Frameworks: Hadoop - Spark

Languages: Python - R

———————————

Hadoop Framework:

It is commonly used for “Big Data”. Its two main concepts are “distributed storage”, where the Hadoop Distributed File System (“HDFS”) spreads the data across multiple nodes of a computer cluster, and “distributed processing”, where “MapReduce jobs” run on the nodes of that cluster.

The concept comes from a simple fact: if you have so much data that it can’t be processed in the needed time on one computer, you distribute the storage and processing of it across multiple computers.

Programmers typically write MapReduce jobs in “Java”; however, the framework’s ecosystem includes many applications (such as Hive and Pig) that make writing these jobs easier.
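To make the map and reduce steps concrete, here is a minimal single-machine sketch in plain Python. It is not Hadoop's API and nothing here is actually distributed: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample chunks are hypothetical, chosen only to illustrate the idea. Each string in the list stands in for a block of data that would live on a different node.

```python
from collections import defaultdict

def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)

def word_count(chunks):
    # Map: process every chunk independently (on a cluster, in parallel).
    mapped = []
    for chunk in chunks:
        mapped.extend(map_phase(chunk))
    # Shuffle: bring all pairs with the same word together.
    grouped = shuffle(mapped)
    # Reduce: collapse each group into one count.
    return dict(reduce_phase(k, v) for k, v in grouped.items())

# Hypothetical sample data: three "blocks" of a larger file.
chunks = ["big data big cluster", "data on many nodes", "big nodes"]
counts = word_count(chunks)
print(counts)
```

Because each chunk is mapped independently, the map step is the part that can be spread over many machines; only the shuffle needs to move data between them.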

Spark Framework:

It is also a framework, developed to address some limitations of “Hadoop MapReduce”, whose paradigm is to read data from disk, map a specific function across the data, reduce the results of the map, and store the results back on disk. (So the main problem is that every job reads from and writes to persistent disk storage, which is slow, especially when several jobs are chained together.)

Therefore Spark was developed to use “in-memory processing”, which gives it lower latency (it runs faster). It does this by using resilient distributed datasets (RDDs).
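The disk-versus-memory difference can be illustrated with a small analogy in plain Python. This is only a sketch under loud assumptions: it does not use Hadoop or Spark, the function names (disk_based_pipeline, in_memory_pipeline) are hypothetical, and real Spark adds much more on top of this idea. The first pipeline writes every intermediate result to a file and reads it back, the way chained MapReduce jobs do; the second keeps intermediate results in memory.

```python
import json
import os
import tempfile

def square(x):
    return x * x

def add_one(x):
    return x + 1

def disk_based_pipeline(data, steps, workdir):
    """Persist each step's output to disk and reload it for the next step."""
    path = os.path.join(workdir, "step_0.json")
    with open(path, "w") as f:
        json.dump(data, f)
    for i, step in enumerate(steps, start=1):
        with open(path) as f:          # read the previous result from disk
            current = json.load(f)
        result = [step(x) for x in current]
        path = os.path.join(workdir, f"step_{i}.json")
        with open(path, "w") as f:     # write this step's result to disk
            json.dump(result, f)
    with open(path) as f:
        return json.load(f)

def in_memory_pipeline(data, steps):
    """Keep intermediate results in memory between steps."""
    current = list(data)
    for step in steps:
        current = [step(x) for x in current]
    return current

data = [1, 2, 3, 4]
steps = [square, add_one]

with tempfile.TemporaryDirectory() as workdir:
    disk_result = disk_based_pipeline(data, steps, workdir)
memory_result = in_memory_pipeline(data, steps)

print(disk_result, memory_result)
```

Both pipelines compute the same answer; the in-memory version simply skips the file round trip between steps, which is where the speed-up comes from when many steps are chained.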

———————————

R:

It is an open-source statistical programming language that is mostly used by statisticians, data scientists, data analysts, etc.

The power of R lies in its packages, which allow you to manipulate and wrangle data-sets and analyze them using visualization, statistical methods, data mining, machine learning models, and so on.

You can use R as a programming language on Hadoop by using RHive, or on Spark by using SparkR.

———————————

Python:

It is a high-level, general-purpose programming language that can be used for many things, from building a website to analyzing data as R does.

You can write MapReduce jobs on Hadoop using Jython, or use Python on Spark using PySpark.

 

Follow: Nareman Darwish