Apache Spark Multi-node setup

In this article, I'll show how to set up a 2-node Spark cluster (i.e., a master and a slave/worker node).

Download a suitable distribution from the Apache Spark website (select the version and the package type you want).

Extract the archive and set SPARK_HOME to point to it:

export SPARK_HOME=/ebs/apps/spark
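
For example, assuming you downloaded a pre-built package into /ebs/apps (the exact file name depends on the version and package type you picked, so treat the one below as a placeholder):

tar -xzf spark-1.6.0-bin-hadoop2.6.tgz -C /ebs/apps/
mv /ebs/apps/spark-1.6.0-bin-hadoop2.6 /ebs/apps/spark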
Enable SSH connectivity between the master and the slaves:

sudo apt-get install openssh-server
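
Passwordless SSH from the master to each slave makes life easier (and is needed if you later use the start-all.sh script). A minimal sketch, run on the master, assuming a user ubuntu and a slave host slave-1 (both placeholders):

ssh-keygen -t rsa                # accept the defaults
ssh-copy-id ubuntu@slave-1       # copy the public key to the slave
ssh ubuntu@slave-1               # verify password-less login works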
Configure Spark via command-line options. The following options are available for the Spark master:
 -i HOST, --ip HOST       Hostname to listen on (deprecated, please use --host or -h)
 -h HOST, --host HOST     Hostname to listen on
 -p PORT, --port PORT     Port to listen on (default: 7077)
 --webui-port PORT        Port for web UI (default: 8080)
 --properties-file FILE   Path to a custom Spark properties file.
                          Default is conf/spark-defaults.conf.
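
For instance, to start the master bound to all interfaces with the port and web UI port spelled out explicitly (7077 and 8080 are the defaults anyway):

./sbin/start-master.sh -h 0.0.0.0 -p 7077 --webui-port 8080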

More details are in the official Spark standalone mode documentation.

Set the master node's host/bind IP to 0.0.0.0.

Binding to 0.0.0.0 lets workers (and clients) connect from any interface.

Master:

./sbin/start-master.sh -h 0.0.0.0

Slave:

./sbin/start-slave.sh spark://<master-hostname-ip>:7077
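
Alternatively, if you list your slave hostnames in conf/slaves on the master (one hostname per line, copied from conf/slaves.template), you can start the master and all slaves in one go. A small sketch, assuming a slave host named slave-1 (placeholder):

echo "slave-1" >> $SPARK_HOME/conf/slaves
$SPARK_HOME/sbin/start-all.sh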

Run Apache Spark on Windows (yeah, I know!)

Download a suitable distribution from the Apache Spark website (select the version and the package type you want).

Download winutils.exe (prebuilt copies are available in the various Hadoop winutils repositories on GitHub).

Place it in a directory (e.g., C:/Hadoop/bin/winutils.exe), then open a command prompt in that directory and run the following command.

winutils.exe chmod 777 /tmp/hive
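
If the \tmp\hive directory doesn't exist yet, create it first; I'm assuming here that it lives on the same drive you run Spark from:

mkdir C:\tmp\hive
winutils.exe chmod 777 C:\tmp\hive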

Before proceeding further, you need to set the HADOOP_HOME environment variable and the spark.driver.host Spark property.

Set HADOOP_HOME = C:/Hadoop (note: winutils.exe is picked up from %HADOOP_HOME%/bin, so point HADOOP_HOME at the root directory, not at bin).
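
For example, from a command prompt (set affects only the current session, setx persists it for new ones; C:\Hadoop is the assumed location from above):

set HADOOP_HOME=C:\Hadoop
setx HADOOP_HOME C:\Hadoop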


Set spark.driver.host=localhost
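
One way to do this is through <your-spark-directory>/conf/spark-defaults.conf (create it from spark-defaults.conf.template if needed) by adding the line:

spark.driver.host    localhost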


Run <your-spark-directory>/bin/spark-shell.cmd

Then open the Spark web UI in your browser:

http://localhost:4040/

To test Spark, create a file called test.json and put the following data into it.

{
    "glossary": {
        "title": "example glossary",
        "GlossDiv": {
            "title": "S",
            "GlossList": {
                "GlossEntry": {
                    "ID": "SGML",
                    "SortAs": "SGML",
                    "GlossTerm": "Standard Generalized Markup Language",
                    "Acronym": "SGML",
                    "Abbrev": "ISO 8879:1986",
                    "GlossDef": {
                        "para": "A meta-markup language, used to create markup languages such as DocBook.",
                        "GlossSeeAlso": ["GML", "XML"]
                    },
                    "GlossSee": "markup"
                }
            }
        }
    }
}

Run the following commands in the Spark shell, in sequence:


// In spark-shell, an existing SparkContext is already available as `sc`.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Note: Spark's JSON reader expects one JSON object per line by default, so
// either keep test.json on a single line or, on Spark 2.2+, add .option("multiLine", true).
val df = sqlContext.read.json("path/test.json")

// Displays the content of the DataFrame to stdout
df.show()
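
If you want to poke at it a bit more, you can also select the nested fields directly (the column paths below come from the test.json structure above):

df.select("glossary.title").show()
df.select("glossary.GlossDiv.GlossList.GlossEntry.GlossTerm").show()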
