Apache Spark Multi-node setup

In this article, I’ll show how to set up a 2-node Spark cluster (i.e., a master and a slave/worker node).

Download the suitable distribution from Apache’s Spark website (select the version and the package type you’d want).

Extract the archive and point SPARK_HOME at the extracted directory (/ebs/apps/spark in this setup):

export SPARK_HOME=/ebs/apps/spark

Enable SSH connectivity between the master and the slaves. On Debian/Ubuntu, install the SSH server on each node:

sudo apt-get install openssh-server

Configure Spark with command-line options. The Spark master accepts the following options:

 -i HOST, --ip HOST       Hostname to listen on (deprecated, please use --host or -h)
 -h HOST, --host HOST     Hostname to listen on
 -p PORT, --port PORT     Port to listen on (default: 7077)
 --webui-port PORT        Port for web UI (default: 8080)
 --properties-file FILE   Path to a custom Spark properties file.
                          Default is conf/spark-defaults.conf.

More details here.

Set the master node’s host to 0.0.0.0.

Binding to 0.0.0.0 makes the master listen on all network interfaces, so workers can connect from any network the machine is attached to.

Master:

./sbin/start-master.sh -h 0.0.0.0

Slave:

./sbin/start-slave.sh spark://<master-hostname-ip>:7077
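
If you have more than one worker to start, it helps to build the master URL once and reuse it. Here is a minimal sketch of that idea. MASTER_HOST is a made-up placeholder hostname, and the fallback to start-worker.sh is my assumption for newer Spark releases, where the launcher has been renamed; adjust both to your setup.

```shell
#!/bin/sh
# Sketch only: MASTER_HOST below is a hypothetical hostname, not a real machine.
SPARK_HOME="${SPARK_HOME:-/ebs/apps/spark}"            # install path used in this article
MASTER_HOST="${MASTER_HOST:-master.example.internal}"  # replace with your master's hostname or IP
MASTER_PORT="${MASTER_PORT:-7077}"                     # default master port
MASTER_URL="spark://${MASTER_HOST}:${MASTER_PORT}"

echo "Master URL: ${MASTER_URL}"

# Look for the worker launcher. start-slave.sh is the name used in this article;
# start-worker.sh is assumed to be its name in newer Spark releases.
launcher=""
for name in start-worker.sh start-slave.sh; do
    if [ -x "${SPARK_HOME}/sbin/${name}" ]; then
        launcher="${SPARK_HOME}/sbin/${name}"
        break
    fi
done

if [ -n "$launcher" ]; then
    "$launcher" "$MASTER_URL"
else
    echo "No worker launcher found under ${SPARK_HOME}/sbin; nothing started" >&2
fi
```

Run it on the worker node, for example as MASTER_HOST=10.0.0.5 sh start-worker-node.sh (the script name is just an example).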

Run Apache Spark on Windows (yeah, I know!)

Download the suitable distribution from Apache’s Spark website (select the version and the package type you’d want).

Download winutils.exe from here or here.

Place it in a directory (for example, C:/Hadoop/bin/winutils.exe), go to the directory containing winutils.exe and run the following command.

winutils.exe chmod 777 /tmp/hive

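Before you proceed further, you need to set HADOOP_HOME and spark.driver.host.

Set the environment variable HADOOP_HOME = C:/Hadoop (Note: winutils.exe is looked up in %HADOOP_HOME%/bin, so point HADOOP_HOME at the root directory, not at bin)

Set spark.driver.host=localhost. Note that spark.driver.host is a Spark configuration property rather than an ordinary environment variable; if the setting does not seem to take effect, you can pass it when launching the shell with --conf spark.driver.host=localhost.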

Run <your-spark-directory>/bin/spark-shell.cmd

Once the shell is running, open the Spark web UI in your browser:

http://localhost:4040/

For testing Spark, create a file called test.json and add the following data into it.

{
    "glossary": {
        "title": "example glossary",
		"GlossDiv": {
            "title": "S",
			"GlossList": {
                "GlossEntry": {
                    "ID": "SGML",
					"SortAs": "SGML",
					"GlossTerm": "Standard Generalized Markup Language",
					"Acronym": "SGML",
					"Abbrev": "ISO 8879:1986",
					"GlossDef": {
                        "para": "A meta-markup language, used to create markup languages such as DocBook.",
						"GlossSeeAlso": ["GML", "XML"]
                    },
					"GlossSee": "markup"
                }
            }
        }
    }
}

Run the following commands in Spark shell in sequence and test it:


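// In spark-shell, `sc` (the SparkContext) is already defined, so there is no need to declare it.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// read.json expects one JSON object per line by default. For a pretty-printed
// file like test.json, Spark 2.2+ needs .option("multiLine", "true") before .json(...).
val df = sqlContext.read.json("path/test.json")

// Displays the content of the DataFrame to stdout
df.show()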
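
A caveat about the sample file: read.json treats every line as a separate JSON record by default, so a pretty-printed file can show up as a single _corrupt_record column. If your Spark version has no multi-line JSON option, one workaround is to flatten the file to a single line first. Below is a minimal sketch of that; it uses a shortened stand-in for test.json and a temporary directory, both of which are just for illustration.

```shell
#!/bin/sh
# Sketch: flatten a pretty-printed JSON file into one line (JSON Lines style).
workdir="$(mktemp -d)"

# A shortened stand-in for the test.json shown above.
cat > "$workdir/test.json" <<'EOF'
{
    "glossary": {
        "title": "example glossary"
    }
}
EOF

# Valid JSON never contains raw newlines, carriage returns or tabs inside
# strings, so deleting them only removes formatting. The echo adds a final newline.
{ tr -d '\r\n\t' < "$workdir/test.json"; echo; } > "$workdir/test_oneline.json"

cat "$workdir/test_oneline.json"
```

Point read.json at the flattened file instead of the original and df.show() should list the nested columns.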

Configure multiple Maven repositories

What are build profiles?

Apache Maven 2.0 goes to great lengths to ensure that builds are portable. Among other things, this means allowing build configuration inside the POM, avoiding all filesystem references (in inheritance, dependencies, and other places), and leaning much more heavily on the local repository to store the metadata needed to make this possible.

However, sometimes portability is not entirely possible. Under certain conditions, plugins may need to be configured with local filesystem paths. Under other circumstances, a slightly different dependency set will be required, and the project’s artifact name may need to be adjusted slightly. And at still other times, you may even need to include a whole plugin in the build lifecycle depending on the detected build environment.

To address these circumstances, Maven 2.0 introduces the concept of a build profile. Profiles are specified using a subset of the elements available in the POM itself (plus one extra section), and are triggered in any of a variety of ways. They modify the POM at build time, and are meant to be used in complementary sets to give equivalent-but-different parameters for a set of target environments (providing, for example, the path of the appserver root in the development, testing, and production environments). As such, profiles can easily lead to differing build results from different members of your team. However, used properly, profiles can be used while still preserving project portability. Profiles also reduce the need for Maven’s -f option, which lets a user build from a separate POM with different parameters or configuration; keeping everything in a single POM makes the build more maintainable.

If you would like to point to multiple Maven repositories, create “settings.xml” and add the following code:

<?xml version="1.0" encoding="UTF-8" ?>
<settings xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.1.0 http://maven.apache.org/xsd/settings-1.1.0.xsd" xmlns="http://maven.apache.org/SETTINGS/1.1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

    <servers>

        <server>
            <username>user1</username>
            <password>password123</password>
            <id>central</id>
        </server>

        <server>
            <username>user2</username>
            <password>password123</password>
            <id>snapshots</id>
        </server>

    </servers>

    <profiles>

        <profile>

            <repositories>

                <repository>
                    <snapshots>
                        <enabled>false</enabled>
                    </snapshots>
                    <id>central</id>
                    <name>libs-release</name>
                    <url>https://mavenrepo.customdomain/artifactory/libs-release</url>
                </repository>

                <repository>
                    <snapshots />
                    <id>snapshots</id>
                    <name>libs-snapshot</name>
                    <url>https://mavenrepo.customdomain/artifactory/libs-snapshot</url>
                </repository>

                <repository>
                    <releases>
                        <enabled>true</enabled>
                    </releases>
                    <snapshots>
                        <enabled>true</enabled>
                    </snapshots>
                    <!-- Distinct id: "central" is already used by the first repository above -->
                    <id>maven-central</id>
                    <url>https://repo1.maven.org/maven2</url>
                </repository>

            </repositories>

            <pluginRepositories>

                <pluginRepository>
                    <snapshots>
                        <enabled>false</enabled>
                    </snapshots>
                    <id>central</id>
                    <name>plugins-release</name>
                    <url>https://mavenrepo.customdomain/artifactory/plugins-release</url>
                </pluginRepository>

                <pluginRepository>
                    <snapshots />
                    <id>snapshots</id>
                    <name>plugins-snapshot</name>
                    <url>https://mavenrepo.customdomain/artifactory/plugins-snapshot</url>
                </pluginRepository>

                <pluginRepository>
                    <releases>
                        <enabled>true</enabled>
                    </releases>
                    <snapshots>
                        <enabled>true</enabled>
                    </snapshots>
                    <!-- Distinct id: "central" is already used by the first plugin repository above -->
                    <id>maven-central</id>
                    <url>https://repo1.maven.org/maven2</url>
                </pluginRepository>

            </pluginRepositories>

            <id>artifactory</id>

        </profile>

    </profiles>

    <activeProfiles>
        <activeProfile>artifactory</activeProfile>
    </activeProfiles>

</settings>

Now place this file in the .m2 directory under your user home, i.e., ~/.m2/settings.xml.

This will enable connection with multiple repositories.

More documentation here.

Working with self-hosted Maven central repository

Ignore server SSL certificate validity check:

mvn -Dmaven.wagon.http.ssl.insecure=true -Dmaven.wagon.http.ssl.allowall=true -Dmaven.wagon.http.ssl.ignore.validity.dates=true

 

If you wish to apply this setting globally, add an alias for the mvn command:

In Linux (Ubuntu), edit ~/.bashrc and add the following line at the bottom of the file.

alias mvn="mvn -Dmaven.wagon.http.ssl.insecure=true -Dmaven.wagon.http.ssl.allowall=true -Dmaven.wagon.http.ssl.ignore.validity.dates=true"

Now, reload the file so the change takes effect in the current shell.

source ~/.bashrc

Now, whenever you run mvn in an interactive shell, the alias expands it to mvn -Dmaven.wagon.http.ssl.insecure=true -Dmaven.wagon.http.ssl.allowall=true -Dmaven.wagon.http.ssl.ignore.validity.dates=true, so the SSL certificate check is always skipped. Keep in mind that aliases are not expanded in non-interactive shell scripts by default, so build scripts will still run plain mvn.
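
If you would rather keep plain mvn strict and relax the checks only when you ask for it, a shell function is an alternative to the alias. This is just a sketch: the name mvn_insecure and the MVN_BIN override are my own inventions, not something Maven provides.

```shell
# Sketch: opt-in wrapper instead of a global alias.
# mvn_insecure and MVN_BIN are hypothetical names chosen for this example.
mvn_insecure() {
    # MVN_BIN lets you point at another Maven binary (or "echo" for a dry run).
    "${MVN_BIN:-mvn}" \
        -Dmaven.wagon.http.ssl.insecure=true \
        -Dmaven.wagon.http.ssl.allowall=true \
        -Dmaven.wagon.http.ssl.ignore.validity.dates=true \
        "$@"
}
```

After adding it to ~/.bashrc, mvn_insecure clean install runs the relaxed build, while mvn clean install keeps the normal certificate checks. Running MVN_BIN=echo mvn_insecure clean install prints the arguments without invoking Maven.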

What do these params mean?

  • -Dmaven.wagon.http.ssl.insecure=true – relax the SSL check so that user-generated (e.g., self-signed) certificates are accepted.
  • -Dmaven.wagon.http.ssl.allowall=true – accept the server’s X.509 certificate without matching it against the hostname. When left at its default (false), a browser-like hostname check is used.
  • -Dmaven.wagon.http.ssl.ignore.validity.dates=true – ignore certificate validity dates (expired or not-yet-valid certificates).

 

Official documentation can be found here.

NOTE: The above method is purely intended for developers who constantly keep changing their machines and don’t wish to set up the SSL certificate on every machine. This is for those of you who are lazy! 🙂

Right way of setting up SSL certificates:

Get the valid SSL certificate file from certificate distributor/admin and place it in your machine.

Import the certificate into your Java environment:

keytool -import -keystore /path/to/java_home/lib/security/cacerts -alias mavenrepo -file /path/to/cer/file
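
To double-check that the import worked, you can list the entry back out of the keystore. A small sketch follows; the function name and the KEYTOOL_BIN / STOREPASS variables are hypothetical helpers of mine, and changeit is the default cacerts password, which your admin may have changed.

```shell
# Sketch: confirm that the alias now exists in the keystore.
# verify_cert_import, KEYTOOL_BIN and STOREPASS are names made up for this example.
verify_cert_import() {
    keystore="$1"                  # e.g. /path/to/java_home/lib/security/cacerts
    alias_name="${2:-mavenrepo}"   # alias used in the import command above
    "${KEYTOOL_BIN:-keytool}" -list \
        -keystore "$keystore" \
        -storepass "${STOREPASS:-changeit}" \
        -alias "$alias_name"
}
```

Call it as verify_cert_import /path/to/java_home/lib/security/cacerts; keytool prints the certificate entry if the alias is present and an error otherwise.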

You’re now good to go.

NOTE: This is the sane way of setting up the SSL certificate check for your self-hosted Maven repository.

Mirror/Clone existing websites from the internet

With the Unix tool wget, you can download the contents of a whole website and keep a local copy.


wget --mirror -p --convert-links -P ./LOCAL-DIR WEBSITE-URL

Arguments:

  • --mirror : turn on options suitable for mirroring.
  • -p : download all files that are necessary to properly display a given HTML page.
  • --convert-links : after the download, convert the links in document for local viewing.
  • -P ./LOCAL-DIR : save all the files and directories to the specified directory.
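
The same command reads more clearly with long option names, so here is a sketch of it wrapped in a function. Two things are my additions and not part of the original command: --no-parent (stay below the starting URL) and --wait=1 (pause a second between requests). The function name and the WGET_BIN override are hypothetical too.

```shell
# Sketch: the mirror command with long options.
# mirror_site and WGET_BIN are names made up for this example.
# --no-parent and --wait=1 are extras that the original command does not use.
mirror_site() {
    url="$1"                      # WEBSITE-URL
    dest="${2:-./LOCAL-DIR}"      # target directory
    "${WGET_BIN:-wget}" \
        --mirror \
        --page-requisites \
        --convert-links \
        --directory-prefix="$dest" \
        --no-parent \
        --wait=1 \
        "$url"
}
```

For a dry run that only prints the arguments, use WGET_BIN=echo mirror_site WEBSITE-URL ./LOCAL-DIR.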

Note:

Cloning, reusing and copying content from websites may be illegal.

Using Git

So, things around the world are changing rapidly. We’re in an era where almost everyone has a smartphone. People talk to each other more online and less offline (it’s their fingers that talk, not their tongues). This has hugely impacted the software industry. Software development and computing are shooting for the sky. More mobile phone variants, more computing platforms, more apps, and yet it still feels incomplete! There’s lots of competition, money spilling and grabbing, and everyone wants a piece of it. It seems like people are running faster and faster, day by day, and as they do so, the world too seems to be rotating faster to keep up the pace.

Almost every software organisation now follows Agile software development. Where there’s agility, there’s cursing, blaming and hatred. People work in groups to build a piece of software, with a bunch of people collaborating on the same codebase. As they do so, one person may mess up work that someone else in the group has done; this repeats several times, and then a cold war (or even worse) emerges within the group. This is where we switch to better code versioning.

Git is a version control system which lets every person in a team have their own local copy of the code. So when one person changes the code, it doesn’t affect the others. Each person can keep working on their local copy smoothly, and when everyone’s code has to be put together, they all sit together, discuss and merge their code peacefully.

git init
echo "test" > hello.js
git add .
git commit -m "initial commit"
git push   # works only once a remote has been configured

Navigate to your project directory and initialize it for git:

cd /path/to/project/directory
git init

Now your project directory has been initialized, which means, you can now perform git operations on this directory.

If the project directory is empty and you haven’t added any code yet, add some files first. If you have already added your project files, you can now add each of them for versioning:

 git add . 

By doing this, you’re telling git to add all the files of the directory for versioning. Git can then maintain revisions of each file.
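
To see what git add actually does, here is a throwaway session you can try in a scratch directory. It is only a sketch: the temporary directory, the hello.js file and the example user name and email are placeholders.

```shell
#!/bin/sh
# Sketch: init, stage and commit in a scratch repository.
repo="$(mktemp -d)"
cd "$repo" || exit 1

git init --quiet
echo "test" > hello.js

git add .              # stage every file in the directory
git status --short     # staged files are listed with an "A" marker

# Identity is passed inline so the example does not depend on your global config.
git -c user.name="Example User" -c user.email="user@example.com" \
    commit --quiet -m "initial commit"

git log --oneline      # one line per commit; there is exactly one so far
```

After the commit, git status reports a clean working tree, and any later change to hello.js shows up as a modification that git can track.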

Desktop Background Slideshow in Windows XP

You might have used Windows 7’s Desktop Background Slideshow feature. By selecting a set of images, you can have them shown as a slideshow on the desktop, i.e., the background image changes at regular intervals.
For those of you who are still running older Windows versions like Windows XP, there is no Slideshow feature. If you’d like dynamic, ever-cycling wallpapers, here’s what you have to do:
Step 1: Create a directory named Autobg in c:\ (Drive C) and add some pictures to it.
Note that the images must be in bitmap format (.bmp extension), since that is the format Windows XP handles for desktop backgrounds.

Step 2: Now, you just have to write a small program, in any programming language, that creates the delay between images. The easiest and most familiar choice is C. Compile the following C code with a C compiler to generate an executable file (.exe) that can run on Windows. Here’s the code:

#include <windows.h>

int main(void)
{
    Sleep(5000); /* Delay of 5 seconds (Sleep takes milliseconds). You can give any desired value. */
    return 0;
}

Save the code file as sleep.c and compile it. This generates the executable file sleep.exe. Add this file to the directory you created in Step 1, i.e., c:\Autobg\

Step 3: Now, open a text editor and type the following Script:

@ECHO OFF
CLS
cd c:\Autobg
dir /B /O *.bmp > c:\Autobg\pics.txt
:loop
FOR /F "eol=;" %%i in (pics.txt) do (
sleep
REG ADD "HKCU\Control Panel\Desktop" /v Wallpaper /t REG_SZ /d "c:\Autobg\%%i" /f >NUL
rundll32.exe user32.dll,UpdatePerUserSystemParameters
)
goto loop

Save the script file as AutoBGChange.bat
You have to convert the .bat file to an .exe file.
Download the Bat To Exe Converter tool.
Run Bat_To_Exe_Converter.exe from the downloaded archive.
In the Browse file field, browse and add the batch file AutoBGChange.bat
In the Save as field, type c:\Autobg\AutoBGChange.exe
Change the Visibility option to Invisible application and press the Compile button to generate the .exe file in the c:\Autobg\ directory.

Step 4: Open the text editor again and type the following Script:

REG ADD "HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Run" /v AutoBG /t REG_SZ /d "c:\Autobg\AutoBGChange.exe"  /f >NUL
c:\Autobg\AutoBGChange.exe
exit

Save the script file as Run.bat
Convert this .bat file to an .exe file as you did in the previous step, and save the result as Run.exe in the c:\Autobg\ directory.

Step 5: Open command prompt and type c:\Autobg\Run.exe

Boom! It’s done. Now you can see the desktop background change every 5 seconds! If you want to add other images to the slideshow, convert them to .bmp format and add them to the c:\Autobg\ directory.