Installing Spark

Exploring the Spark Install

Now that you have Spark up and running, let’s take a closer look at the install and its various components.

If you followed the instructions in the previous section, “Installing Spark in Standalone Mode,” you should be able to browse the contents of $SPARK_HOME.
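For example, listing the top level of the install should show the subdirectories described in Table 3.1. The following is a minimal sketch: the /opt/spark path is only an assumed example location, and your install may already have set SPARK_HOME for you.

export SPARK_HOME=/opt/spark   # assumed path; substitute your actual install location
ls $SPARK_HOME
# You should see the directories covered in Table 3.1, such as bin, conf, data,
# ec2, examples, lib, licenses, python, R, and sbin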

In Table 3.1, I describe each subdirectory of the Spark installation; a short example showing how the bin, conf, lib, and sbin contents are used together follows the table.

TABLE 3.1 Spark Installation Subdirectories

Directory   Description

bin         Contains all of the commands/scripts to run Spark applications
            interactively through shell programs such as pyspark, spark-shell,
            spark-sql, and sparkR, or in batch mode using spark-submit.

conf        Contains templates for Spark configuration files, which can be used
            to set Spark environment variables (spark-env.sh) or set default
            master, slave, or client configuration parameters
            (spark-defaults.conf). There are also configuration templates to
            control logging (log4j.properties) and metrics collection
            (metrics.properties), as well as a template for the slaves file,
            which controls which slave nodes can join the Spark cluster.

ec2         Contains scripts to deploy Spark nodes and clusters on Amazon Web
            Services (AWS) Elastic Compute Cloud (EC2). I will cover deploying
            Spark in EC2 in Hour 5, “Deploying Spark in the Cloud.”

lib         Contains the main assemblies for Spark, including the main library
            (spark-assembly-x.x.x-hadoopx.x.x.jar) and the included example
            programs (spark-examples-x.x.x-hadoopx.x.x.jar), one of which,
            SparkPi, we have already run to verify the installation in the
            previous section.

licenses    Includes license files covering other included projects such as
            Scala and JQuery. These files are for legal compliance purposes
            only and are not required to run Spark.

python      Contains all of the Python libraries required to run PySpark. You
            will generally not need to access these files directly.

sbin        Contains administrative scripts to start and stop master and slave
            services (locally or remotely), as well as start processes related
            to YARN and Mesos. I used the start-master.sh and start-slave.sh
            scripts when I covered how to install a multi-node cluster in the
            previous section.

data        Contains sample data sets used for testing mllib (which we will
            discuss in more detail in Hour 16, “Machine Learning with Spark”).

examples    Contains the source code for all of the examples included in
            lib/spark-examples-x.x.x-hadoopx.x.x.jar. Example programs are
            included in Java, Python, R, and Scala. You can also find the
            latest code for the included examples at
            https://github.com/apache/spark/tree/master/examples.

R           Contains the SparkR package and associated libraries and
            documentation. I will discuss SparkR in Hour 15, “Getting Started
            with Spark and R.”
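To see how a few of these directories are used together, here is a minimal sketch. It assumes a local standalone install with SPARK_HOME set as shown earlier; the exact examples jar filename includes your Spark and Hadoop versions, and the master URL assumes the standalone master's default port of 7077.

# Activate a configuration template from conf/ by copying it into place
cp $SPARK_HOME/conf/spark-env.sh.template $SPARK_HOME/conf/spark-env.sh

# Use spark-submit from bin/ to run the SparkPi example packaged in lib/
# (the final argument, 10, is the number of slices SparkPi should use)
$SPARK_HOME/bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[2] \
  $SPARK_HOME/lib/spark-examples-*.jar 10

# Start and then stop a standalone master and slave using the scripts in sbin/
$SPARK_HOME/sbin/start-master.sh
$SPARK_HOME/sbin/start-slave.sh spark://localhost:7077
$SPARK_HOME/sbin/stop-slave.sh
$SPARK_HOME/sbin/stop-master.sh

Once the master and slave are running, the same spark-submit invocation can target the standalone cluster by pointing --master at spark://localhost:7077 instead of local[2].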
