Friday, August 26, 2016

An Apache Spark pyspark setup script, incl. virtualenv

Here is a little script that I used to get pyspark running on our cluster. Why is this necessary? If you want to use the ML libraries within Apache Spark from the Python API, you need Python 2.7. However, if your cluster runs on CentOS, it comes with Python 2.6, because system tools such as yum depend on it. DO NOT REMOVE IT. Otherwise bad things will happen.

Instead, it is best practice to set up a separate Python 2.7 installation. To be completely isolated, create a virtualenv on top of it and install all packages you are going to use with pyspark into that virtualenv.

Also, if you plan to run pyspark within Zeppelin, you have to be sure that the virtualenv is accessible to the zeppelin user. This is why I install the whole thing in /etc. Make sure to run the script on all cluster nodes as well, otherwise the Spark executors cannot launch their local Python processes.
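One way to get the script onto every node is a small ssh loop. This is only a sketch: the hosts file, the script name setup-spark-python.sh, the helper name run_on_nodes and passwordless root ssh are all assumptions of mine, not part of the setup itself. Setting DRY_RUN=1 only prints the commands, which is handy for checking the host list first.

```shell
#!/bin/bash
# Sketch: copy the setup script to each node and run it there.
# Assumptions (hypothetical, adjust to your cluster): the hosts file holds
# one hostname per line, and root can ssh to every node without a password.

run_on_nodes() {
    local hosts_file="$1"
    local script="$2"
    local name host
    name=$(basename "$script")

    # Read hosts on fd 3 so ssh keeps the terminal as stdin (yum may prompt).
    while read -r host <&3 || [ -n "$host" ]; do
        # Skip blank lines and comment lines.
        case "$host" in ''|\#*) continue ;; esac

        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "scp ${script} root@${host}:/tmp/"
            echo "ssh -t root@${host} bash /tmp/${name}"
        else
            scp "$script" "root@${host}:/tmp/" \
                && ssh -t "root@${host}" "bash /tmp/${name}" \
                || echo "setup failed on ${host}" >&2
        fi
    done 3< "$hosts_file"
}

# Usage (dry run first, then for real):
# DRY_RUN=1 run_on_nodes nodes.txt setup-spark-python.sh
# run_on_nodes nodes.txt setup-spark-python.sh
```

Copying the script and running it with a terminal attached, rather than piping it into ssh, keeps interactive prompts from yum usable.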

# run as root

# info on python2.7 req's here:
# info on installing python for spark:
# info on python on local environment

#install needed system libraries
yum groupinstall "Development tools"
yum install zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel tk-devel gdbm-devel db4-devel libpcap-devel xz-devel

#setup local python2.7 installation
mkdir -p /etc/spark-python/python
cd /etc/spark-python/python
# assumes Python-2.7.9.tgz has already been downloaded into this directory
tar -zxvf Python-2.7.9.tgz
cd Python-2.7.9

# configure first: it generates the Makefile that make needs
./configure --prefix=/etc/spark-python/.localpython
make
make install

#setup local virtualenv installation
cd /etc/spark-python/python

# assumes virtualenv-15.0.3.tar.gz has already been downloaded into this directory
tar -zxvf virtualenv-15.0.3.tar.gz
cd virtualenv-15.0.3/
/etc/spark-python/.localpython/bin/python setup.py install

cd /etc/spark-python
/etc/spark-python/.localpython/bin/virtualenv spark-venv-py2.7 --python=/etc/spark-python/.localpython/bin/python2.7

#activate venv
cd /etc/spark-python/spark-venv-py2.7/bin
source ./activate

#pip install packages of your choice
/etc/spark-python/spark-venv-py2.7/bin/pip install --upgrade pip
/etc/spark-python/spark-venv-py2.7/bin/pip install py4j
/etc/spark-python/spark-venv-py2.7/bin/pip install numpy
/etc/spark-python/spark-venv-py2.7/bin/pip install scipy
/etc/spark-python/spark-venv-py2.7/bin/pip install scikit-learn
/etc/spark-python/spark-venv-py2.7/bin/pip install pandas

After you have done this, make sure to set the variable PYSPARK_PYTHON in /etc/ to the path of the new binary, in this case /etc/spark-python/spark-venv-py2.7/bin/python.
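Before relying on the new interpreter, it can be worth checking that it starts and that the packages import. The following is a rough sketch, not a definitive procedure: check_venv is a helper name I made up for this example, and which environment file the export line belongs in depends on your distribution (spark-env.sh is the usual candidate, but that is an assumption here). Note that scikit-learn is imported as sklearn.

```shell
#!/bin/bash
# Sketch: sanity-check the virtualenv interpreter before pointing Spark at it.
# check_venv is a hypothetical helper, not part of Spark or virtualenv.

check_venv() {
    local py="$1"; shift
    local mod failed=0

    if [ ! -x "$py" ]; then
        echo "not an executable: $py" >&2
        return 1
    fi

    "$py" --version 2>&1    # Python 2.7 writes its version to stderr

    # Try to import each module that was passed as an argument.
    for mod in "$@"; do
        if "$py" -c "import ${mod}" 2>/dev/null; then
            echo "ok: ${mod}"
        else
            echo "MISSING: ${mod}"
            failed=1
        fi
    done
    return "$failed"
}

# The setting itself, using the interpreter path from this post:
export PYSPARK_PYTHON=/etc/spark-python/spark-venv-py2.7/bin/python

# Usage:
# check_venv "$PYSPARK_PYTHON" py4j numpy scipy sklearn pandas
```

The function returns a non-zero status if the interpreter is missing or any import fails, so it can also be used in the same loop that rolls the setup out to the other nodes.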

Also, if you use Zeppelin, make sure to set the correct Python path in the interpreter settings. Simply alter or add the property zeppelin.pyspark.python and set its value to the Python binary as above.

Tags: Apache Spark, Python, pyspark, Apache Zeppelin, Ambari, Hortonworks HDP
