Here is a little script that I employed to get pyspark running on our cluster. Why is this necessary? Well, if you want to use the ML libraries within Apache Spark from the Python API, you need Python 2.7. However, in case your cluster runs on CentOS, it comes with Python 2.6 due to dependencies. DO NOT REMOVE IT. Otherwise bad things will happen.
Instead, it's best practice to have a separate Python 2.7 installation. And to be completely isolated, best practice is to create a virtualenv, which you will use to install all packages you are going to use with pyspark.
Also, if you plan to run pyspark within Zeppelin, you have to be sure that the virtualenv is accessible to user Zeppelin. This is why I install the whole thing in /etc. Also, make sure to run this on all cluster nodes, otherwise Spark executors cannot launch the local Python processes.
# run as root
# info on python2.7 req's here: http://toomuchdata.com/2014/02/16/how-to-install-python-on-centos/
# info on installing python for spark: http://blog.cloudera.com/blog/2015/09/how-to-prepare-your-apache-hadoop-cluster-for-pyspark-jobs/
# info on python on local environment http://stackoverflow.com/questions/5506110/is-it-possible-to-install-another-version-of-python-to-virtualenv
After you did this, make sure to set variable PYSPARK_PYTHON in /etc/spark-env.sh to the path of the new binary, in this case /etc/spark-python/spark-venv-py2.7/bin/python
Also, if you use Zeppelin make sure to set the correct python path in interpreter settings. Simply alter/add property zeppelin.pyspark.python and set it's value to the python binary as above.