
Spark context manager

A context manager to communicate with PySpark.

Spark is a data processing and cluster computing framework that lets developers build parallel applications in Scala, Java, and Python.

For Python developers, Spark exposes its programming model through a Python API (PySpark). To communicate with Spark from a Python application, say a Django project, we need a SparkContext for the duration of that communication.

The first step is to check that SPARK_HOME is set; if it isn't, set it to the root of your Spark installation.
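A quick way to verify this from Python before going any further (a minimal sketch; the error message is just illustrative) is:

import os

if not os.environ.get('SPARK_HOME'):
    raise RuntimeError('SPARK_HOME is not set; point it at your Spark installation')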

If you are using a virtualenv for your Python project, you need to add Spark to this virtualenv:

add2virtualenv path-to-spark/python
add2virtualenv path-to-spark/python/lib/py4j-0.8.2.1-src.zip

This will add Spark to your virtualenv's Python path. You can check that _virtualenv_path_extensions.pth contains these new entries:

path-to-virtualenv/lib/python2.7/site-packages/_virtualenv_path_extensions.pth
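Another quick sanity check, assuming the virtualenv is activated, is to import pyspark and confirm it resolves to your Spark installation:

import pyspark

print(pyspark.__file__)  # should point into path-to-spark/python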

Now we are ready to add a Spark context manager:

import contextlib

from pyspark import SparkContext, SparkConf

SPARK_MASTER = 'local'
SPARK_APP_NAME = 'app'


@contextlib.contextmanager
def spark_manager(spark_master=SPARK_MASTER,
                  spark_app_name=SPARK_APP_NAME,
                  spark_executor_memory=None):
    """Yield a SparkContext configured with the given master, app name
    and, optionally, executor memory; stop it when the block exits."""
    conf = SparkConf().setMaster(spark_master)
    conf.setAppName(spark_app_name)

    if spark_executor_memory:
        conf.set("spark.executor.memory", spark_executor_memory)

    spark_context = SparkContext(conf=conf)

    try:
        yield spark_context
    finally:
        # Always stop the context, even if the job raised an exception.
        spark_context.stop()

You can now use this context manager for any Spark-related job; you only need to wrap the job in a with statement.

with spark_manager() as context:
    numbers = context.parallelize([1, 3, 4, 5]).collect()
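The optional arguments work the same way; for instance, a job that sets the app name and executor memory (the values here are just illustrative) could look like this:

with spark_manager(spark_app_name='even-count',
                   spark_executor_memory='2g') as context:
    evens = context.parallelize(range(100)).filter(lambda x: x % 2 == 0).count()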