Setting up Pubmed Parser with PySpark
Below is a small snippet to set up Spark 2.1 in a Jupyter Notebook.
PySpark can be used in a workflow that parses MEDLINE XML data into a Spark DataFrame.
With multiple processor cores, PySpark can reduce the time to parse more than 25 million documents to less than 10 minutes.
Note that the spark_home path to your downloaded Spark installation might be different.
import os
import findspark
findspark.init(spark_home="/opt/spark-2.1.0-bin-cdh5.9.0/")
In Spark 2.1, the spark session created below gives access to a sparkContext (for parallelize) and provides createDataFrame directly; see the example after the configuration snippet.
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf

# Spark configuration: adjust the Python paths and memory settings to your environment
conf = SparkConf().\
    setAppName('map').\
    setMaster('local[5]').\
    set('spark.yarn.appMasterEnv.PYSPARK_PYTHON', '~/anaconda3/bin/python').\
    set('spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON', '~/anaconda3/bin/python').\
    set('spark.executor.memory', '8g').\
    set('spark.yarn.executor.memoryOverhead', '16g').\
    set('spark.sql.codegen', 'true').\
    set('spark.yarn.executor.memory', '16g').\
    set('yarn.scheduler.minimum-allocation-mb', '500m').\
    set('spark.dynamicAllocation.maxExecutors', '3').\
    set('spark.driver.maxResultSize', '0')

# create the Spark session from the configuration above
spark = SparkSession.builder.\
    appName("testing").\
    config(conf=conf).\
    getOrCreate()
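With the session in place, below is a minimal sketch of parsing MEDLINE XML into a Spark DataFrame. The input directory, the per-file partitioning, and the Parquet output path are placeholders to adapt to your own setup; we assume the downloaded MEDLINE files are gzipped XML that pubmed_parser's parse_medline_xml function can read.

import glob
import pubmed_parser as pp
from pyspark.sql import Row

# hypothetical location of the downloaded MEDLINE/PubMed XML files
medline_files = glob.glob('/path/to/medline/*.xml.gz')

# distribute the file paths, parse each file into a list of article dictionaries,
# and flatten the results into a single RDD of records
medline_rdd = spark.sparkContext.parallelize(medline_files, numSlices=len(medline_files)).\
    flatMap(pp.parse_medline_xml).\
    map(lambda article: Row(**article))

# let Spark infer the schema and save the parsed articles as Parquet (placeholder path)
medline_df = spark.createDataFrame(medline_rdd)
medline_df.write.parquet('parsed_medline.parquet', mode='overwrite')

Parsing each file in its own partition is what allows the multi-core speedup mentioned above, and writing to Parquet keeps the parsed records in a columnar format that is convenient for downstream Spark SQL queries.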
Please see the full implementation details in the scripts folder of the repository.
We will soon update the documentation on how to use Pubmed Parser with dask.