Stanford Core NLP
02 Mar 2016

I would like to use Stanford CoreNLP (on an EC2 Ubuntu instance) for several of my text-preprocessing tasks, which include CoreNLP itself, the Named Entity Recognizer (NER), and Open IE. Basically, I want to set up a server and be able to query it from Python easily.
I haven’t finished the whole installation process yet. However, I want to put everything in one place so I can come back later to update this post.
First, CoreNLP requires Java 1.8 or later, which we can install as follows:
sudo add-apt-repository ppa:webupd8team/java -y
sudo apt-get update
sudo apt-get install oracle-java8-installer
sudo apt-get install oracle-java8-set-default
Then I add export JAVA_HOME=/usr/lib/jvm/java-8-oracle to my .bash_profile. Now, when I type java -version, it should report version 1.8.
Now, we just have to download the CoreNLP, Named Entity Recognizer (NER), and Open IE .jar files. I simply put every jar file in one place.
Now, I have to start the server (the full documentation is at this page) as an API so I can query it later on. I use screen to run the server so that it keeps running even after I exit from my EC2 instance.
screen # start screen
export CLASSPATH="`find . -name '*.jar'`"
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer [port?] # run server
Then, inside screen, I can press ctrl+a followed by d to detach from the screen session (simply use screen -r <session_name> to reattach to a running session, and screen -ls to list the sessions).
Then, after starting the server, I can simply go to <ec2_ip>:9000 to test that it is running.
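The server is just an HTTP API, so we can also query it directly without any wrapper library. Below is a minimal sketch using Python's requests package; the endpoint format (text in the POST body, annotation properties as a JSON string in the properties URL parameter) follows the CoreNLP server documentation, while the host, port, and example text are placeholders for your own setup.

import json
import requests

# Replace with your own <ec2_ip>; 9000 is the default CoreNLP server port.
CORENLP_URL = 'http://localhost:9000'

text = 'Stanford University is located in California.'
props = {'annotators': 'ner', 'outputFormat': 'json'}

# The server expects the raw text as the POST body and the annotation
# properties as a JSON string in the "properties" URL parameter.
response = requests.post(CORENLP_URL,
                         params={'properties': json.dumps(props)},
                         data=text.encode('utf-8'))
output = response.json()
print(output['sentences'][0]['tokens'][0]['word'])  # first token of the first sentence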
Now, we have to just find the right Python wrapper for CoreNLP. There are a bunch including dasmith/stanford-corenlp-python, smilli/py-corenlp, dat/pyner and more.
In my opinion, smilli/py-corenlp is one of the easiest Python libraries to use. You can install it with pip install pycorenlp. An example usage is below:
from pycorenlp import StanfordCoreNLP

# Connect to the CoreNLP server started earlier
nlp = StanfordCoreNLP('http://localhost:9000')
output = nlp.annotate(
    'Department of Radiation Oncology, Stanford University, Aviano, PN, Italy',
    properties={
        'annotators': 'ner',
        'outputFormat': 'json'
    }
)
We can find all supported annotators from this page or this page. We can specify multiple annotators by separating them with commas, e.g. ner,openie. The output format can be json, xml, text, or serialized (see more on the CoreNLP server page).
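To make the JSON output a bit more concrete, here is a small sketch of how it can be walked to pull out NER tags and Open IE triples. The field names used below (sentences, tokens, word, ner, openie, subject, relation, object) match the CoreNLP JSON output as I understand it, but it is worth printing output once for your own version to confirm the structure.

# Annotate with both NER and Open IE (assuming the server from above is running)
output = nlp.annotate('Barack Obama was born in Hawaii.',
                      properties={'annotators': 'ner,openie',
                                  'outputFormat': 'json'})

for sentence in output['sentences']:
    # Each token carries its NER label (e.g. PERSON, LOCATION, or O)
    for token in sentence['tokens']:
        print(token['word'], token['ner'])
    # Open IE triples extracted from the sentence
    for triple in sentence.get('openie', []):
        print(triple['subject'], triple['relation'], triple['object'])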
So yeah, in the end, it’s not that hard to run a Stanford CoreNLP server and annotate some text that we have!
I did something similar to what I did on Amazon EC2; however, the process is less complicated. I just have to download the files from the CoreNLP page, then run the server the same way as on the Amazon EC2 Ubuntu instance.
screen # start screen
export CLASSPATH="`find . -name '*.jar'`"
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer [port?] # run server
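As a quick sanity check that the local server is up, the same pycorenlp call from before works, just pointed at localhost (port 9000 here is an assumption; use whichever port you passed to the server).

from pycorenlp import StanfordCoreNLP

# 9000 is the default port; change it if the server was started on another one
nlp = StanfordCoreNLP('http://localhost:9000')
output = nlp.annotate('Stanford CoreNLP is running locally.',
                      properties={'annotators': 'ner', 'outputFormat': 'json'})
print(output['sentences'][0]['tokens'][0]['ner'])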