Using pdffigure2 to extract images from publications, 06 Nov 2016

So Allen Institute for Artificial Intelligence has a really useful tool to extract images from the paper called allenai/pdffigures2. I found it really handy to use. Basically, we just need to clone the repository

git clone https://github.com/allenai/pdffigures2

And then we can use sbt as main command to extract images. We can run the following line in pdffigures2 directly.

sbt "run-main org.allenai.pdffigures2.FigureExtractorBatchCli /path/to/pdf_directory/ -s stat_file.json -m /figure/image/output/prefix -d /figure/data/output/prefix"

Now, we can also create simple Python client in order to use these function too. We can call these function directly from the directory that we clone the repository.

import os
from glob import glob
import subprocess

def list_pdf(directory):
    """
    List all pdf file path from given directory
    """
    return glob(os.path.join(directory, '*.pdf'))

def pdffigure2(input_path, output_path):
    """
    Python client for pdffigure2
    """
    input_path = os.path.expanduser(input_path)
    output_path = os.path.expanduser(output_path)
    if not os.path.isdir(output_path):
        os.mkdir(output_path)
    command = ' '.join(['run-main', 'org.allenai.pdffigures2.FigureExtractorBatchCli', input_path, '-m', output_path, '-d', output_path])
    sbt_command = ' '.join(['sbt', '"' + command + '"'])
    output, error = subprocess.Popen([sbt_command],
                                      shell=True, universal_newlines=True).communicate()
    print('finish extracting %s to %s' % (input_path, output_path))

The usage is really simple, just list all pdf using glob and then run the sbt sequentially to extract all images from pdf files.

pdf_paths = list_pdf('~/Desktop/pdf/')
for pdf_path in pdf_paths:
    pdffigure2(pdf_path, '~/Desktop/output/')

or just provide path to pdf folder, in this case, it’s simply

pdffigure2('~/Desktop/pdf/', '~/Desktop/output/')