Using pdffigures2 to extract images from publications
06 Nov 2016

The Allen Institute for Artificial Intelligence has a really useful tool for extracting images from papers, called allenai/pdffigures2. I found it really handy to use. Basically, we just need to clone the repository:
git clone https://github.com/allenai/pdffigures2
Then we can use sbt as the main command to extract images. We can run the following line directly inside the pdffigures2 directory.
sbt "run-main org.allenai.pdffigures2.FigureExtractorBatchCli /path/to/pdf_directory/ -s stat_file.json -m /figure/image/output/prefix -d /figure/data/output/prefix"
We can also create a simple Python client to wrap this command. The functions below have to be called from the directory where we cloned the repository.
import os
import subprocess
from glob import glob


def list_pdf(directory):
    """
    List all PDF file paths in the given directory
    """
    # glob does not expand "~", so expand it first
    directory = os.path.expanduser(directory)
    return glob(os.path.join(directory, '*.pdf'))


def pdffigure2(input_path, output_path):
    """
    Python client for pdffigures2 (run from inside the cloned repository)
    """
    input_path = os.path.expanduser(input_path)
    output_path = os.path.expanduser(output_path)
    if not os.path.isdir(output_path):
        # makedirs also creates missing parent directories
        os.makedirs(output_path)
    command = ' '.join(['run-main', 'org.allenai.pdffigures2.FigureExtractorBatchCli',
                        input_path, '-m', output_path, '-d', output_path])
    # sbt expects the whole "run-main ..." string as one quoted argument
    sbt_command = ' '.join(['sbt', '"' + command + '"'])
    subprocess.call(sbt_command, shell=True)
    print('finish extracting %s to %s' % (input_path, output_path))
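Because the client above only works when it is called from inside the cloned repository, it may be convenient to pass the repository location explicitly. The following is a minimal sketch of that idea, not part of pdffigures2 itself: the names `build_sbt_call`, `run_pdffigures2` and `repo_dir` are hypothetical. It also assumes that, since `-m` and `-d` are described as output prefixes, a trailing path separator is needed for the files to land inside the output directory.

```python
import os
import subprocess


def build_sbt_call(input_path, output_path, repo_dir):
    """Return (argv, cwd) for running the batch extractor from any directory."""
    # Absolute paths are needed because sbt will run with repo_dir as cwd.
    input_path = os.path.abspath(os.path.expanduser(input_path))
    output_path = os.path.abspath(os.path.expanduser(output_path))
    repo_dir = os.path.abspath(os.path.expanduser(repo_dir))
    # Assumption: -m and -d are prefixes, so keep a trailing separator
    # to make the output files land inside the directory.
    output_prefix = os.path.join(output_path, '')
    task = ' '.join(['run-main',
                     'org.allenai.pdffigures2.FigureExtractorBatchCli',
                     input_path, '-m', output_prefix, '-d', output_prefix])
    # Passing a list avoids shell quoting: sbt receives the task as one argument.
    return ['sbt', task], repo_dir


def run_pdffigures2(input_path, output_path, repo_dir):
    """Hypothetical wrapper: run sbt inside repo_dir, whatever the current directory is."""
    argv, cwd = build_sbt_call(input_path, output_path, repo_dir)
    output_dir = os.path.abspath(os.path.expanduser(output_path))
    if not os.path.isdir(output_dir):
        os.makedirs(output_dir)
    return subprocess.call(argv, cwd=cwd)
```

For example, if the repository had been cloned to `~/pdffigures2` (an assumed location), the call would be `run_pdffigures2('~/Desktop/pdf/', '~/Desktop/output/', '~/pdffigures2')`.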
Usage is really simple: list all the PDFs using glob, then run sbt on each one in turn to extract the images from the PDF files.
pdf_paths = list_pdf('~/Desktop/pdf/')
for pdf_path in pdf_paths:
    pdffigure2(pdf_path, '~/Desktop/output/')
Alternatively, just provide the path to the PDF folder. In this case, it’s simply
pdffigure2('~/Desktop/pdf/', '~/Desktop/output/')
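Once the extraction has run, the figure data written by the `-d` option can be read back in Python. The sketch below is only an illustration under two assumptions: that each JSON file in the output directory holds a list of per-figure dictionaries, and that those dictionaries use keys such as `figType`, `name`, `page` and `caption`. The function name `load_figure_records` is hypothetical, and missing keys simply come back as `None`.

```python
import json
import os
from glob import glob


def load_figure_records(output_path):
    """Collect figure records from every JSON file in the output directory."""
    output_path = os.path.expanduser(output_path)
    records = []
    for json_path in sorted(glob(os.path.join(output_path, '*.json'))):
        with open(json_path) as f:
            data = json.load(f)
        # Assumption: a figure-data file is a list of dictionaries.
        # Anything else (e.g. a statistics file) is skipped.
        if not isinstance(data, list):
            continue
        for figure in data:
            if not isinstance(figure, dict):
                continue
            records.append({
                'source': os.path.basename(json_path),
                # The key names below are assumed, hence .get()
                'type': figure.get('figType'),
                'name': figure.get('name'),
                'page': figure.get('page'),
                'caption': figure.get('caption'),
            })
    return records
```

Calling `load_figure_records('~/Desktop/output/')` would then return one dictionary per extracted figure or table, which is easy to filter by page or type.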