Automated PDF data extraction solutions come in different flavours, ranging from simple OCR tools to enterprise-ready document processing and workflow automation platforms. Most systems, however, share a similar workflow:

1. Assemble batches of sample documents to act as training data
2. Train the system for each type of document you want to process
3. Set up a process to automatically fetch documents, process them, and dispatch the extracted data

The most advanced solutions combine several techniques to train the data extraction system. A simple method is Zonal OCR, where the user defines specific locations inside the document with a point-and-click interface and the system reads whatever text falls inside them. More advanced techniques are based on regular expressions and pattern recognition, which find a value by its shape rather than by its position on the page.
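To make the difference between the two approaches concrete, here is a minimal sketch in Python. It assumes the OCR step has already produced words with page coordinates; the `Word` and `Zone` types, the field names and the two patterns are hypothetical and chosen only for illustration, not taken from any particular product.

```python
import re
from typing import NamedTuple


class Word(NamedTuple):
    """One OCR'd word; x/y is its top-left corner, y measured from the top of the page."""
    text: str
    x: float
    y: float


class Zone(NamedTuple):
    """A rectangle the user drew on a sample document (hypothetical type)."""
    left: float
    top: float
    right: float
    bottom: float


def zonal_extract(words, zone):
    """Zonal OCR: keep the words whose corner lies inside the zone, in reading order."""
    inside = [
        w for w in words
        if zone.left <= w.x <= zone.right and zone.top <= w.y <= zone.bottom
    ]
    inside.sort(key=lambda w: (w.y, w.x))
    return " ".join(w.text for w in inside)


# Illustrative patterns only; real documents usually need one set per document type.
INVOICE_NO = re.compile(r"Invoice\s*(?:No\.?|#)\s*:?\s*([A-Z0-9-]+)", re.IGNORECASE)
TOTAL = re.compile(r"Total\s*:?\s*\$?\s*([0-9][0-9,]*\.[0-9]{2})", re.IGNORECASE)


def regex_extract(text):
    """Pattern-based extraction: find fields by their shape, wherever they appear."""
    fields = {}
    match = INVOICE_NO.search(text)
    if match:
        fields["invoice_number"] = match.group(1)
    match = TOTAL.search(text)
    if match:
        fields["total"] = float(match.group(1).replace(",", ""))
    return fields
```

The two can be combined: a zone narrows the search to one region of the page, and a pattern then pulls the value out of the text found there. Zones break when the layout shifts, whereas patterns break when the wording changes, which is one reason systems tend to use both.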

  • Layout-aware text extraction from full-text PDF of scientific articles
  • PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows one to obtain the exact location of text in a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for other purposes than text analysis.
  • Pdf2text.java
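
As an illustration of the "exact location of text" that PDFMiner exposes, the following sketch lists text blocks with their bounding boxes. It assumes the maintained fork pdfminer.six is installed (`pip install pdfminer.six`); the helper names `sort_reading_order` and `text_boxes` are made up for this example, and the reading-order sort is a simple heuristic, not something the library provides.

```python
def sort_reading_order(boxes):
    """Sort (x0, y0, x1, y1, text) tuples top to bottom, then left to right.

    PDF coordinates start at the bottom-left corner, so a larger y1
    means the box sits higher on the page.
    """
    return sorted(boxes, key=lambda box: (-box[3], box[0]))


def text_boxes(pdf_path):
    """Yield (page_number, (x0, y0, x1, y1, text)) for every text block in the PDF."""
    # Imported here so the sorting helper above works without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_number, page in enumerate(extract_pages(pdf_path), start=1):
        boxes = [
            (*element.bbox, element.get_text().strip())
            for element in page
            if isinstance(element, LTTextContainer)
        ]
        for box in sort_reading_order(boxes):
            yield page_number, box
```

Calling `text_boxes("sample.pdf")` on a file of your own (the file name is a placeholder) gives the coordinates a zonal extraction step would need. Scanned PDFs that contain only images yield no text blocks and require OCR first.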