ИнфоРост
information technologies for archives and libraries

Automated data extraction, term recognition


[2016 Feb | A&A List] 
 
Bill Underwood at the Georgia Tech Research Institute (GTRI) had been investigating Named Entity Recognition and metadata extraction for NARA. The project was called PERPOS, and its white papers and reports are available at http://perpos.gtri.gatech.edu/publications/

The tool GTRI was using for Natural Language Processing was GATE, which is an open source project (developed in Java) based at the University of Sheffield. A set of workshop materials is online at https://gate.ac.uk/wiki/TrainingCourseJune2015/, and it's possible to work through these on your own to teach yourself how to use it, though most advanced work requires some programming skills.

The closest equivalent to GATE in Python is NLTK. It doesn't have a set of workshops, but there is a good book online that you can use to teach yourself the tools. See http://www.nltk.org/ and http://www.nltk.org/book/ (don't buy the print edition until the revision is finished).
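To give a feel for what term recognition involves before committing to either toolkit, here is a minimal sketch of gazetteer lookup, the simplest form of entity spotting: match text against lists of known names. It uses only the Python standard library and is not GATE's or NLTK's API. The gazetteer entries, the labels, the sample sentence, and the function names are all made up for illustration.

```python
import re

# Hypothetical gazetteer: entity label -> known names (illustrative only).
GAZETTEER = {
    "ORGANIZATION": ["Georgia Tech Research Institute", "NARA"],
    "PERSON": ["Underwood"],
}


def build_pattern(names):
    """Compile one regex that matches any of the given names as whole words."""
    # Longest names first, so a long name wins over a shorter name it contains.
    ordered = sorted(names, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in ordered)
    return re.compile(r"\b(?:" + alternatives + r")\b")


def find_entities(text, gazetteer=GAZETTEER):
    """Return (start, end, label, matched_text) tuples sorted by position."""
    hits = []
    for label, names in gazetteer.items():
        for match in build_pattern(names).finditer(text):
            hits.append((match.start(), match.end(), label, match.group(0)))
    return sorted(hits)


sample = (
    "Underwood worked at the Georgia Tech Research Institute "
    "on a project for NARA."
)

for start, end, label, surface in find_entities(sample):
    print(f"{start:3d}-{end:3d}  {label:<12}  {surface}")
```

A lookup like this only finds names it already knows. Toolkits such as GATE and NLTK add tokenization, part-of-speech tagging, and rule-based or statistical models on top, so that names not in any list can still be recognized.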