MASS TEXT ANALYSIS
BUSINESS QUESTION
As part of a large engineering IS redesign program, a reference repository of Standards must be set up. It will enable engineers to quickly identify qualified Standards and avoid “reinventing the wheel”. The objective of the project was to extract information from 8 000 text documents describing Standards, in multiple formats (legacy/low-quality scanned PDF, structured PDF and Word).
DATA
- 8 000 text documents describing Standards, in multiple formats (legacy/low-quality scanned PDF, structured PDF and Word)
- Each document contains multiple complex tables, images and a standard designation
- A partial reference list of Standards was available to benchmark the accuracy of automatic extraction
APPROACH
- Convert all documents to XML format, through intermediate Word conversion or Optical Character Recognition
- Extract images and tables from the XML, including title when available; filter out irrelevant images such as logos
- Stack split tables through header matching and generate output files in xls format (more than 20 000 files)
- Read the designation, link each item to tables in the document and automatically compute the list of possible Standards codes
- Benchmark results against the available partial database, then iterate to improve both accuracy and coverage
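
To illustrate the table stacking step, here is a minimal Python sketch of header matching. It is not the project's actual code: the names `normalize_header` and `stack_split_tables` are hypothetical, and it assumes each extracted table is a list of rows whose first row is the header, and that a table split across pages repeats that header on each fragment.

```python
def normalize_header(row):
    """Lower-case and collapse whitespace so minor OCR spacing noise does not block a match."""
    return tuple(" ".join(cell.lower().split()) for cell in row)


def stack_split_tables(tables):
    """Merge consecutive tables that share the same (normalized) header.

    Assumption: `tables` is in document order and row 0 of each table is its header.
    """
    stacked = []
    for table in tables:
        if not table:
            continue  # skip empty extractions
        header, body = table[0], table[1:]
        if stacked and normalize_header(stacked[-1][0]) == normalize_header(header):
            # Same header as the previous table: treat it as a continuation.
            stacked[-1].extend(list(row) for row in body)
        else:
            # New header: start a new logical table (copy rows to avoid aliasing).
            stacked.append([list(header)] + [list(row) for row in body])
    return stacked
```

Only consecutive tables are merged, on the assumption that fragments of one table follow each other in the document. Real scanned documents would probably need a fuzzier comparison than exact equality after normalization.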
RESULTS
- Automatic data extraction is possible even from very unstructured documents (old scanned pdf), leveraging solely open source tools (Python and appropriate libraries)
- There are some limitations on documents with the lowest quality
- Computing a comprehensive list of Standards codes from automated designation parsing is a challenge and was achievable on only one third of the documents
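
The benchmark against the partial reference list can be sketched as follows. The source does not define how accuracy and coverage were measured, so the definitions here are assumptions: coverage is the share of reference documents for which at least one code was produced, and accuracy is the share of those documents whose extracted codes exactly match the reference. The function name `benchmark` and the dictionary layout are hypothetical.

```python
def benchmark(extracted, reference):
    """Compare extracted Standards codes with a partial reference list.

    extracted: dict mapping document id -> set of codes produced by the pipeline
    reference: dict mapping document id -> set of codes known to be correct
    Returns (coverage, accuracy), computed only over documents present in the
    reference, since the reference list is partial.
    """
    if not reference:
        raise ValueError("reference list is empty")
    # Documents from the reference for which the pipeline produced at least one code.
    attempted = [doc for doc in reference if extracted.get(doc)]
    # Among those, documents whose code set matches the reference exactly.
    correct = [doc for doc in attempted if extracted[doc] == reference[doc]]
    coverage = len(attempted) / len(reference)
    accuracy = len(correct) / len(attempted) if attempted else 0.0
    return coverage, accuracy
```

Documents absent from the reference list are ignored here, because nothing is known about their correct codes.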