Success stories

MASS TEXT ANALYSIS

BUSINESS QUESTION

As part of a large engineering information system (IS) redesign program, a reference repository of standards had to be set up so that engineers can quickly identify qualified standards and avoid “reinventing the wheel”. The objective of the project was to extract information from 8 000 text documents describing Standards, in multiple formats (legacy/low-quality scanned PDF, structured PDF and Word).

DATA

  • 8 000 text documents describing Standards, in multiple formats (legacy/low-quality scanned PDF, structured PDF and Word)
  • Each document contains multiple complex tables, images and a standard designation
  • A partial reference list of Standards was available to benchmark the accuracy of automatic extraction

APPROACH

  • Convert all documents to XML format, through intermediate Word conversion or Optical Character Recognition (OCR)
  • Extract images and tables from the XML, including title when available; filter out irrelevant images such as logos
  • Stack split tables by matching their headers, and generate output files in XLS format (more than 20 000 files)
  • Read the designation, link each item to tables in the document and automatically compute the list of possible Standards codes
  • Benchmark results against the available partial database, then iterate to improve both accuracy and coverage
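
The table-stacking step can be illustrated with a minimal sketch. It assumes each extracted table is already available as a list of rows whose first row is the header, and that two consecutive fragments belong to the same table when their headers match after whitespace and case normalization. The function names and sample data are hypothetical and are not taken from the project.

```python
from typing import List

Table = List[List[str]]  # assumption: the first row of each table is its header


def normalize_header(header: List[str]) -> tuple:
    """Collapse whitespace and lowercase cells so small OCR differences still match."""
    return tuple(" ".join(cell.split()).lower() for cell in header)


def stack_split_tables(tables: List[Table]) -> List[Table]:
    """Merge consecutive table fragments whose normalized headers are identical."""
    stacked: List[Table] = []
    for table in tables:
        if not table:
            continue  # skip empty fragments
        header, rows = table[0], table[1:]
        if stacked and normalize_header(stacked[-1][0]) == normalize_header(header):
            # Same header as the previous table: append the data rows only
            stacked[-1].extend(list(row) for row in rows)
        else:
            # New header: start a new table
            stacked.append([list(header)] + [list(row) for row in rows])
    return stacked


# Hypothetical fragments, in document order
fragments = [
    [["Size", "Length (mm)"], ["M6", "20"]],
    [["size", "Length  (mm)"], ["M8", "25"]],
    [["Material", "Grade"], ["Steel", "8.8"]],
]
merged = stack_split_tables(fragments)
```

Only consecutive fragments are merged here, so that two unrelated tables that happen to share a header elsewhere in the document are not combined; the actual matching rule used in the project is not described in the source.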

RESULTS

  • Automatic data extraction is possible even from very unstructured documents (old scanned PDFs), using only open-source tools (Python and appropriate libraries)
  • There are some limitations on the lowest-quality documents
  • Computing a comprehensive list of Standards codes from automated designation parsing is a challenge and was only achievable on one third of the documents
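
The accuracy and coverage figures used for benchmarking could be computed along the following lines. This is a sketch under stated assumptions: the source does not define either metric, so coverage is taken here as the share of reference documents for which at least one code was extracted, and accuracy as the share of those covered documents whose extracted codes exactly match the reference. Document identifiers and codes are hypothetical.

```python
from typing import Dict, Set, Tuple


def benchmark(
    extracted: Dict[str, Set[str]],
    reference: Dict[str, Set[str]],
) -> Tuple[float, float]:
    """Return (coverage, accuracy), measured only on documents in the partial reference."""
    ref_docs = list(reference)
    if not ref_docs:
        return 0.0, 0.0
    # Covered: reference documents with a non-empty set of extracted codes
    covered = [doc for doc in ref_docs if extracted.get(doc)]
    # Correct: covered documents whose extracted codes equal the reference codes
    correct = [doc for doc in covered if extracted[doc] == reference[doc]]
    coverage = len(covered) / len(ref_docs)
    accuracy = len(correct) / len(covered) if covered else 0.0
    return coverage, accuracy


# Hypothetical data: four documents in the partial reference list
reference = {
    "doc_a": {"STD-001-A", "STD-001-B"},
    "doc_b": {"STD-002-A"},
    "doc_c": {"STD-003-A"},
    "doc_d": {"STD-004-A"},
}
extracted = {
    "doc_a": {"STD-001-A", "STD-001-B"},  # exact match
    "doc_b": {"STD-002-X"},               # extracted, but wrong
    "doc_c": set(),                       # parsing produced nothing
    "doc_e": {"STD-005-A"},               # not in the reference, ignored
}
coverage, accuracy = benchmark(extracted, reference)
```

Documents absent from the partial reference are ignored, since their extracted codes cannot be checked.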