
Improve handling of older documents by OCR and AI

JabRef, a comprehensive literature management tool, currently handles both metadata and text-based PDF documents. Scanned PDFs, particularly historical articles, remain a significant limitation: their pages are stored as images, so their content cannot be searched as text. This project aims to close that gap by integrating OCR (Optical Character Recognition) technology, enabling full-text search in scanned PDFs.

Useful links:

Some aspects:

  1. Add an option to call an OCR engine from JabRef, e.g., a cloud-based service or a local installation
  2. Define a common interface to support multiple OCR engines
  3. Provide a good default set of settings for the OCR engines
  4. Support expert configuration of the settings
  5. Add the extracted text as a layer to the PDF so that Apache Lucene can index it
  6. Add an option to further process the text with Grobid for training and metadata extraction
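
Aspects 2 to 4 could be sketched roughly as follows. This is only an illustration of the idea, not existing JabRef code: every type name here (`OcrEngine`, `OcrSettings`, `OcrEngineRegistry`, `StubEngine`) is hypothetical, and the default values (`eng`, 300 dpi) are assumptions chosen for the example. Engine-specific expert options are kept in a plain string map so that the common interface does not need to know them.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: none of these types exist in JabRef.
class OcrEngineSketch {

    /** Engine-independent settings; the default values are illustrative assumptions. */
    record OcrSettings(String language, int dpi, Map<String, String> expertOptions) {
        OcrSettings {
            expertOptions = Map.copyOf(expertOptions); // defensive, immutable copy
        }

        /** A "good default" set of settings for non-expert users. */
        static OcrSettings defaults() {
            return new OcrSettings("eng", 300, Map.of());
        }

        /** Returns a copy with one engine-specific option added or overridden. */
        OcrSettings withExpertOption(String key, String value) {
            Map<String, String> merged = new LinkedHashMap<>(expertOptions);
            merged.put(key, value);
            return new OcrSettings(language, dpi, merged);
        }
    }

    record OcrResult(String text) {}

    /** Common interface that cloud-based and local engines would both implement. */
    interface OcrEngine {
        String name();

        boolean isAvailable();

        OcrResult recognize(Path scannedPdf, OcrSettings settings) throws IOException;
    }

    /** Keeps engines in registration order and picks the first usable one. */
    static final class OcrEngineRegistry {
        private final Map<String, OcrEngine> engines = new LinkedHashMap<>();

        void register(OcrEngine engine) {
            engines.put(engine.name(), engine);
        }

        Optional<OcrEngine> firstAvailable() {
            return engines.values().stream().filter(OcrEngine::isAvailable).findFirst();
        }
    }

    /** Stand-in engine for the demo; a real engine would call a service or a local binary. */
    record StubEngine(String name, boolean available) implements OcrEngine {
        @Override
        public boolean isAvailable() {
            return available;
        }

        @Override
        public OcrResult recognize(Path scannedPdf, OcrSettings settings) {
            return new OcrResult("stub text for " + scannedPdf.getFileName());
        }
    }

    public static void main(String[] args) throws IOException {
        OcrEngineRegistry registry = new OcrEngineRegistry();
        registry.register(new StubEngine("cloud", false));
        registry.register(new StubEngine("local", true));

        // Expert users override single options; everyone else keeps the defaults.
        OcrSettings settings = OcrSettings.defaults().withExpertOption("psm", "6");
        OcrEngine engine = registry.firstAvailable().orElseThrow();
        System.out.println(engine.name() + ": " + engine.recognize(Path.of("scan.pdf"), settings).text());
    }
}
```

Keeping the settings immutable means a default configuration can be shared safely, while each expert override produces a new object.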

Expected outcome:

A) Develop a common interface within JabRef to accommodate multiple OCR engines, ensuring flexibility and expandability.
B) Enable expert users to fine-tune OCR settings, catering to specific needs or document formats.
C) Incorporate the OCR-extracted text as a searchable layer in PDFs, allowing Apache Lucene to index and search the content.
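
For outcome C, one possible route with a local install is the Tesseract command line, which can write a PDF with an invisible text layer when the `pdf` output config is given. The sketch below only assembles and optionally runs such a command; it assumes Tesseract is on the `PATH` and that the scanned page has already been rendered to an image, since Tesseract reads images rather than PDFs. The class and method names are hypothetical, and whether JabRef would use Tesseract at all is an open design question.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of JabRef.
class TesseractCommandSketch {

    /**
     * Builds: tesseract IMAGE OUTPUTBASE -l LANG --dpi DPI --psm MODE pdf
     * Tesseract appends ".pdf" to OUTPUTBASE itself.
     */
    static List<String> buildCommand(Path pageImage, Path outputBase,
                                     String language, int dpi, int pageSegMode) {
        List<String> command = new ArrayList<>();
        command.add("tesseract");
        command.add(pageImage.toString());
        command.add(outputBase.toString());
        command.add("-l");
        command.add(language);
        command.add("--dpi");
        command.add(Integer.toString(dpi));
        command.add("--psm");
        command.add(Integer.toString(pageSegMode));
        command.add("pdf"); // output config: searchable PDF with a text layer
        return command;
    }

    /** Runs the command and returns its exit code; requires a local Tesseract install. */
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }

    public static void main(String[] args) {
        List<String> command = buildCommand(
                Path.of("page-001.png"), Path.of("page-001-ocr"), "eng", 300, 3);
        // Only print the command here so the example works without Tesseract installed.
        System.out.println(String.join(" ", command));
        // prints: tesseract page-001.png page-001-ocr -l eng --dpi 300 --psm 3 pdf
    }
}
```

Passing the arguments as a list to `ProcessBuilder` avoids shell quoting problems with file names that contain spaces.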

Skills required:

  • Proficiency in Java programming.
  • A keen interest in document processing and AI technologies.

Possible mentors:

@Siedlerchr, @InAnYan, @calixtus, @subhramit

Project size:

90h (small)

This post is licensed under CC BY 4.0 by the author.