Improve handling of older documents by OCR and AI
JabRef, a comprehensive literature management software, currently handles both metadata and text-based PDF documents. However, scanned PDFs, particularly historical articles, are image-based and therefore not text-searchable. This project aims to close this gap by integrating OCR (Optical Character Recognition) technology, enabling full-text search in scanned PDFs.
Useful links:
- OCR Integration in JabRef - Meta Issue
- A Document AI Package: https://github.com/deepdoctection/deepdoctection
- Hand-written text recognition in historical documents: https://github.com/githubharald/SimpleHTR#handwritten-text-recognition-with-tensorflow
- Java OCR with Tesseract: Baeldung Guide
- Tesseract OCR Library: Official Documentation
- OCRmyPDF Installation and Usage: GitHub Repository
- ChatOCR and ChatGPT Integration: Blog Article
- AI-Powered OCR: Addepto Blog
- Tika OCR Integration: Apache Tika Wiki
- Surya: AI-powered OCR, reportedly better than Tesseract, but written in Python: VikParuchuri/Surya
- State-of-the-art (as of October 2025) language model for OCR: PaddleOCR-VL; supported by llama.cpp via PR 16701
Some aspects:
- Add an option to call an OCR engine from JabRef, e.g., a cloud-based service or a local installation
- Define a common interface to support multiple OCR engines
- Provide a good default set of settings for the OCR engines
- Support expert configuration of the settings
- Add the extracted text as a layer to the PDF so that Apache Lucene can parse it
- Add an option to further process the text with Grobid for training and metadata extraction
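The common interface, default settings, and expert configuration mentioned above could be sketched roughly as follows. This is only an illustration: all type names (`OcrEngine`, `OcrSettings`, `OcrResult`, `OcrEngineRegistry`, `CannedTextEngine`) are hypothetical and do not exist in JabRef, and the default language `eng` is an assumption.

```java
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical settings: an OCR language code plus free-form expert options. */
record OcrSettings(String language, Map<String, String> expertOptions) {
    /** A default for users who do not want to tune anything (assumed: English). */
    static OcrSettings defaults() {
        return new OcrSettings("eng", Map.of());
    }
}

/** Hypothetical result: which engine ran and the text it recognized. */
record OcrResult(String engineName, String text) {}

/** Hypothetical common interface that each local or cloud backend would implement. */
interface OcrEngine {
    String name();

    /** Whether the engine can be used right now (binary installed, service reachable). */
    boolean isAvailable();

    OcrResult recognize(Path pdf, OcrSettings settings);
}

/** Keeps engines in registration order and picks the first usable one. */
final class OcrEngineRegistry {
    private final Map<String, OcrEngine> engines = new LinkedHashMap<>();

    void register(OcrEngine engine) {
        engines.put(engine.name(), engine);
    }

    Optional<OcrEngine> firstAvailable() {
        return engines.values().stream().filter(OcrEngine::isAvailable).findFirst();
    }
}

/** Stand-in engine returning canned text; a real one would call an actual OCR backend. */
record CannedTextEngine(String name, boolean isAvailable, String cannedText) implements OcrEngine {
    @Override
    public OcrResult recognize(Path pdf, OcrSettings settings) {
        return new OcrResult(name, cannedText);
    }
}
```

With such a registry, JabRef code would only depend on `OcrEngine`, so adding a new backend would mean registering one more implementation rather than touching the callers.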
Expected outcome:
A) Develop a common interface within JabRef to accommodate multiple OCR engines, ensuring flexibility and expandability.
B) Enable expert users to fine-tune OCR settings, catering to specific needs or document formats.
C) Incorporate the OCR-extracted text as a searchable layer in PDFs, allowing Apache Lucene to index and search the content.
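One possible way to obtain the searchable layer from outcome C is to delegate to a locally installed OCRmyPDF, which is listed under the useful links. The sketch below assumes that the `ocrmypdf` executable is on the `PATH`; the class `OcrMyPdfCommand` and its methods are hypothetical names, not part of JabRef or OCRmyPDF. The flags `--skip-text` and `-l` are OCRmyPDF command-line options; expert users could pass further options such as `--deskew`.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that builds and runs an OCRmyPDF command line. */
final class OcrMyPdfCommand {
    private OcrMyPdfCommand() {
    }

    /** Builds the argument list; expertArgs are appended before the file names. */
    static List<String> build(Path input, Path output, String language, List<String> expertArgs) {
        List<String> command = new ArrayList<>();
        command.add("ocrmypdf");    // assumed to be installed and on the PATH
        command.add("--skip-text"); // leave pages that already contain text untouched
        command.add("-l");          // OCR language code, e.g. "eng" or "deu"
        command.add(language);
        command.addAll(expertArgs);
        command.add(input.toString());
        command.add(output.toString());
        return command;
    }

    /** Runs the command and returns its exit code (0 means success). */
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }
}
```

Keeping command construction separate from execution lets the argument list be checked without OCRmyPDF installed. The output PDF then carries a text layer that a regular PDF text extractor, and therefore the Lucene indexer, can read.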
Skills required:
- Proficiency in Java programming.
- A keen interest and curiosity in document processing and AI technologies.
Possible mentors:
@Siedlerchr, @InAnYan, @calixtus, @subhramit
Project size:
90h (small)