Improve handling of older documents with OCR and AI

Posted Dec 26, 2025

JabRef, a comprehensive literature management tool, currently handles both bibliographic metadata and text-based PDF documents. Scanned PDFs, particularly historical articles, remain a significant limitation: their pages are stored as images, so their text cannot be searched. This project aims to close that gap by integrating OCR (Optical Character Recognition) technology, enabling full-text search in scanned PDFs.

Useful links:

OCR Integration in JabRef - Meta Issue
A Document AI Package: https://github.com/deepdoctection/deepdoctection
Hand-written text recognition in historical documents: https://github.com/githubharald/SimpleHTR#handwritten-text-recognition-with-tensorflow
Java OCR with Tesseract: Baeldung Guide
Tesseract OCR Library: Official Documentation
OCRmyPDF Installation and Usage: GitHub Repository
ChatOCR and ChatGPT Integration: Blog Article
AI-Powered OCR: Addepto Blog
Tika OCR Integration: Apache Tika Wiki
Surya, an AI-powered OCR toolkit, reportedly better than Tesseract but written in Python: VikParuchuri/Surya
SOTA (October 2025) language model for OCR: PaddleOCR-VL, supported by llama.cpp with PR 16701

Some aspects:

Add an option to call an OCR engine from JabRef, e.g., a cloud-based service or a local installation
Define a common interface to support multiple OCR engines
Provide a good default set of settings for the OCR engines
Support expert configuration of the settings
Add the extracted text as a layer to the PDF so that Apache Lucene can parse it
Add an option to further process the text with Grobid for training and metadata extraction
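
To make the "common interface" and "default versus expert settings" aspects more concrete, here is a minimal sketch of how they could fit together. Every name in it (OcrEngine, OcrSettings, OcrEngineRegistry, OcrException, and the default values "eng" and 300 dpi) is a hypothetical illustration and not existing JabRef API; the actual design is part of the project.

```java
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical settings shared by all engines; field names are illustrative only. */
record OcrSettings(List<String> languages, int dpi, Map<String, String> expertOptions) {
    OcrSettings {
        // Defensive copies keep the record immutable.
        languages = List.copyOf(languages);
        expertOptions = Map.copyOf(expertOptions);
    }

    /** Assumed defaults: English text, 300 dpi, no engine-specific overrides. */
    static OcrSettings defaults() {
        return new OcrSettings(List.of("eng"), 300, Map.of());
    }

    /** Returns a copy with one extra engine-specific option for expert users. */
    OcrSettings withExpertOption(String key, String value) {
        Map<String, String> merged = new LinkedHashMap<>(expertOptions);
        merged.put(key, value);
        return new OcrSettings(languages, dpi, merged);
    }
}

/** Hypothetical checked exception for engine failures. */
class OcrException extends Exception {
    OcrException(String message, Throwable cause) {
        super(message, cause);
    }
}

/** Hypothetical common interface that local and cloud-based engines would implement. */
interface OcrEngine {
    String name();

    /** True if the engine can be used right now (binary installed, API key set, ...). */
    boolean isAvailable();

    String extractText(Path scannedPdf, OcrSettings settings) throws OcrException;
}

/** Keeps engines in registration order and picks the first usable one. */
class OcrEngineRegistry {
    private final Map<String, OcrEngine> engines = new LinkedHashMap<>();

    void register(OcrEngine engine) {
        engines.put(engine.name(), engine);
    }

    Optional<OcrEngine> firstAvailable() {
        return engines.values().stream().filter(OcrEngine::isAvailable).findFirst();
    }
}
```

With a design along these lines, most users would only ever see OcrSettings.defaults(), while expert users could layer engine-specific options on top through withExpertOption without changing the shared interface.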

Expected outcome:

A) Develop a common interface within JabRef to accommodate multiple OCR engines, ensuring flexibility and extensibility.
B) Enable expert users to fine-tune OCR settings, catering to specific needs or document formats.
C) Incorporate the OCR-extracted text as a searchable layer in PDFs, allowing Apache Lucene to index and search the content.
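
One possible route to outcome C, sketched here only as an illustration, is to delegate to a locally installed OCRmyPDF (one of the tools linked above), which writes a copy of the PDF with a text layer added. The helper class OcrMyPdfCommand is hypothetical and not part of JabRef; the sketch assumes the ocrmypdf executable is on the PATH and uses only its -l (language), --skip-text, and --deskew options. Whether JabRef should call an external process at all is an open design question for the project.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that builds and runs an OCRmyPDF command line. */
final class OcrMyPdfCommand {
    private OcrMyPdfCommand() {
    }

    /** Builds the argument list; languages are joined as "eng+deu". */
    static List<String> build(Path input, Path output, List<String> languages, boolean deskew) {
        List<String> command = new ArrayList<>();
        command.add("ocrmypdf");
        if (!languages.isEmpty()) {
            command.add("-l");
            command.add(String.join("+", languages));
        }
        // Leave pages that already contain text untouched.
        command.add("--skip-text");
        if (deskew) {
            command.add("--deskew");
        }
        command.add(input.toString());
        command.add(output.toString());
        return List.copyOf(command);
    }

    /** Runs the command and returns its exit code (0 means success). */
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }
}
```

Separating build from run keeps the argument construction testable without OCRmyPDF installed; the resulting PDF could then be handed to the existing Lucene-based indexing like any text-based PDF.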

Skills required:

Proficiency in Java programming.
A keen interest and curiosity in document processing and AI technologies.

Possible mentors:

@Siedlerchr, @InAnYan, @calixtus, @subhramit

Project size:

90h (small)

gsoc2026

project-idea size-small

This post is licensed under CC BY 4.0 by the author.
