By Manny Fernandez

January 5, 2026

Searching .jpg and .pdf Files with OCR and grep

JPG

When I do my expenses, I usually have a bunch of .jpg files.  I wanted to be able to search those images for particular content (e.g. $89.33).

I created a Python script that uses Tesseract to OCR the .jpg files and search the resulting text.

Tesseract is the core optical character recognition (OCR) engine, while Pytesseract is a Python wrapper that provides an interface to run the Tesseract engine from within Python code. You cannot use Pytesseract without having the Tesseract engine installed separately on your system.  The steps below assume macOS with Homebrew.

Installing Requirements 

brew install tesseract

python3 -m pip install pytesseract pillow

Running the Script

The script needs two things from you: the text you are searching for and the location of the files.  The script itself has four parts:

  1. The import of pytesseract.
  2. The location of your tesseract binary.
  3. The location of the files.  Since I have multiple computers, the paths are a little different.
  4. What we are looking for.

When the script is run, it scans the files and returns the names of the files where it found the keyword we are searching for.
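
A minimal sketch of a script along these lines might look like the following. It is not the exact script described above: the function names ocr_image and find_matches are hypothetical, the keyword and folder are passed as command-line arguments instead of being edited in the file, the match is case-insensitive by choice, and the commented-out tesseract_cmd path is an assumption (a common Homebrew location on Apple Silicon).

```python
import sys
from pathlib import Path


def ocr_image(path):
    """Return the text Tesseract recognizes in one image file."""
    # Imported here so the search logic below can be used without the OCR stack.
    import pytesseract
    from PIL import Image

    # Assumption: only needed if tesseract is not on your PATH.
    # pytesseract.pytesseract.tesseract_cmd = "/opt/homebrew/bin/tesseract"
    with Image.open(path) as img:
        return pytesseract.image_to_string(img)


def find_matches(folder, keyword, ocr=ocr_image):
    """Return the names of .jpg/.jpeg files whose OCR text contains keyword."""
    matches = []
    for path in sorted(Path(folder).iterdir()):
        if path.suffix.lower() not in {".jpg", ".jpeg"}:
            continue  # skip anything that is not a JPEG
        if keyword.lower() in ocr(path).lower():
            matches.append(path.name)
    return matches


def main(argv):
    if len(argv) != 3:
        print("usage: python3 ocr_search.py KEYWORD FOLDER")
        return
    keyword, folder = argv[1], argv[2]
    for name in find_matches(folder, keyword):
        print(name)


if __name__ == "__main__":
    main(sys.argv)
```

Saved under a name such as ocr_search.py (hypothetical), it could be run as python3 ocr_search.py "89.33" followed by the folder of receipts. OCR is not perfect, so a receipt with a blurry total may be missed even when the amount is there.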

The way I run it is to keep the script open in Sublime Text.  I had two folders, so between runs I would modify the path (which folder to search) and the text to search for.

PDFs

For PDFs, I use pdfgrep, which is an open source project.  The use case is that after I am done filling out a report, I can save it as a PDF.  Sometimes I find a receipt and want to check whether I already expensed it.  Rather than opening every PDF and searching it, I use pdfgrep.  To install it on macOS (if you are running Homebrew):

brew install pdfgrep

You can run pdfgrep against a single file or an entire directory.  In my workflow, I want to scan all the PDF’d reports to confirm I expensed a particular charge based on the amount.  In this example, I am searching for 51.96, which is the amount I want to check, and writing the names of the matching files to matches.txt.

pdfgrep -ril "51.96" /Users/mannyfernadez/Desktop/Reports > matches.txt

-r, --recursive
Recursively search all files (restricted by --include and --exclude) under each directory, following symlinks only if they are on the command line.

-i, --ignore-case
Ignore case distinctions in both the PATTERN and the input files.

-l, --files-with-matches
Suppress normal output. Instead print the name of each input file that contains a match. This works well with -Z, but many other output options like -n or -c are ignored when -l is specified.

This allows me to validate the expense by going directly to the file.
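
One thing to keep in mind: pdfgrep treats the pattern as a regular expression, so the . in 51.96 matches any single character, and a string such as 51896 would also count as a hit. The short Python sketch below illustrates the same wildcard behavior with the standard re module. Python's regex dialect differs from pdfgrep's in other details, and the names loose and strict are only for illustration.

```python
import re

loose = "51.96"              # the dot is a wildcard here
strict = re.escape("51.96")  # the dot is escaped, so it must match literally

# The unescaped pattern also hits an unrelated number.
print(bool(re.search(loose, "Ref 51896")))      # True (false positive)

# The escaped pattern only hits the literal amount.
print(bool(re.search(strict, "Ref 51896")))     # False
print(bool(re.search(strict, "Total $51.96")))  # True
```

With pdfgrep itself, escaping the dot in the pattern ("51\.96") or using its -F (--fixed-strings) option should avoid this kind of false positive.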

 
