Install Tesseract OCR 5 on Rocky Linux 8|AlmaLinux 8
source link: https://computingforgeeks.com/install-tesseract-ocr-on-rocky-almalinux/
In day-to-day work, we often need to extract text from documents. This is where OCR software comes in. OCR (Optical Character Recognition) is a technology used to recognize the text in images or scanned documents. Tesseract is one of the tools in this category.
Tesseract is a free and open-source OCR engine originally developed by Hewlett-Packard. It was open-sourced by HP in 2005 and has been developed by Google since 2006. Tesseract 5.0.0 was released as the stable version on November 30, 2021. This release has several improvements and bug fixes. Some of the features of Tesseract OCR 5 include:
- Added support for Apple Silicon
- Low memory consumption as compared to earlier versions
- Support for using floats for LSTM model training and text recognition.
This guide walks through how to install Tesseract OCR 5 on Rocky Linux 8|AlmaLinux 8, step by step.
Let’s dive in!
Install Tesseract OCR 5 on Rocky Linux 8|AlmaLinux 8
There are a couple of methods one can use to install Tesseract OCR 5 on Rocky Linux 8|AlmaLinux 8. In this guide, we will cover the two methods below:
- Building from Source
- Using Docker/Podman Containers
Method 1 – Install Tesseract OCR 5 on Rocky Linux 8|AlmaLinux 8 from Source
We will start by installing the packages required to build Tesseract OCR 5 from source.
sudo yum install git automake make autoconf libtool clang gcc-c++.x86_64
Leptonica, the image library Tesseract depends on, needs some additional packages.
sudo yum install zlib zlib-devel libjpeg libjpeg-devel libwebp libwebp-devel libtiff libtiff-devel libpng libpng-devel
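Before moving on, it can help to confirm that the build tools actually ended up on your PATH. The snippet below is a minimal sketch and not part of the original steps: missing_tools is a hypothetical helper name, and the tool list assumes the binaries shipped by the packages above (g++ comes from gcc-c++).

```shell
# Print every tool from the argument list that is NOT found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Tool names are assumed to match the packages installed above.
missing=$(missing_tools git automake make autoconf libtool clang g++)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Missing build tools:" $missing
else
    echo "All build tools found."
fi
```

If anything is reported as missing, rerun the yum command for that package before compiling.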
Copy the shared libraries into /usr/local/lib so the build can find them.
cd /usr/local/lib
sudo cp /usr/lib64/libjpeg.so.62 .
sudo cp /usr/lib64/libwebp.so.7 .
sudo cp /usr/lib64/libtiff.so.5 .
sudo cp /usr/lib64/libpng16.so.16 .
Compile Leptonica from source. Begin by cloning it from git as below.
cd ~
git clone https://github.com/DanBloomberg/leptonica.git --depth 1
cd leptonica
Now compile and install Leptonica.
./autogen.sh
./configure --prefix=/usr/local --disable-shared --enable-static --with-zlib --with-jpeg --with-libwebp --with-libtiff --with-libpng --disable-dependency-tracking
make
sudo make install
sudo ldconfig
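The Tesseract build locates Leptonica through pkg-config, so it is worth checking that the freshly installed library is visible there. This is a rough sketch under a few assumptions: leptonica_status is a hypothetical helper name, Leptonica is assumed to register its pkg-config module as lept, and the path matches the --prefix=/usr/local used above.

```shell
# Report whether pkg-config can see Leptonica under the /usr/local prefix.
leptonica_status() {
    pc_path=/usr/local/lib/pkgconfig
    if command -v pkg-config >/dev/null 2>&1 &&
       PKG_CONFIG_PATH=$pc_path pkg-config --exists lept; then
        echo "Leptonica $(PKG_CONFIG_PATH=$pc_path pkg-config --modversion lept) found"
    else
        echo "Leptonica is not visible to pkg-config under $pc_path"
    fi
}

leptonica_status
```

If the second message appears, check that make install completed and that a .pc file for Leptonica exists in /usr/local/lib/pkgconfig.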
Once Leptonica is installed, we will proceed to download the Tesseract OCR 5 source from GitHub. The commands below look up the latest release tag and pull the matching archive using Wget:
cd ~
VER=$(curl -s https://api.github.com/repos/tesseract-ocr/tesseract/releases/latest|grep tag_name | cut -d '"' -f 4)
wget https://github.com/tesseract-ocr/tesseract/archive/refs/tags/$VER.tar.gz -O tesseract-5.tar.gz
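To see how the VER lookup works, here is a small offline sketch of the same grep and cut pipeline run against a sample response. The extract_tag name is hypothetical, and the JSON and version number are illustrative stand-ins for what the GitHub API returns.

```shell
# Pull the value of "tag_name" out of a GitHub release JSON document on stdin.
# Splitting the matching line on double quotes makes the tag the 4th field.
extract_tag() {
    grep tag_name | cut -d '"' -f 4
}

# Sample input shaped like the API response (illustrative values only).
sample='{
  "tag_name": "5.3.4",
  "name": "5.3.4"
}'

VER=$(printf '%s\n' "$sample" | extract_tag)

# An empty VER usually means the API call failed or was rate limited.
if [ -n "$VER" ]; then
    echo "https://github.com/tesseract-ocr/tesseract/archive/refs/tags/$VER.tar.gz"
else
    echo "No release tag found; do not run wget with an empty version."
fi
```

Checking that VER is non-empty before calling wget avoids downloading a broken URL when the lookup fails.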
Extract the archive.
tar zxvf tesseract-5.tar.gz
Navigate into the extracted directory.
cd tesseract-*/
Now compile Tesseract OCR 5.
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
./autogen.sh
./configure --prefix=/usr/local --disable-shared --enable-static --with-extra-libraries=/usr/local/lib/ --with-extra-includes=/usr/local/lib/
make
Install Tesseract OCR 5 using the command:
sudo make install
sudo ldconfig
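A quick way to confirm the installation is to ask the binary for its version. The helper below is only a sketch (tesseract_version_line is a hypothetical name) and assumes the binary was installed under /usr/local/bin by the --prefix used above.

```shell
# Print the first line of `tesseract --version`, or a hint if the binary is not on PATH.
tesseract_version_line() {
    if command -v tesseract >/dev/null 2>&1; then
        tesseract --version 2>&1 | head -n 1
    else
        echo "tesseract not found on PATH (expected under /usr/local/bin)"
    fi
}

tesseract_version_line
```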
Load tesseract languages
Load Tesseract languages by creating a directory for the language data in your home directory.
mkdir -p ~/tess/traineddata
Export the Tesseract path by adding the below line to ~/.bashrc.
export TESSDATA_PREFIX=/home/$USER/tess/traineddata
$USER expands to the current username. You can replace it with the exact username on the system if needed.
Source the profile.
source ~/.bashrc
Now download any trained data you need from the tessdata repository on GitHub into that directory.
cd $TESSDATA_PREFIX
wget https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata
wget https://github.com/tesseract-ocr/tessdata/raw/main/fra.traineddata
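If you need several languages, the download URLs all follow the same pattern, so they can be generated in a loop. This is a minimal sketch: traineddata_url is a hypothetical helper, and deu (German) is just an example of a third language code.

```shell
# Build the tessdata download URL for one language code (for example eng, fra, deu).
traineddata_url() {
    printf 'https://github.com/tesseract-ocr/tessdata/raw/main/%s.traineddata\n' "$1"
}

# Print the URLs rather than downloading, so the list can be reviewed first.
for lang in eng fra deu; do
    traineddata_url "$lang"
done
```

Piping the loop's output to wget -P "$TESSDATA_PREFIX" -i - would fetch the files into the data directory, and tesseract --list-langs should then list each downloaded language.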
Method 2 – Install Tesseract OCR 5 on Rocky Linux 8|AlmaLinux 8 Using Docker/Podman
I assume that you already have Docker or Podman installed on your system. If not, install Docker Engine on Rocky Linux 8|AlmaLinux 8 before continuing.
Podman can be installed using the command below:
sudo yum install podman
Once the preferred container engine has been installed, create the Dockerfile for Tesseract OCR 5.
vim Dockerfile
In the file, add the lines below:
# Pin Ubuntu 20.04 (focal) so the base image matches the focal repository added below
FROM ubuntu:20.04
RUN apt-get update -y
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get install -y gnupg apt-transport-https apt-utils wget
# Add the third-party repository that packages Tesseract 5 and trust its signing key
RUN echo "deb https://notesalexp.org/tesseract-ocr5/focal/ focal main" \
|tee /etc/apt/sources.list.d/notesalexp.list > /dev/null
RUN wget -O - https://notesalexp.org/debian/alexp_key.asc | apt-key add -
RUN apt-get update -y
RUN apt-get install tesseract-ocr -y
RUN apt-get install imagemagick -y
ENTRYPOINT ["tesseract"]
# Print the version at build time as a quick check
RUN tesseract -v
Save the file and build the Tesseract 5 image:
$ docker build . -t tesseract:5
##OR
$ podman build . -t tesseract:5
Once the build completes, you will see output similar to this:
Removing intermediate container a9ecd5a7810f
---> e1f76f250bc1
Step 10/11 : ENTRYPOINT ["tesseract"]
---> Running in 1886dfbcfd5e
Removing intermediate container 1886dfbcfd5e
---> 30afc1531eb9
Step 11/11 : RUN tesseract -v
---> Running in 598972cbd362
tesseract 5.0.1
leptonica-1.79.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1
Found AVX2
Found AVX
Found FMA
Found SSE4.1
Found OpenMP 201511
Found libarchive 3.4.0 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4
Found libcurl/7.68.0 OpenSSL/1.1.1f zlib/1.2.11 brotli/1.0.7 libidn2/2.2.0 libpsl/0.21.0 (+libidn2/2.2.0) libssh/0.9.3/openssl/zlib nghttp2/1.40.0 librtmp/2.3
Removing intermediate container 598972cbd362
---> 35f8f89ea444
Successfully built 35f8f89ea444
Successfully tagged tesseract:5
Check the image:
##For Docker
$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
tesseract 5 35f8f89ea444 14 seconds ago 302MB
##For Podman
$ podman images
REPOSITORY TAG IMAGE ID CREATED SIZE
localhost/tesseract 5 f92efe531f58 2 minutes ago 309 MB
With the Tesseract image built, you can now run Tesseract using the syntax below.
##For Docker
docker run --rm -v /path/to/image/:/tmp/:z tesseract:5 /tmp/img.jpg stdout
##For Podman
podman run --rm -v /path/to/image/:/tmp/:z localhost/tesseract:5 /tmp/img.jpg stdout
In the above command:
- --rm removes the container after the command finishes.
- -v specifies the volume: /path/to/image/ is the local path mounted on /tmp/ of the container, and /tmp/img.jpg is the location of the image file inside that mounted path.
- tesseract:5 is the container image created.
- :z modifies the SELinux label. It indicates that the bind mount content is shared among multiple containers.
- stdout tells Tesseract to print the recognized text to the terminal instead of writing it to a file.
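To make the path mapping concrete, the sketch below builds the full command for a given image file without running it. Everything here is an assumption for illustration: ocr_command is a hypothetical helper, and it expects paths without spaces.

```shell
# Print the container command that would OCR one image (dry run; nothing is executed).
# Arguments: container engine, image tag, path to the picture on the host.
ocr_command() {
    engine=$1; image=$2; file=$3
    dir=$(cd "$(dirname "$file")" && pwd)   # absolute host directory to mount
    name=$(basename "$file")                # file name as seen under /tmp/ in the container
    printf '%s run --rm -v %s/:/tmp/:z %s /tmp/%s stdout\n' "$engine" "$dir" "$image" "$name"
}

# Example: show the Podman command for image.png in the current directory.
ocr_command podman localhost/tesseract:5 ./image.png
```

The host directory becomes /tmp/ inside the container, which is why the file is always addressed as /tmp/ followed by its name.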
Use Tesseract OCR 5 on Rocky Linux 8|AlmaLinux 8.
Once Tesseract OCR 5 has been installed, you can now start extracting text from scanned documents or images.
To convert an image to a text file, the syntax used is as below.
tesseract <image_name> <output file_name>
For example, to convert an image to a text file, the command will be:
tesseract image.png new
This writes the text recognized in image.png to a file named new.txt.
For Docker/Podman, the command will be as below, assuming the image file image.png is at /home/thor/Desktop/image.png:
##For Docker
docker run --rm -v /home/thor/Desktop/:/tmp/:z tesseract:5 /tmp/image.png stdout
##For Podman
podman run --rm -v /home/thor/Desktop/:/tmp/:z localhost/tesseract:5 /tmp/image.png stdout
Alternatively, shorten the command as below (example with Podman) after navigating into the directory with the image file.
podman run --rm -v $(pwd)/:/tmp/:z localhost/tesseract:5 /tmp/image.png stdout
This prints the text recognized in image.png to the terminal.
You can also write the output to a file, say new.txt, as below.
tesseract image.png new
Example for Docker:
docker run --rm -v $(pwd)/:/tmp/:z tesseract:5 /tmp/image.png /tmp/new
Or to a PDF file (new.pdf) as below.
tesseract image.png new pdf
Example for Docker:
docker run --rm -v $(pwd)/:/tmp/:z tesseract:5 /tmp/image.png /tmp/new pdf
Specify the language.
When using Tesseract OCR you can specify the language you want to use with the -l flag. For example, use Czech.
tesseract image.png new -l ces
You can specify multiple languages as well.
tesseract image.png new -l ces+eng
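The same -l flag works in a loop when you have a whole folder of scans. The function below is a minimal sketch, assuming tesseract is on your PATH and the language data for the chosen codes is installed. The ocr_dir name is hypothetical, and it only handles .png files.

```shell
# OCR every PNG in a directory, writing NAME.txt next to each NAME.png.
# Arguments: directory, then an optional language spec (defaults to eng).
ocr_dir() {
    dir=$1; langs=${2:-eng}
    for img in "$dir"/*.png; do
        [ -e "$img" ] || continue                     # the glob matched nothing
        tesseract "$img" "${img%.png}" -l "$langs"    # output base is the name without .png
    done
}
```

For example, ocr_dir ~/scans ces+eng would process each PNG in a scans directory (a hypothetical path) using Czech and English together.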
The end!
Closing Thoughts.
We have triumphantly installed and used Tesseract OCR 5 on Rocky Linux 8|AlmaLinux 8. You can now extract text from images and scanned documents easily. I hope this was helpful.