Install Tesseract OCR 5 on Rocky Linux 8|AlmaLinux 8

 2 years ago
source link: https://computingforgeeks.com/install-tesseract-ocr-on-rocky-almalinux/

In day-to-day work, we often need to extract text from documents. This is where OCR software comes in. OCR (Optical Character Recognition) is a technology used to recognize the text in images or scanned documents. Tesseract is one of the tools in this category.

Tesseract is a free and open-source OCR engine originally developed by Hewlett-Packard. It was open-sourced by HP in 2005, and Google took up its development in 2006. Tesseract 5.0.0 was released on November 30, 2021, with several improvements and bug fixes. Some of the features of Tesseract OCR 5 include:

  • Added support for Apple Silicon
  • Lower memory consumption than earlier versions
  • Support for using floats for LSTM model training and text recognition.

This guide walks through how to install Tesseract OCR 5 on Rocky Linux 8|AlmaLinux 8.

Let’s dive in!

Install Tesseract OCR 5 on Rocky Linux 8|AlmaLinux 8

There are a couple of methods one can use to install Tesseract OCR 5 on Rocky Linux 8|AlmaLinux 8. In this guide, we will cover the two methods below:

  • Building from Source
  • Using Docker/Podman Containers

Method 1 – Install Tesseract OCR 5 on Rocky Linux 8|AlmaLinux 8 from Source

In this step, we will start by installing the required packages to build Tesseract OCR 5 from a source file.

sudo yum install git automake make autoconf libtool clang gcc-c++.x86_64

Leptonica, the image library Tesseract depends on, requires some additional packages.

sudo yum install zlib zlib-devel libjpeg libjpeg-devel libwebp libwebp-devel libtiff libtiff-devel libpng libpng-devel

Copy the shared libraries to /usr/local/lib.

cd /usr/local/lib
sudo cp /usr/lib64/libjpeg.so.62 .
sudo cp /usr/lib64/libwebp.so.7 .
sudo cp /usr/lib64/libtiff.so.5 .
sudo cp /usr/lib64/libpng16.so.16 .

Compile Leptonica from source. Begin by cloning it from git as below.

cd ~
git clone https://github.com/DanBloomberg/leptonica.git --depth 1
cd leptonica

Now compile and install Leptonica.

./autogen.sh
./configure --prefix=/usr/local --disable-shared --enable-static --with-zlib --with-jpeg --with-libwebp  --with-libtiff --with-libpng --disable-dependency-tracking
make
sudo make install
sudo ldconfig

Once Leptonica is installed, download the latest Tesseract OCR 5 release from GitHub. The commands below look up the latest release tag and fetch the matching source archive with Wget:

cd ~
VER=$(curl -s https://api.github.com/repos/tesseract-ocr/tesseract/releases/latest|grep tag_name | cut -d '"' -f 4)
wget https://github.com/tesseract-ocr/tesseract/archive/refs/tags/$VER.tar.gz -O tesseract-5.tar.gz
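The VER assignment above filters the JSON returned by the GitHub API. As a rough sketch of what the grep and cut stages do, the snippet below runs the same pipeline on a hand-written sample. The sample JSON and the function name extract_tag are illustrative assumptions, not real API output.

```shell
#!/bin/sh
# extract_tag: read release JSON on stdin and print the tag_name value.
# Uses the same grep | cut pipeline as the VER assignment above.
extract_tag() {
    grep tag_name | cut -d '"' -f 4
}

# Hand-written sample standing in for the API response (assumption).
sample='{
  "url": "https://example.invalid/releases/1",
  "tag_name": "5.0.0",
  "name": "5.0.0"
}'

# Splitting the tag_name line on double quotes, the 4th field is the version.
VER=$(printf '%s\n' "$sample" | extract_tag)
echo "$VER"
```

If VER comes back empty on a real run, the API request most likely failed, and the wget URL will be invalid.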

Extract the archive.

tar zxvf tesseract-5.tar.gz

Navigate into the extracted directory.

cd tesseract-*/

Now compile Tesseract OCR 5.

export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
./autogen.sh
./configure --prefix=/usr/local --disable-shared --enable-static --with-extra-libraries=/usr/local/lib/ --with-extra-includes=/usr/local/lib/
make

Install Tesseract OCR 5 using the command:

sudo make install
sudo ldconfig
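Before moving on, it can help to confirm that the new binary is reachable. The snippet below is a minimal sketch of such a check; the helper name have_tool is an assumption made for illustration.

```shell
#!/bin/sh
# have_tool: succeed if the named command can be found on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

if have_tool tesseract; then
    # Print the installed version and the libraries it was built against.
    tesseract --version
else
    echo "tesseract is not on PATH; check that /usr/local/bin is included"
fi
```

Because the build was configured with --prefix=/usr/local, the binary should land in /usr/local/bin.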

Load tesseract languages

Create a directory in your home folder to hold the Tesseract language data.

mkdir -p ~/tess/traineddata

Export the Tesseract path by adding the below line to ~/.bashrc.

export TESSDATA_PREFIX=/home/$USER/tess/traineddata

You can replace $USER with the exact username on the system.

Source the profile.

source ~/.bashrc

Now download the trained data files you need from the tessdata repository on GitHub into this directory.

cd $TESSDATA_PREFIX
wget https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata
wget https://github.com/tesseract-ocr/tessdata/raw/main/fra.traineddata
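For more than a couple of languages, the URLs can be generated in a loop. This is a small sketch that only prints the URLs. The function name tessdata_url and the language list are illustrative assumptions, and the URL pattern is copied from the wget commands above.

```shell
#!/bin/sh
# tessdata_url: print the download URL for one language code.
tessdata_url() {
    printf 'https://github.com/tesseract-ocr/tessdata/raw/main/%s.traineddata\n' "$1"
}

# Print one URL per language; edit the list to suit your documents.
for lang in eng fra ces; do
    tessdata_url "$lang"
done
```

Run from inside $TESSDATA_PREFIX, the printed list can be piped to wget -i - to download every file in one go.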

Method 2 – Install Tesseract OCR 5 on Rocky Linux 8|AlmaLinux 8 Using Docker/Podman

This method assumes that Docker or Podman is already installed on your system. If it is not, install Docker Engine on Rocky Linux 8|AlmaLinux 8 first.

Podman can be installed using the command below:

sudo yum install podman

Once the preferred container engine has been installed, create the Dockerfile for Tesseract OCR 5.

vim Dockerfile

In the file, add the lines below:

# Pin Ubuntu 20.04 (focal) to match the focal package repository added below
FROM ubuntu:20.04
RUN apt-get update -y
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get install -y gnupg apt-transport-https apt-utils wget
# Add the notesalexp.org repository that provides the Tesseract 5 packages
RUN echo "deb https://notesalexp.org/tesseract-ocr5/focal/ focal main" \
|tee /etc/apt/sources.list.d/notesalexp.list > /dev/null
RUN wget -O - https://notesalexp.org/debian/alexp_key.asc | apt-key add -
RUN apt-get update -y
RUN apt-get install tesseract-ocr -y
RUN apt-get install imagemagick -y
ENTRYPOINT ["tesseract"]
# Print the version at build time to confirm the installation
RUN tesseract -v

Save the file and build the Tesseract 5 image:

$ docker build . -t tesseract:5
##OR
$ podman build . -t tesseract:5

Once the build completes, the end of the output looks like this:

Removing intermediate container a9ecd5a7810f
 ---> e1f76f250bc1
Step 10/11 : ENTRYPOINT ["tesseract"]
 ---> Running in 1886dfbcfd5e
Removing intermediate container 1886dfbcfd5e
 ---> 30afc1531eb9
Step 11/11 : RUN tesseract -v
 ---> Running in 598972cbd362
tesseract 5.0.1
 leptonica-1.79.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found OpenMP 201511
 Found libarchive 3.4.0 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4
 Found libcurl/7.68.0 OpenSSL/1.1.1f zlib/1.2.11 brotli/1.0.7 libidn2/2.2.0 libpsl/0.21.0 (+libidn2/2.2.0) libssh/0.9.3/openssl/zlib nghttp2/1.40.0 librtmp/2.3
Removing intermediate container 598972cbd362
 ---> 35f8f89ea444
Successfully built 35f8f89ea444
Successfully tagged tesseract:5

Check the image:

##For Docker 
$ docker images
REPOSITORY   TAG       IMAGE ID       CREATED          SIZE
tesseract    5         35f8f89ea444   14 seconds ago   302MB

##For Podman
$ podman images
REPOSITORY                        TAG         IMAGE ID      CREATED        SIZE
localhost/tesseract               5           f92efe531f58  2 minutes ago  309 MB

With the Tesseract image built, you can now run Tesseract using the syntax below.

##For Docker
docker run --rm -v /path/to/image/:/tmp/:z tesseract:5 /tmp/img.jpg stdout

##For Podman
podman run --rm -v /path/to/image/:/tmp/:z localhost/tesseract:5 /tmp/img.jpg stdout

In the above command:

  • --rm removes the container once the command finishes.
  • -v mounts a volume: /path/to/image/ is the local directory mounted at /tmp/ in the container, so /tmp/img.jpg is the image file as seen from inside the container.
  • tesseract:5 is the container image built earlier.
  • :z modifies the SELinux label. It indicates that the bind mount content is shared among multiple containers.
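To avoid retyping the mount options, the command can be assembled by a small helper. The sketch below only prints the command rather than running it. The function name ocr_command is an assumption, and it expects an absolute path to the image file.

```shell
#!/bin/sh
# ocr_command: print (not run) the container command for one image.
#   $1 = container engine: docker or podman
#   $2 = absolute path to the image file
ocr_command() {
    engine=$1
    dir=$(dirname "$2")    # host directory to mount at /tmp/
    file=$(basename "$2")  # file name as seen inside the container
    case $engine in
        podman) image=localhost/tesseract:5 ;;
        *)      image=tesseract:5 ;;
    esac
    echo "$engine run --rm -v $dir/:/tmp/:z $image /tmp/$file stdout"
}

ocr_command docker /path/to/image/img.jpg
```

Once the printed command looks right, it can be copied and run as is.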

Use Tesseract OCR 5 on Rocky Linux 8|AlmaLinux 8.

Once Tesseract OCR 5 has been installed, you can now start extracting text from scanned documents or images.

To convert an image to a text file, use the syntax below.

tesseract <image_name> <output_file_name>

For example, to convert the image image.png to a text file, the command is:

tesseract image.png new

The output is a text file, new.txt, containing the text recognized in image.png.

For Docker/Podman, the command is as below, assuming the image file is at /home/thor/Desktop/image.png:

##For Docker
docker run --rm -v /home/thor/Desktop/:/tmp/:z tesseract:5 /tmp/image.png stdout

##For Podman
podman run --rm -v /home/thor/Desktop/:/tmp/:z localhost/tesseract:5 /tmp/image.png stdout

Alternatively, navigate into the directory containing the image file and shorten the command as below (example with Podman).

podman run --rm -v $(pwd)/:/tmp/:z localhost/tesseract:5 /tmp/image.png stdout 

This prints the text recognized in image.png to the terminal.

You can also write the output to a file, say new.txt, as below.

tesseract image.png new

Example for Docker:

docker run --rm -v $(pwd)/:/tmp/:z tesseract:5 /tmp/image.png /tmp/new

Or write it to a PDF file (new.pdf) as below.

tesseract image.png new pdf

Example for Docker:

docker run --rm -v $(pwd)/:/tmp/:z tesseract:5 /tmp/image.png /tmp/new pdf
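When several scans need converting, the output name can be derived from each image name. The sketch below prints one tesseract command per file instead of running it. The helper name output_base and the file names scan1.png and scan2.jpg are hypothetical.

```shell
#!/bin/sh
# output_base: strip the directory and extension from an image path,
# leaving the base name that tesseract appends .txt or .pdf to.
output_base() {
    name=$(basename "$1")
    echo "${name%.*}"
}

# Print one command per file; drop the leading "echo" to run them.
for img in scan1.png scan2.jpg; do
    echo tesseract "$img" "$(output_base "$img")" pdf
done
```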

Specify the language.

When using Tesseract OCR, you can specify the language to use with the -l flag. The matching trained data file must be present in the TESSDATA_PREFIX directory. For example, to use Czech:

tesseract image.png new -l ces

You can specify multiple languages as well.

tesseract image.png new -l ces+eng
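If the language list comes from a script, the codes have to be joined with a plus sign before being passed to -l. A minimal sketch follows; the function name join_langs is an assumption, and the final command is only printed.

```shell
#!/bin/sh
# join_langs: join language codes with "+", the separator -l expects.
join_langs() {
    [ $# -gt 0 ] || return 1   # need at least one language code
    joined=$1
    shift
    for code in "$@"; do
        joined="$joined+$code"
    done
    echo "$joined"
}

LANGS=$(join_langs ces eng fra)
echo "tesseract image.png new -l $LANGS"
```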

The end!

Closing Thoughts.

We have triumphantly installed and used Tesseract OCR 5 on Rocky Linux 8|AlmaLinux 8. You can now extract text from images and scanned documents easily. I hope this was helpful.
