“Reading text with deep learning”

source link: https://jhui.github.io/2017/01/15/OCR-with-deep-learning/

Jan 15, 2017

Deepmind’s end-to-end text spotting pipeline using CNN

Images in this section are taken from Max Jaderberg et al. unless stated otherwise.

Proposal generation

First, it uses cheap classifiers to produce region proposals with high recall, though not necessarily high precision. Later stages reject most of the false positives. A proposal $b_p$ is considered a true positive if its overlap with the ground-truth box $b_t$ exceeds a threshold:

$$\frac{|b_p \cap b_t|}{|b_p \cup b_t|} > \text{threshold}$$

It uses a threshold as low as 50%, which may generate many false positives.
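The overlap criterion above is the usual intersection over union (IoU). A minimal sketch, assuming boxes are given as `(x1, y1, x2, y2)` corner coordinates; the function names and the box format are illustrative and not taken from the paper:

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(proposal, truth, threshold=0.5):
    """A proposal counts as a hit when its IoU with the ground truth exceeds the threshold."""
    return iou(proposal, truth) > threshold
```

With a 50% threshold, a proposal shifted by half its width against an equally sized ground-truth box (IoU of 1/3) would be rejected.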

  • Use both Edge Boxes and an aggregate channel features (ACF) detector to generate proposals
edgeb.png

Source: Lawrence Zitnick et al.

detect.png

Filter and refinement

Proposal generation produces thousands of bounding boxes, many of which are false positives. It uses

  • a random forest classifier on HOG features to eliminate false positives,
  • and a CNN-based bounding box regressor to refine each bounding box (from red to green).
np.png

Text Recognition

Generate training data:

gimg.png

CNN classifier:

$$
\begin{aligned}
w^* &= \arg\max_{w \in W} P(w \mid x, L) && \text{where } L \text{ is the language} \\
P(w \mid x, L) &= \frac{P(w \mid x)\, P(w \mid L)\, P(x)}{P(x \mid L)\, P(w)} \\
&= \frac{P(w \mid x)\, P(w \mid L)}{P(w)} && \text{given } x \text{ is independent of } L \\
w^* &= \arg\max_{w \in W} P(w \mid x)\, P(w \mid L) && \text{treating the prior } P(w) \text{ as uniform}
\end{aligned}
$$

$P(w \mid x)$ is modelled by the softmax output of a CNN, after resampling the region to a fixed height and width.

cnn.png

and the language-based word prior $P(w \mid L)$ can be modelled by a lexicon.
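To make the scoring rule concrete, here is a minimal sketch that picks the word maximising $P(w \mid x)\,P(w \mid L)$. The dictionaries standing in for the CNN softmax output and the lexicon prior, the `floor` value for words missing from the lexicon, and the function name are all assumptions for illustration, not part of the paper:

```python
import math


def best_word(cnn_probs, lexicon_prior, floor=1e-12):
    """Return (word, log_score) maximising P(w|x) * P(w|L).

    cnn_probs:     hypothetical dict word -> P(w|x), e.g. a softmax output
    lexicon_prior: hypothetical dict word -> P(w|L) from a lexicon
    floor:         assumed tiny probability for words outside the lexicon
    """
    best, best_score = None, -math.inf
    for word, p_wx in cnn_probs.items():
        p_wl = lexicon_prior.get(word, 0.0)
        # Sum log-probabilities instead of multiplying, to avoid underflow.
        score = math.log(max(p_wx, floor)) + math.log(max(p_wl, floor))
        if score > best_score:
            best, best_score = word, score
    return best, best_score


# The CNN slightly prefers a non-word; the lexicon prior flips the decision.
word, _ = best_word({"hause": 0.55, "house": 0.45}, {"house": 0.01})
```

Here `word` ends up as `"house"`, because `"hause"` is not in the lexicon and receives only the floor probability.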

Merging & ranking

The detections may still contain false positives and duplicates, so a final merging and ranking step produces the text spotting result.

For each bounding box $b$:

$$
\begin{aligned}
w_b &= \arg\max_{w \in W} P(w \mid b, I) \\
s_b &= \max_{w \in W} P(w \mid b, I)
\end{aligned}
$$

To merge detections of the same word, it applies non-maximum suppression (NMS) to detections that share a word label. It also performs NMS across detections of different words that overlap, suppressing the lower-scoring ones.

It performs multiple rounds of bounding box regression and NMS to refine the bounding box. Performing NMS between each regression causes similar bounding boxes to be grouped as a single detection. This causes the overlap of detections to converge on a more accurate detection.
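The two NMS passes can be sketched as follows. This is a greedy NMS under assumed conventions: detections are `(box, score, word)` tuples with `(x1, y1, x2, y2)` boxes, and the two overlap thresholds are placeholder values rather than the ones used in the paper. The bounding box regression rounds are not shown.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it too much."""
    kept = []
    for det in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(iou(det[0], k[0]) <= iou_threshold for k in kept):
            kept.append(det)
    return kept


def merge_detections(detections, same_word_iou=0.3, cross_word_iou=0.5):
    """First NMS within each word label, then NMS across all surviving detections."""
    by_word = {}
    for det in detections:
        by_word.setdefault(det[2], []).append(det)
    survivors = []
    for group in by_word.values():
        survivors.extend(nms(group, same_word_iou))
    return nms(survivors, cross_word_iou)
```

Running NMS within a word label first collapses near-duplicate boxes for the same word before competing labels for the same region are compared.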

mps.png

Dropbox OCR

Image in this section is from here

box.png

Word detector

It uses Maximally Stable Extremal Regions (MSER) in OpenCV as the word detector.

Word deep net

box2.png

If the score was somewhere in the middle, it runs through a lexicon generated from the Oxford English Dictionary, applying different transformations between and within word prediction boxes, attempting to combine words or split them using the lexicon.
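The source does not spell out which transformations are applied, so the following is only a hypothetical sketch of the idea of splitting or combining predictions against a lexicon. The helper names, the tiny lexicon, and the exact matching rule are assumptions for illustration, not Dropbox's implementation:

```python
def split_with_lexicon(token, lexicon):
    """Return (left, right) if the token splits into two lexicon words, else None."""
    for i in range(1, len(token)):
        left, right = token[:i], token[i:]
        if left in lexicon and right in lexicon:
            return left, right
    return None


def merge_with_lexicon(left, right, lexicon):
    """Return the joined word if two adjacent predictions form a lexicon word, else None."""
    joined = left + right
    return joined if joined in lexicon else None


# Hypothetical toy lexicon standing in for a real dictionary-derived one.
lexicon = {"the", "wild", "notebook"}
split_result = split_with_lexicon("thewild", lexicon)
merge_result = merge_with_lexicon("note", "book", lexicon)
```

In this toy example, `split_result` is `("the", "wild")` and `merge_result` is `"notebook"`.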

exp.png

Recursive Recurrent Nets with Attention Modeling for OCR

Recursive Recurrent Nets with Attention Modeling for OCR in the Wild, Chen-Yu Lee

sc22.png
