
Deep Learning Based Text Detection Using OpenCV (C++/Python)

The common saying is, “A picture is worth a thousand words.” In this post, we will take that literally and try to find the words in a picture! In an earlier post about Text Recognition, we discussed how Tesseract works and how it can be used along with OpenCV for text detection and recognition. This time, we will look at a robust approach for Text Detection Using OpenCV, based on a recent paper: EAST: An Efficient and Accurate Scene Text Detector.

It should be noted that text detection is different from text recognition. In text detection, we only find the bounding boxes around the text. In text recognition, we find what is actually written inside each box. For example, in the image below, text detection will give you the bounding box around the word, and text recognition will tell you that the box contains the word STOP.

Example image of a road side traffic stop sign, for discussing text detection using OpenCV.

Text Recognition engines such as Tesseract perform better when they are given the bounding box around the text. Thus, this detector can be used to find the bounding boxes before doing Text Recognition.

A TensorFlow re-implementation of the paper reported the following speed on 720p (1280×720) images (source):

  • Graphics card: GTX 1080 Ti
  • Network fprop: ~50 ms
  • NMS (C++): ~6 ms
  • Overall: ~16 fps
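As a quick sanity check on these figures, the arithmetic below converts between per-frame time and frames per second. The source does not break down the remaining time, so the "unaccounted" value is simply what is left over, not a measurement.

```python
# Timings reported for the TensorFlow re-implementation (milliseconds per frame).
network_ms = 50.0   # forward pass
nms_ms = 6.0        # non-maximum suppression (C++)
overall_fps = 16.0  # reported end-to-end throughput

# Frames per second if only the network and NMS contributed.
compute_only_fps = 1000.0 / (network_ms + nms_ms)

# Time budget per frame implied by the reported overall rate.
frame_budget_ms = 1000.0 / overall_fps

# Whatever is left is not itemised in the source (an inference, not a measurement).
other_ms = frame_budget_ms - (network_ms + nms_ms)

print(f"network + NMS alone: {compute_only_fps:.1f} fps")
print(f"per-frame budget at {overall_fps:.0f} fps: {frame_budget_ms:.1f} ms")
print(f"unaccounted overhead: {other_ms:.1f} ms")
```

Since all reported values are approximate, the few milliseconds of difference should be read as a rough indication only.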

The TensorFlow model has been ported for use with OpenCV, and OpenCV also provides sample code. We will discuss how it works step by step. You will need OpenCV >= 3.4.3 to run the code. Let’s detect some text in images!

The steps involved are as follows:

  1. Download the EAST Model
  2. Load the Model into memory
  3. Prepare the input image
  4. Forward pass the blob through the network
  5. Process the output

Step 1: Download EAST Model

The EAST model can be downloaded from this Dropbox link: https://www.dropbox.com/s/r2ingd0l3zt8hxs/frozen_east_text_detection.tar.gz?dl=1.

Once the file has been downloaded (~85 MB), extract it using

tar -xvzf frozen_east_text_detection.tar.gz

You can also extract the contents using your operating system's file manager.

After extracting, copy the .pb model file to the working directory.
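If you prefer to do the extraction from Python, the following is a minimal sketch using only the standard library. The helper name extract_pb_files is made up for this post, and the archive name is assumed to match the download above; it copies any .pb file in the archive straight into the destination directory.

```python
import shutil
import tarfile
from pathlib import Path


def extract_pb_files(archive_path, dest_dir="."):
    """Copy every .pb file found in a .tar.gz archive into dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive.getmembers():
            if not (member.isfile() and member.name.endswith(".pb")):
                continue
            # Use only the base name so the model lands directly in dest_dir.
            target = dest / Path(member.name).name
            with archive.extractfile(member) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            extracted.append(target)
    return extracted


if __name__ == "__main__":
    archive = Path("frozen_east_text_detection.tar.gz")  # assumed download name
    if archive.exists():
        print(extract_pb_files(archive))
    else:
        print(f"{archive} not found; download it first")
```

Writing only the base name of each member avoids recreating any directory structure stored in the archive.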

Step 2: Load the Network

We will use the cv::dnn::readNet (C++) or cv.dnn.readNet (Python) function to load the network into memory. It automatically detects the configuration and framework from the file name specified. In our case, it is a .pb file, so it will assume that a TensorFlow network is to be loaded.

C++

Net net = readNet(model);

Python

net = cv.dnn.readNet(model)
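To illustrate the idea of choosing a framework from the file name, here is a small lookup that restates the extensions listed in the OpenCV documentation for readNet. The guess_framework helper is hypothetical and only mimics the principle; it is not OpenCV code, and the real function also handles configuration files and an explicit framework argument.

```python
from pathlib import Path

# Model-file extensions and the framework readNet associates with each,
# as listed in the OpenCV documentation (illustrative subset).
EXTENSION_TO_FRAMEWORK = {
    ".pb": "tensorflow",
    ".caffemodel": "caffe",
    ".weights": "darknet",
    ".t7": "torch",
    ".net": "torch",
    ".onnx": "onnx",
}


def guess_framework(model_path):
    """Return the framework name implied by the model file's extension."""
    suffix = Path(model_path).suffix.lower()
    try:
        return EXTENSION_TO_FRAMEWORK[suffix]
    except KeyError:
        raise ValueError(f"unrecognised model extension: {suffix!r}") from None


print(guess_framework("frozen_east_text_detection.pb"))  # tensorflow
```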

Step 3: Prepare Input Image

We need to create a 4-D input blob for feeding the image to the network. This is done using the blobFromImage function.


C++

blobFromImage(frame, blob, 1.0, Size(inpWidth, inpHeight), Scalar(123.68, 116.78, 103.94), true, false);

Python

blob = cv.dnn.blobFromImage(frame, 1.0, (inpWidth, inpHeight), (123.68, 116.78, 103.94), True, False)

We need to pass several parameters to this function. They are listed here in the order of the Python call; the C++ overload shown above additionally takes the output blob as its second argument.

  1. The first argument is the image itself.
  2. The second argument specifies the scaling of each pixel value. No scaling is required in this case, so we keep it at 1.0.
  3. The third argument is the input size. The default input to the network is 320×320, so we specify this while creating the blob. You can experiment with other input dimensions as well; the EAST model expects the width and height to be multiples of 32.
  4. We also specify the mean that should be subtracted from each image, since this was used while training the model. The mean used is (123.68, 116.78, 103.94).
  5. The next argument is whether we want to swap the R and B channels. This is required since OpenCV uses the BGR format and TensorFlow uses the RGB format.
  6. The last argument is whether we want to crop the image and take the center crop. We specify False in this case.
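To make these arguments concrete, here is a NumPy sketch of the arithmetic that blobFromImage performs with these settings, for an image that already has the target size. It deliberately skips resizing, and the function name manual_blob is made up for this illustration; it is not how OpenCV implements the call.

```python
import numpy as np


def manual_blob(frame_bgr, mean_rgb=(123.68, 116.78, 103.94), scale=1.0):
    """Mimic blobFromImage(..., swapRB=True, crop=False) without resizing."""
    rgb = frame_bgr[..., ::-1].astype(np.float32)            # BGR -> RGB
    rgb = (rgb - np.asarray(mean_rgb, np.float32)) * scale   # subtract mean, then scale
    chw = rgb.transpose(2, 0, 1)                             # HWC -> CHW
    return chw[np.newaxis, ...]                              # add batch axis -> NCHW


# A dummy 320x320 frame: blue = 10, green = 20, red = 200 everywhere.
frame = np.zeros((320, 320, 3), dtype=np.uint8)
frame[..., 0], frame[..., 1], frame[..., 2] = 10, 20, 200

blob = manual_blob(frame)
print(blob.shape)  # (1, 3, 320, 320)
```

The result is the 4-D blob the network expects: one image, three channels in RGB order, then height and width.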

Step 4: Forward Pass

Now that we have prepared the input, we will pass it through the network. The network has two outputs: one specifies the geometry of the text box, and the other specifies the confidence score of the detected box. These are given by the following layers, respectively:

  • feature_fusion/concat_3
  • feature_fusion/Conv_7/Sigmoid

This is specified in the code as follows:

C++

std::vector<String> outputLayers(2);
outputLayers[0] = "feature_fusion/Conv_7/Sigmoid";
outputLayers[1] = "feature_fusion/concat_3";

Python

outputLayers = []
outputLayers.append("feature_fusion/Conv_7/Sigmoid")
outputLayers.append("feature_fusion/concat_3")

Next, we get the output by passing the input image through the network. As discussed earlier, the output consists of two parts: scores and geometry.

C++

std::vector<Mat> output;
net.setInput(blob);
net.forward(output, outputLayers);

Mat scores = output[0];
Mat geometry = output[1];

Python

net.setInput(blob)
output = net.forward(outputLayers)
scores = output[0]
geometry = output[1]
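The two output maps are a quarter of the input resolution, with one score channel and five geometry channels (four distances to the box edges plus a rotation angle). The hypothetical helper below computes the shapes you should expect for a given input size, assuming the usual EAST constraint that both dimensions are multiples of 32.

```python
def east_output_shapes(inp_width, inp_height):
    """Expected (scores, geometry) shapes in NCHW order for a given input size."""
    if inp_width % 32 or inp_height % 32:
        raise ValueError("input width and height should be multiples of 32")
    out_h, out_w = inp_height // 4, inp_width // 4  # output stride of 4
    scores_shape = (1, 1, out_h, out_w)    # one confidence value per cell
    geometry_shape = (1, 5, out_h, out_w)  # 4 edge distances + 1 angle per cell
    return scores_shape, geometry_shape


print(east_output_shapes(320, 320))  # ((1, 1, 80, 80), (1, 5, 80, 80))
```

Comparing these tuples with scores.shape and geometry.shape after the forward pass is a cheap way to confirm that the output layers were requested in the intended order.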

Step 5: Process The Output

As discussed earlier, we will use the outputs from both layers (i.e., geometry and scores) to decode the positions of the text boxes along with their orientation. We might get many candidates for a text box, so we need to pick the best ones from the lot. This is done using Non-Maximum Suppression.

Decode

C++

std::vector<RotatedRect> boxes;
std::vector<float> confidences;
decode(scores, geometry, confThreshold, boxes, confidences);

Python

[boxes, confidences] = decode(scores, geometry, confThreshold)
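Note that decode is not an OpenCV function; it comes with the sample code. The sketch below is modelled on that sample and written in plain Python, so it works on nested lists as well as NumPy arrays. It assumes scores of shape 1×1×H×W, geometry of shape 1×5×H×W (four edge distances and an angle in radians), and an output stride of 4; treat the details as an approximation of the sample, not a copy of it.

```python
import math


def decode(scores, geometry, score_thresh):
    """Turn EAST score/geometry maps into rotated boxes and confidences.

    Each box is ((cx, cy), (width, height), angle_in_degrees), the tuple
    form used for rotated rectangles in OpenCV's Python bindings.
    """
    boxes, confidences = [], []
    height = len(scores[0][0])
    width = len(scores[0][0][0])
    for y in range(height):
        for x in range(width):
            score = float(scores[0][0][y][x])
            if score < score_thresh:
                continue
            # Distances from this cell to the top, right, bottom and left edges.
            top, right, bottom, left = (float(geometry[0][c][y][x]) for c in range(4))
            angle = float(geometry[0][4][y][x])
            cos_a, sin_a = math.cos(angle), math.sin(angle)
            box_h, box_w = top + bottom, right + left
            # Each output cell corresponds to a 4x4 block of input pixels.
            off_x = x * 4.0 + cos_a * right + sin_a * bottom
            off_y = y * 4.0 - sin_a * right + cos_a * bottom
            p1 = (off_x - sin_a * box_h, off_y - cos_a * box_h)
            p3 = (off_x - cos_a * box_w, off_y + sin_a * box_w)
            center = (0.5 * (p1[0] + p3[0]), 0.5 * (p1[1] + p3[1]))
            boxes.append((center, (box_w, box_h), -angle * 180.0 / math.pi))
            confidences.append(score)
    return boxes, confidences


# Toy 2x2 maps: only the cell at x=1, y=1 passes the threshold.
toy_scores = [[[[0.1, 0.2], [0.3, 0.9]]]]
toy_geometry = [[
    [[0.0, 0.0], [0.0, 4.0]],  # distance to top edge
    [[0.0, 0.0], [0.0, 6.0]],  # distance to right edge
    [[0.0, 0.0], [0.0, 4.0]],  # distance to bottom edge
    [[0.0, 0.0], [0.0, 6.0]],  # distance to left edge
    [[0.0, 0.0], [0.0, 0.0]],  # rotation angle in radians
]]
print(decode(toy_scores, toy_geometry, 0.5))
```

The returned boxes are in the coordinate system of the network input (for example 320×320), so they still need to be scaled by the ratio between the original image size and the input size before drawing.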

Non-Maximum Suppression

We use the OpenCV function NMSBoxes (C++) or NMSBoxesRotated (Python) to filter out redundant, overlapping candidates and get the final predictions.

C++

std::vector<int> indices;
NMSBoxes(boxes, confidences, confThreshold, nmsThreshold, indices);

Python

indices = cv.dnn.NMSBoxesRotated(boxes, confidences, confThreshold, nmsThreshold)
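To show what the suppression step does, here is a minimal greedy NMS in plain Python. It is a simplification: it uses axis-aligned boxes given as (x, y, w, h) and ordinary intersection-over-union, whereas NMSBoxesRotated works on rotated rectangles. The function names are made up for this illustration.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned (x, y, w, h) boxes."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    inter_w = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    inter_h = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, confidences, conf_threshold, nms_threshold):
    """Return indices of boxes kept after score filtering and greedy NMS."""
    order = sorted(
        (i for i, c in enumerate(confidences) if c >= conf_threshold),
        key=lambda i: confidences[i],
        reverse=True,
    )
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(boxes[i], boxes[k]) <= nms_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (50, 50, 10, 10)]
confidences = [0.9, 0.8, 0.7]
print(greedy_nms(boxes, confidences, 0.5, 0.4))  # [0, 2]
```

The second box overlaps the first with an IoU of about 0.68, which is above the 0.4 threshold, so only the higher-scoring one survives; the third box does not overlap anything and is kept.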

Results

Given below are a few results.

Result of non maxima suppression on an image of traffic stop sign.
Result of non maxima suppression on an image of traffic instruction at hospital entry/exit point.
Result of non maxima suppression on an image with multiple bank cards.
Result of non maxima suppression on 4 wheeler image, registration info and text detected highlighted.
Result of non maxima suppression on traffic stop sign.
Result of non maxima suppression on traffic sign.
Result of non maxima suppression on road condition caution sign.
Result of non maxima suppression on image with multiple number plates.
Result of non maxima suppression on image of parking lot, person and parked 2 wheelers.

As you can see, the detector handles text with varying backgrounds, fonts, orientations, sizes, and colors. The last example worked pretty well, even for deformed text. There are some misdetections, but overall it performs very well.

As the examples suggest, it can be used in a wide variety of applications, such as number plate detection, traffic sign detection, and detection of text on ID cards.

References

  1. EAST: An Efficient and Accurate Scene Text Detector
  2. Tensorflow Implementation
  3. OpenCV Samples [C++], [Python]

