EAST Scene Text Detection in Practice (EAST: An Efficient and Accurate Scene Text Detector)

阿新 • Published: 2019-01-14

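Text in natural scenes is an important carrier of high-level image semantics, and in recent years text detection and recognition in natural scene images have drawn growing attention. The successive ICDAR competitions in particular have steadily pushed scores in this field upward. For example, the results listed at http://rrc.cvc.uab.es/?ch=4&com=evaluation&task=1&gtv=1 include entries above 90%. Several large companies also expose this capability through their AI platforms; Baidu AI's SDK, for instance, already includes it, and it looks very impressive.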

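Scene text detection is a core module in image processing, and it is an area I had long wanted to explore. I happened to come across a CVPR 2017 paper from Megvii (曠視), EAST: An Efficient and Accurate Scene Text Detector. The code is openly available, so I studied it and ran some tests.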

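As the title says, the method aims to be efficient. The efficiency comes mainly from eliminating intermediate steps: its pipeline corresponds to part (e) of the figure below and has noticeably fewer stages than the pipelines shown above it. This is similar in spirit to last year's classic CTPN architecture. CTPN, however, only supports horizontal text, whereas the EAST paper states that it can localize multi-oriented text.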

The architecture used in the paper is shown below:

The details of this architecture cover several parts:

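(1) "The algorithm follows the general design of DenseBox [9], in which an image is fed into the FCN and multiple channels of pixel-level text score map and geometry are generated." As this sentence from the paper shows, the design follows DenseBox: an FCN takes the image and outputs, across multiple channels, a pixel-level text score map together with the geometry.

(2) The paper uses two kinds of geometry, rotated box (RBOX) and quadrangle (QUAD). With these two representations, text in multi-oriented scenes can be detected.

(3) Locality-Aware NMS is used to filter the generated geometries, which is why the code includes the lanms (C++) module.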
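To make point (3) more concrete, here is a minimal sketch of the locality-aware idea: candidates are scanned in row-first order, neighbours that overlap strongly are merged by score-weighted averaging, and only the much smaller merged set goes through standard NMS. This is not the lanms implementation. It assumes axis-aligned boxes of the form [x1, y1, x2, y2, score] to keep the IoU computation short (the real code works on polygons), and the function names and the threshold of 0.3 are illustrative choices.

```python
import numpy as np


def iou(a, b):
    """IoU of two axis-aligned boxes given as [x1, y1, x2, y2, score]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def weighted_merge(g, p):
    """Average the coordinates weighted by score; the scores are summed."""
    merged = np.empty(5, dtype=np.float64)
    merged[:4] = (g[4] * g[:4] + p[4] * p[:4]) / (g[4] + p[4])
    merged[4] = g[4] + p[4]
    return merged


def standard_nms(boxes, thresh):
    """Greedy NMS: keep a box only if it does not overlap a kept, higher-scored box."""
    keep = []
    for b in sorted(boxes, key=lambda box: box[4], reverse=True):
        if all(iou(b, k) <= thresh for k in keep):
            keep.append(b)
    return keep


def locality_aware_nms(boxes, thresh=0.3):
    """Merge consecutive overlapping boxes first, then run standard NMS.

    `boxes` is assumed to be in row-first order, so boxes predicted by
    neighbouring pixels are adjacent in the list.
    """
    merged, prev = [], None
    for g in np.asarray(boxes, dtype=np.float64):
        if prev is not None and iou(g, prev) > thresh:
            prev = weighted_merge(g, prev)
        else:
            if prev is not None:
                merged.append(prev)
            prev = g
    if prev is not None:
        merged.append(prev)
    return standard_nms(merged, thresh)
```

Because dense predictions from neighbouring pixels are highly correlated, the merge pass usually collapses thousands of candidates into a handful before the quadratic NMS step runs, which is where the speedup comes from.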

The source code implementing the model is as follows:

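# Imports assumed for this excerpt (TensorFlow 1.x with TF-Slim).
# The module path for resnet_v1 depends on the project layout.
# FLAGS, mean_image_subtraction and unpool are defined elsewhere in the project.
import numpy as np
import tensorflow as tf
from tensorflow.contrib import slim
from nets import resnet_v1


def model(images, weight_decay=1e-5, is_training=True):
    '''
    define the model, we use slim's implementation of resnet
    '''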
    images = mean_image_subtraction(images)

    with slim.arg_scope(resnet_v1.resnet_arg_scope(weight_decay=weight_decay)):
        logits, end_points = resnet_v1.resnet_v1_50(images, is_training=is_training, scope='resnet_v1_50')

    with tf.variable_scope('feature_fusion', values=[end_points.values]):
        batch_norm_params = {
            'decay': 0.997,
            'epsilon': 1e-5,
            'scale': True,
            'is_training': is_training
        }
        with slim.arg_scope([slim.conv2d],
                            activation_fn=tf.nn.relu,
                            normalizer_fn=slim.batch_norm,
                            normalizer_params=batch_norm_params,
                            weights_regularizer=slim.l2_regularizer(weight_decay)):
            f = [end_points['pool5'], end_points['pool4'],
                 end_points['pool3'], end_points['pool2']]
            for i in range(4):
                print('Shape of f_{} {}'.format(i, f[i].shape))
            g = [None, None, None, None]
            h = [None, None, None, None]
            num_outputs = [None, 128, 64, 32]
            for i in range(4):
                if i == 0:
                    h[i] = f[i]
                else:
                    c1_1 = slim.conv2d(tf.concat([g[i-1], f[i]], axis=-1), num_outputs[i], 1)
                    h[i] = slim.conv2d(c1_1, num_outputs[i], 3)
                if i <= 2:
                    g[i] = unpool(h[i])
                else:
                    g[i] = slim.conv2d(h[i], num_outputs[i], 3)
                print('Shape of h_{} {}, g_{} {}'.format(i, h[i].shape, i, g[i].shape))

            # here we use a slightly different way for the regression part:
            # we first use a sigmoid to limit the regression range, and
            # the same is done for the angle map
            F_score = slim.conv2d(g[3], 1, 1, activation_fn=tf.nn.sigmoid, normalizer_fn=None)
            # 4 channel of axis aligned bbox and 1 channel rotation angle
            geo_map = slim.conv2d(g[3], 4, 1, activation_fn=tf.nn.sigmoid, normalizer_fn=None) * FLAGS.text_scale
            angle_map = (slim.conv2d(g[3], 1, 1, activation_fn=tf.nn.sigmoid, normalizer_fn=None) - 0.5) * np.pi/2 # angle is between [-45, 45]
            F_geometry = tf.concat([geo_map, angle_map], axis=-1)

    return F_score, F_geometry
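The merging loop is easier to follow with concrete sizes. The sketch below only simulates the spatial bookkeeping of that loop in plain Python. It rests on two assumptions that the listing itself does not state: that pool5, pool4, pool3 and pool2 are downsampled by factors of 32, 16, 8 and 4 relative to the input, and that unpool doubles the spatial size (which the channel-wise concat with the next feature map requires). The function name and the dictionary layout are made up for illustration.

```python
def trace_merge_sizes(input_size, strides=(32, 16, 8, 4), num_outputs=(None, 128, 64, 32)):
    """Return the spatial side length of h[i] and g[i] for each merge stage.

    `strides` are assumed downsampling factors for pool5..pool2.
    `num_outputs` mirrors the list in the model; None means the stage
    simply passes the backbone feature through.
    """
    f_sizes = [input_size // s for s in strides]
    stages = []
    g_size = None
    for i, f_size in enumerate(f_sizes):
        if i > 0:
            # tf.concat along the channel axis needs equal spatial sizes
            assert g_size == f_size, 'unpooled map must match the next feature map'
        h_size = f_size
        # stages 0..2 are unpooled (assumed 2x); the last stage is only convolved
        g_size = h_size * 2 if i <= 2 else h_size
        stages.append({'stage': i, 'h': h_size, 'g': g_size, 'h_channels': num_outputs[i]})
    return stages


if __name__ == '__main__':
    for stage in trace_merge_sizes(512):
        print(stage)
```

Under these assumptions, a 512x512 input ends with a 128x128 map at g[3], so the score map and the geometry map are predicted at one quarter of the input resolution.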

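The model function shows how the geometry map is generated: the score map, the four distance channels, and the angle channel are all predicted from the last merged feature map, g[3].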
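As a rough illustration of what the five geometry channels mean at a single pixel, the sketch below repeats the output scaling from the model (sigmoid times text_scale for the distances, a shifted sigmoid times pi/2 for the angle) and then turns one RBOX prediction into four corner points. Several things here are assumptions rather than facts from the listing: text_scale is taken to be 512, the four distance channels are taken to be ordered (top, right, bottom, left), and a positive angle is applied with the standard rotation matrix about the pixel. The project's actual restore code may use different conventions.

```python
import numpy as np

TEXT_SCALE = 512  # assumed value of FLAGS.text_scale


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def scale_outputs(dist_logits, angle_logit, text_scale=TEXT_SCALE):
    """Mirror the model's scaling: distances in (0, text_scale), angle in (-pi/4, pi/4)."""
    distances = sigmoid(np.asarray(dist_logits, dtype=np.float64)) * text_scale
    angle = (sigmoid(float(angle_logit)) - 0.5) * np.pi / 2
    return distances, angle


def rbox_to_corners(x, y, distances, angle):
    """Convert one pixel's RBOX prediction into a 4x2 array of corner points.

    Assumes distances = (top, right, bottom, left) measured from pixel (x, y)
    to the box edges, and a rotation about that pixel by `angle` radians.
    Corners are returned as top-left, top-right, bottom-right, bottom-left.
    """
    top, right, bottom, left = distances
    local = np.array([[-left, -top],
                      [right, -top],
                      [right, bottom],
                      [-left, bottom]], dtype=np.float64)
    c, s = np.cos(angle), np.sin(angle)
    rotation = np.array([[c, -s], [s, c]])
    return local @ rotation.T + np.array([x, y], dtype=np.float64)
```

The sigmoid keeps every distance positive and bounded by text_scale, and it keeps the angle strictly inside plus or minus 45 degrees, which matches the comment in the model code.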

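Experiments:

Since the source code has been released, I ran some tests. The results are as follows:

(1) On ICDAR-related datasets, a fair share of the detections look reasonable, which lends some credibility to the roughly 80% reported in the paper.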