Hand gesture recognition in complex background

Wang Yin1,2, Chen Yunlong1,2, Sun Qianlai1,2(1.College of Electronic Information Engineering, Taiyuan University of Science and Technology, Taiyuan 030024, China; 2.Shanxi Key Laboratory of Advanced Control and Equipment Intelligence, Taiyuan 030024, China)

Abstract
Objective Gesture recognition is a hot topic in human-computer interaction. To address the low recognition rate of traditional gesture recognition methods in complex backgrounds and the long detection time of existing deep-learning-based methods, a gesture recognition method based on an improved TinyYOLOv3 algorithm is proposed. Method The TinyYOLOv3 backbone network is redesigned and the number of network layers is increased so that the network extracts richer semantic information. Depthwise separable convolution is used in place of standard convolution, and features from different network layers are fused, which reduces the model size while maintaining recognition accuracy. The original bounding box coordinate prediction loss is replaced with the CIoU (complete intersection over union) loss, and a channel attention module is integrated into the feature extraction network, improving localization precision and recognition accuracy. Data augmentation is used to avoid overfitting during training, and hyperparameter optimization and anchor box clustering are used to accelerate network convergence. Result The improved network achieves a recognition accuracy of 99.1% with a model size of 27.6 MB. Compared with the original TinyYOLOv3, accuracy is improved by 8.5% and the model is 5.6 MB smaller. Compared with YOLO (you only look once) v3 and SSD (single shot multibox detector) 300, accuracy is slightly lower, but the model shrinks to about 1/8 and 1/3 of their sizes, respectively. Compared with lightweight networks such as YOLO-lite and MobileNet-SSD, accuracy is improved by 61.12% and 3.11%, respectively. The improved model is also validated on a self-built gesture dataset with complex backgrounds, reaching an accuracy of 97.3%, which demonstrates the feasibility of the proposed algorithm. Conclusion The proposed improved TinyYOLOv3 gesture recognition method achieves high recognition accuracy for gestures in complex backgrounds while outperforming other algorithms in detection speed and model size, and can well satisfy the requirements of embedded devices.
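To give a rough sense of why depthwise separable convolution shrinks the model, the parameter counts of the two layer types can be compared directly. This is a minimal sketch: the layer shapes below are illustrative, not the paper's actual network configuration.

```python
def conv_params(c_in, c_out, k):
    """Parameters of a standard k x k convolution (bias omitted)."""
    return c_in * c_out * k * k

def dsc_params(c_in, c_out, k):
    """Depthwise separable: one k x k depthwise filter per input channel,
    followed by a 1 x 1 pointwise convolution that mixes channels."""
    return c_in * k * k + c_in * c_out

# Example: a 3 x 3 layer with 256 input and 512 output channels.
standard = conv_params(256, 512, 3)   # 1179648 parameters
separable = dsc_params(256, 512, 3)   # 133376 parameters
print(standard, separable, round(standard / separable, 1))
```

For a 3 x 3 kernel the saving approaches a factor of about k * k = 9 as the channel counts grow, which is consistent with the substantial model-size reduction reported in the abstract.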
Keywords
Hand gesture recognition in complex background

Wang Yin1,2, Chen Yunlong1,2, Sun Qianlai1,2(1.College of Electronic Information Engineering, Taiyuan University of Science and Technology, Taiyuan 030024, China;2.Shanxi Key Laboratory of Advanced Control and Equipment Intelligence, Taiyuan 030024, China)

Abstract
Objective The rapid development of artificial intelligence and object detection technology has accelerated the iteration of intelligent devices and promoted the development of related technologies in the field of human-computer interaction. As an important form of body language and a key means of realizing human-computer interaction, gesture recognition has attracted considerable attention. It is simple, efficient, direct, and rich in content; this interaction mode is closer to people's daily behavior and easier to understand. Gesture recognition has broad application prospects in smart homes, virtual reality, sign language recognition, and other fields. It involves a wide range of disciplines, including image processing, ergonomics, machine vision, and deep learning. In addition, owing to the variety of gestures and the complexity of practical application environments, gesture recognition has become a very challenging research topic. Method Traditional vision-based gesture recognition methods mainly use the skin color or skeleton model of the human body to segment gestures and realize gesture classification through manually designed and extracted features. However, the collected RGB images are strongly affected by lighting conditions, skin color, clothing, and background; under backlight, dim light, or dark skin color, segmentation and gesture recognition perform poorly. Because features such as texture and edges are extracted manually, feature omission and misjudgment easily occur during extraction, so the recognition rate is low and robustness is poor in complex backgrounds. In recent years, deep learning has attracted increasing attention because of its robustness and high accuracy.
Convolutional neural network models based on deep learning have gradually replaced traditional manual feature extraction and become the mainstream approach to gesture recognition. Although existing mainstream deep learning methods such as you only look once (YOLO) and single shot multibox detector (SSD) achieve high accuracy in gesture recognition under complex backgrounds, their models are generally large and their detection time is long, which makes real-time detection on embedded devices difficult. Therefore, reducing the complexity of the model and algorithm while ensuring detection accuracy and meeting the real-time requirements of practical applications has become an urgent problem. The TinyYOLOv3 algorithm has the advantages of fast detection and a small model, but its recognition accuracy falls far short of practical requirements. To solve the above problems, this study proposes a gesture recognition method based on an improved TinyYOLOv3 algorithm. The TinyYOLOv3 backbone network is redesigned: convolution with a stride of 2 replaces the original max pooling layers, and the number of network layers is increased to ensure that the network extracts richer semantic information. At the same time, depthwise separable convolution replaces standard convolution, and features from different network layers are fused, which reduces the model size, maintains recognition accuracy, and avoids the loss of feature information caused by deepening the network structure. In the improvement of the loss function, the CIoU (complete intersection over union) loss replaces the original bounding box coordinate prediction loss.
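The CIoU loss mentioned above augments 1 - IoU with a normalized center-distance term and an aspect-ratio consistency term. The following sketch computes it for two axis-aligned boxes; it follows the standard published CIoU formulation rather than the paper's exact training code.

```python
import math

def ciou_loss(box_a, box_b):
    """CIoU loss between two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection and union areas for the IoU term.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    iou = inter / (area_a + area_b - inter)
    # Squared distance between box centers.
    rho2 = ((ax1 + ax2 - bx1 - bx2) ** 2 + (ay1 + ay2 - by1 - by2) ** 2) / 4.0
    # Squared diagonal of the smallest enclosing box.
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    c2 = cw ** 2 + ch ** 2
    # Aspect-ratio consistency term v and its trade-off weight alpha.
    v = (4 / math.pi ** 2) * (math.atan((bx2 - bx1) / (by2 - by1))
                              - math.atan((ax2 - ax1) / (ay2 - ay1))) ** 2
    alpha = v / (1 - iou + v + 1e-9)
    return 1 - iou + rho2 / c2 + alpha * v

print(ciou_loss((0, 0, 2, 2), (0, 0, 2, 2)))  # identical boxes -> 0.0
```

Unlike plain IoU loss, the distance term keeps a useful gradient even when the predicted and ground-truth boxes do not overlap, which is why CIoU tends to converge faster during training.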
The experimental results show that CIoU helps speed up model convergence, reduces training time, and improves accuracy to a certain extent. A channel attention module is integrated into the feature extraction network, and the information of different channels is recalibrated, improving recognition accuracy with only a small increase in parameters. Data augmentation is used to avoid overfitting during training, and hyperparameter optimization, dynamic learning rate scheduling, anchor box clustering, and other methods are used to accelerate network convergence. Result This study uses the NUS-II (National University of Singapore) gesture dataset for verification experiments. The experimental results show that the recognition accuracy of the improved network reaches 99.1%, which is 8.5% higher than that of the original network (TinyYOLOv3, 90.6%), while the model size is reduced from 33.2 MB to 27.6 MB. Compared with YOLOv3, the recognition accuracy of the improved algorithm is slightly lower; however, the detection speed is nearly doubled, the model size is about one-eighth that of YOLOv3, and the number of parameters is reduced by nearly 10 times, verifying the feasibility of the algorithm. Ablation experiments on the different improved modules show that each improvement contributes to the accuracy of the algorithm. By comparing the accuracy and loss curves of TinyYOLOv3, TinyYOLOv3 improved with CIoU, and the proposed algorithm, the advantages of the proposed algorithm in training time and convergence speed are verified. This study also compares the improved algorithm with several classical traditional and deep learning gesture recognition algorithms.
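The channel recalibration described above can be sketched as a squeeze-and-excitation style gate: each channel is summarized by global average pooling, passed through a small two-layer bottleneck, and rescaled by a sigmoid weight. This is a NumPy illustration of the mechanism with randomly initialized weights, not the paper's trained module.

```python
import numpy as np

def se_attention(feature, w1, w2):
    """Channel attention on a (C, H, W) feature map.
    Squeeze: per-channel global average pool. Excite: a small
    bottleneck MLP producing one sigmoid gate per channel."""
    squeeze = feature.mean(axis=(1, 2))              # (C,) channel descriptor
    hidden = np.maximum(0.0, w1 @ squeeze)           # ReLU bottleneck, (C/r,)
    gates = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))     # sigmoid gates, (C,)
    return feature * gates[:, None, None]            # recalibrated channels

rng = np.random.default_rng(0)
c, r = 8, 4                                          # channels, reduction ratio
feat = rng.standard_normal((c, 16, 16))
w1 = rng.standard_normal((c // r, c)) * 0.1          # squeeze weights (hypothetical)
w2 = rng.standard_normal((c, c // r)) * 0.1          # excite weights (hypothetical)
out = se_attention(feat, w1, w2)
print(out.shape)  # (8, 16, 16)
```

Because the module only adds two small fully connected layers per insertion point, the parameter overhead is minor, matching the abstract's claim of improved accuracy with little parameter growth.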
In terms of model size, detection time, and accuracy, the proposed algorithm achieves better results. Conclusion Gesture recognition in complex backgrounds is a key and difficult problem in the field of gesture recognition. To address the low recognition rate of traditional methods in complex backgrounds and the long detection time of existing deep-learning-based methods, a gesture recognition method based on an improved TinyYOLOv3 algorithm is proposed in this study. The network structure, loss function, feature channels, and anchor box clustering are improved. Depthwise separable convolution makes it possible to deepen the network while reducing the number of parameters. Deepening the network structure and optimizing the feature channels enable the network to extract more effective semantic information and improve detection performance. The improved network not only maintains accuracy but also balances model size against detection time, and can meet the requirements of embedded devices.
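The anchor box clustering mentioned in the abstract is conventionally done with k-means over box widths and heights, using 1 - IoU rather than Euclidean distance so that large and small boxes are treated fairly. The sketch below, with a toy dataset and deterministic initialization, illustrates the technique; the paper's actual anchors come from its own training data.

```python
def iou_wh(wh, anchor):
    """IoU of two boxes that share a corner (widths/heights only)."""
    iw = min(wh[0], anchor[0])
    ih = min(wh[1], anchor[1])
    inter = iw * ih
    return inter / (wh[0] * wh[1] + anchor[0] * anchor[1] - inter)

def kmeans_anchors(boxes, k, iters=50):
    """k-means over (w, h) pairs with 1 - IoU as the distance,
    as popularized for selecting YOLO prior boxes."""
    anchors = boxes[:k]                      # deterministic init for the demo
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for wh in boxes:                     # assign each box to nearest anchor
            best = min(range(k), key=lambda i: 1 - iou_wh(wh, anchors[i]))
            clusters[best].append(wh)
        anchors = [                          # recompute anchors as cluster means
            (sum(w for w, _ in c) / len(c), sum(h for _, h in c) / len(c))
            if c else anchors[i]
            for i, c in enumerate(clusters)
        ]
    return sorted(anchors)

# Toy data with one cluster of small boxes and one of large boxes.
boxes = [(10, 12), (11, 10), (50, 60), (55, 58)]
print(kmeans_anchors(boxes, 2))  # -> [(10.5, 11.0), (52.5, 59.0)]
```

Anchors that match the dataset's box-shape distribution give the detector better initial predictions, which is one of the ways the abstract's convergence speedup is achieved.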
Keywords
