A Recognition Model for Violent Sorting Activity Based on the ST-AGCN Algorithm
Abstract: Rough, violent handling of parcels by sorting personnel is widespread in the express logistics industry. Image-based behavior recognition can help reduce such behavior, but in practical settings it suffers from poor algorithm robustness and the difficulty of obtaining human joint data. To address these problems, a video dataset of violent sorting behavior in logistics is constructed and a recognition model for such behavior is studied. Sorting videos are collected with a Raspberry Pi in both indoor and outdoor scenarios, real-time image transmission is implemented with the Python socket module, non-standard data are removed by slice-based screening rules, and joint data are extracted with the OpenPose model. Because general human behavior recognition networks cannot adequately reflect how strongly individual joints influence violent sorting actions, an optimized graph neural network, ST-AGCN, is developed with ST-GCN as its backbone. A spatial attention mechanism learns the influence of each joint on different actions and updates the joint weights accordingly, and an adaptive graph structure layer jointly optimizes the topology of the human skeleton graph together with the network parameters in an end-to-end manner, highlighting the contribution of strongly correlated joints to action recognition. Comparative and ablation experiments against several deep learning models are conducted on the indoor and outdoor violent sorting videos. The results show that the accuracy of ST-AGCN in recognizing violent sorting behavior in real scenes is 5.6, 13.82, 2.36, and 1.61 percentage points higher than that of ST-GCN, STA-LSTM, ST-AGCN without the spatial attention mechanism, and ST-AGCN without the adaptive graph structure layer, respectively. The model also remains applicable to complex logistics sorting scenes with cluttered indoor and outdoor environments and partial occlusion, verifying the superiority of ST-AGCN and the effectiveness of the spatial attention mechanism and the adaptive graph structure layer.
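The data collection pipeline streams frames from the Raspberry Pi to a workstation over the Python socket module. As a rough illustration of that step only, the following is a minimal sketch, not the authors' code: it assumes frames are already JPEG-encoded bytes, and length-prefixes each one over TCP so the receiver can split the byte stream back into frames; all function names are illustrative.

```python
import socket
import struct

def send_frames(host, port, frames):
    # Sender side (e.g., on the Raspberry Pi): prefix every JPEG frame
    # with a 4-byte big-endian length so frame boundaries survive TCP.
    with socket.create_connection((host, port)) as conn:
        for jpeg_bytes in frames:
            conn.sendall(struct.pack(">I", len(jpeg_bytes)))
            conn.sendall(jpeg_bytes)

def _recv_exact(conn, n):
    # recv() may return fewer bytes than requested; loop until n bytes arrive.
    buf = b""
    while len(buf) < n:
        chunk = conn.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed mid-frame")
        buf += chunk
    return buf

def recv_frames(port):
    # Receiver side: yield one JPEG frame at a time for pose estimation.
    with socket.create_server(("", port)) as srv:
        conn, _ = srv.accept()
        with conn:
            while True:
                (length,) = struct.unpack(">I", _recv_exact(conn, 4))
                yield _recv_exact(conn, length)
```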
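At the model's core, each ST-AGCN unit couples joint-level spatial attention with an adjacency matrix that is partly learned rather than fixed by the skeleton. The PyTorch sketch below shows one plausible form of such a unit; it illustrates the two mechanisms described in the abstract but is not the paper's implementation, and the class name STAGCNBlock, the one-scalar-per-joint attention, and the layer sizes are all assumptions.

```python
import torch
import torch.nn as nn

class STAGCNBlock(nn.Module):
    # One spatial-temporal unit: joint attention -> graph conv -> temporal conv.
    def __init__(self, in_channels, out_channels, num_joints, A):
        super().__init__()
        self.register_buffer("A", A)  # fixed, normalized skeleton adjacency (V x V)
        # Adaptive graph structure: a learnable residual adjacency B, optimized
        # end to end together with the network weights.
        self.B = nn.Parameter(torch.zeros(num_joints, num_joints))
        # Spatial attention: one learnable scalar per joint, softmax-normalized,
        # used to re-weight joints before neighbor aggregation.
        self.joint_att = nn.Parameter(torch.ones(num_joints))
        self.gcn = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        # Temporal convolution over the frame axis (kernel 9, as in ST-GCN).
        self.tcn = nn.Conv2d(out_channels, out_channels,
                             kernel_size=(9, 1), padding=(4, 0))
        self.relu = nn.ReLU()

    def forward(self, x):
        # x: (N, C, T, V) = batch, channels, frames, joints
        att = torch.softmax(self.joint_att, dim=0) * x.size(-1)  # mean weight ~ 1
        x = x * att.view(1, 1, 1, -1)              # emphasize influential joints
        adj = self.A + self.B                      # fixed topology + learned part
        x = torch.einsum("nctv,vw->nctw", x, adj)  # aggregate over neighbor joints
        x = self.relu(self.gcn(x))                 # mix channels (1x1 conv)
        return self.relu(self.tcn(x))              # convolve along time

# Shape check only: 18 joints as in OpenPose's COCO skeleton, with an identity
# matrix standing in for the real normalized adjacency.
block = STAGCNBlock(3, 64, 18, torch.eye(18))
out = block(torch.randn(8, 3, 100, 18))  # -> (8, 64, 100, 18)
```

Stacking several such units (the depth varied in Table 7) and ending with global pooling and a softmax classifier would yield the full recognition network.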
Table 1. Outdoor violent sorting scenes and corresponding numbers of videos (unit: videos)

Scene           Cluttered environment   Partial occlusion   Incomplete capture   Inside a van
One person      10                      10                  10                   13
Two persons     10                      10                  10                   13
Three persons   10                      10                  10                   13

Table 2. Indoor violent sorting scenes and corresponding numbers of videos (unit: videos)

Scene           Insufficient light   Cluttered environment   Partial occlusion   Incomplete capture
One person      10                   10                      13                  10
Two persons     10                   10                      13                  10
Three persons   10                   10                      13                  10

Table 3. Number of video clips per action type

Action type   Number of videos
Normal        490
Slam          241
Kick          279
Smash         272
Toss          540

Table 4. Results of the comparison experiment

Model       Accuracy/%
STA-LSTM    44.44
ST-GCN      52.66
Shift-GCN   57.22
2s-AGCN     56.46
ST-AGCN     58.26

Table 5. Results of the attention mechanism ablation experiment

Model            Accuracy/%   Average rejection rate/%
ST-AGCN w/o SA   55.90        12.03
ST-AGCN          58.26        10.67

Table 6. Results of the adaptive graph ablation experiment

Model                        Accuracy/%   Average rejection rate/%
ST-AGCN w/o adaptive graph   56.65        11.61
ST-AGCN                      58.26        10.67

Table 7. Results of the unit stacking number ablation experiment

Number of ST-AGCN layers   Accuracy/%   Average rejection rate/%   Time/s
1                          40.36        24.38                      1,312
3                          45.10        19.70                      2,150
5                          52.66        14.65                      4,037
7                          56.46        11.71                      6,808
10                         58.26        10.67                      8,632
12                         52.18        14.31                      10,550

Table 8. Misidentification and rejection rates in field tests (unit: %)

Action type   Misidentification rate   Rejection rate
Toss          16.17                    12.45
Kick          19.00                    12.99
Normal        21.82                    8.61
Smash         19.94                    12.92
Throw         23.08                    6.37