Enhancing Smart City Safety and Utilizing AI Expert Systems for Violence Detection
Figure 1. The proposed attack-alerting system: an image-to-image generation model, an object detection model (YOLO v7), a pose estimation model (MediaPipe), and action classification using an LSTM model.
Figure 2. Generated synthetic image samples from image-to-image stable diffusion.
Figure 3. LSTM model architecture.
Figure 4. The proposed YOLO v7 model: real-time predicted results under different attack-type conditions: (a,d) a detected violent object, a baseball bat; (b) a detected violent object, a gun; (c,e) a detected violent object, a knife.
Figure 5. Violent attack pose estimation using MediaPipe: (a,d) an attacking action with a baseball bat; (b,e) an attacking action with a knife; (c) an attacking action with a gun.
Figure 6. (a) The confusion matrix; (b) the precision-recall curve for the attack-type classes.
Figure 7. YOLO v7 main metrics: training and validation loss and accuracy of the model.
Figure 8. Performance of violence action classification with MediaPipe features and the LSTM.
Abstract
1. Introduction
- Our proposed model handles the small-dataset problem with a stable diffusion image-generation method, in which new image samples are generated from existing images to enlarge the training set and improve the object detection model's performance.
- Our architecture combines a violent-object detection model (YOLO v7), a pose estimation model (MediaPipe), and an LSTM classifier to improve the performance of the violent-attack detection system.
- The whole model is deployed on an edge computing device and tested with violent-attack data collected in the city.
- A commercial social media API is implemented to send the detected violent object and the criminal clip as an alert to a registered number.
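The four contributions above form one detection loop. The sketch below shows how the stages could be chained; the detector, pose estimator, classifier, and alert sender are illustrative stand-in stubs, not the real YOLO v7, MediaPipe, LSTM, or messaging API:

```python
# Stand-in stubs for the trained components; in the real system these
# would wrap YOLO v7 (violent-object detection), MediaPipe (pose
# landmarks), a trained LSTM (action classification), and the
# social-media messaging API.
def detect_objects(frame):
    # returns a list of (label, confidence) detections
    return [("knife", 0.91)] if frame.get("has_weapon") else []

def estimate_pose(frame):
    # returns a flat list of pose-landmark coordinates
    return [0.0] * 66  # 33 MediaPipe landmarks x (x, y)

def classify_action(pose_window):
    # LSTM stand-in: labels a window of pose sequences
    return "attack" if len(pose_window) >= 30 else "normal"

def send_alert(message):
    # placeholder for the alerting API
    print("ALERT:", message)

def run_pipeline(frames, window=30):
    """Detect weapons per frame, accumulate poses, classify the action,
    and raise an alert when a weapon and an attack pose co-occur."""
    poses, alerts = [], 0
    for frame in frames:
        detections = detect_objects(frame)
        poses.append(estimate_pose(frame))
        if len(poses) >= window:
            action = classify_action(poses[-window:])
            if detections and action == "attack":
                label, conf = detections[0]
                send_alert(f"{label} detected ({conf:.2f}), attack action")
                alerts += 1
    return alerts
```

Feeding 30 frames that contain a weapon triggers one alert once the pose window fills; weapon-free frames never do.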
2. Methodology
2.1. Dataset
Image-to-Image Stable Diffusion Pipeline Method
1. Forward diffusion (noising)
2. Reverse diffusion (denoising)
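The forward (noising) step has the standard DDPM closed form x_t = sqrt(alpha_bar_t) x_0 + sqrt(1 - alpha_bar_t) eps. The sketch below implements that formula under a linear beta schedule; it is the textbook formulation, not code from the paper, and the schedule values are illustrative:

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    alpha_bar = np.cumprod(1.0 - betas)  # cumulative signal-keep factor
    eps = rng.standard_normal(x0.shape)  # Gaussian noise
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)   # linear noise schedule
x0 = rng.standard_normal((3, 64, 64))   # a normalized RGB "image"
x_early = forward_diffuse(x0, 10, betas, rng)   # still close to x0
x_late = forward_diffuse(x0, 999, betas, rng)   # nearly pure noise
```

Early timesteps stay highly correlated with the source image, while late ones are essentially noise, which is what lets the reverse (denoising) pass generate new samples that are still guided by the original image.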
2.2. Violence Object Detection Model (YOLO v7)
- Input: In the initial stage of the model, violent images and their corresponding annotations are provided to the algorithm; each input is a 416 × 416 RGB image, and the output is passed to the backbone architecture.
- Backbone: The backbone processes the input images and mainly comprises three modules: MP, E-ELAN, and CBS. The MP module combines MaxPool and CBS operations in a top and a bottom branch. In the top branch, MaxPool halves the image's length and width, and a CBS operation halves the number of channels. In the bottom branch, a CBS operation with a 1 × 1 kernel halves the channels, and a further CBS operation with a 3 × 3 kernel and a stride of 2 halves the image dimensions. Concatenation (Cat) then merges the features extracted from the two branches. CBS gathers information from small local areas, while MaxPool aggregates over localized regions; integrating the two raises the network's capacity to extract useful features from the input images.
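The MP module's shape arithmetic can be sketched as follows. This is a shape-level illustration only: the 1 × 1 convs are modeled as random channel-mixing matmuls and the 3 × 3 stride-2 conv only by its stride-2 spatial effect, not the real trained kernels:

```python
import numpy as np

rng = np.random.default_rng(0)

def maxpool2(x):
    # 2x2 max pooling, stride 2: halves height and width
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

def conv1x1(x, out_c):
    # a 1x1 convolution is a matmul over the channel dimension
    w = rng.standard_normal((out_c, x.shape[0])) * 0.1
    return np.einsum("oc,chw->ohw", w, x)

def strided_conv_stand_in(x, out_c):
    # stand-in for a 3x3, stride-2 conv: stride-2 sampling + channel mix
    return conv1x1(x[:, ::2, ::2], out_c)

def mp_module(x):
    c, h, w = x.shape
    top = conv1x1(maxpool2(x), c // 2)                 # MaxPool branch
    bottom = strided_conv_stand_in(conv1x1(x, c // 2), c // 2)  # CBS branch
    return np.concatenate([top, bottom], axis=0)        # Cat

x = rng.standard_normal((128, 64, 64))
y = mp_module(x)   # (128, 32, 32): channels preserved, spatial halved
```

Each branch halves the channels, so concatenating them restores the original channel count while the spatial resolution is halved.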
- Neck: This section of the YOLO architecture consists of a feature pyramid network (FPN) structure that employs a PAN design. The network is composed of several convolution, batch normalization, and SiLU activation (CBS) blocks, together with spatial pyramid pooling (SPP) and cross stage partial (CSP) structures that improve the layer outputs, and it extends MaxPool-2 (MP2) and efficient layer aggregation network (ELAN) modules. The number of output channels is the same in both MP blocks; the output of the neck is passed to the prediction module.
- Prediction: The prediction stage is the final stage of the detection algorithm and contains a pair of REP structures. Confidence, anchors, and category are predicted using a 1 × 1 convolutional layer. The REP structure is inspired by VGG and Darknet and reduces model complexity without reducing prediction performance.
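The 1 × 1 prediction conv's output width follows directly from what it predicts: per anchor, four box offsets, one objectness confidence, and one score per class. A small sanity check of that arithmetic for this paper's three weapon classes (assuming the usual three anchors per detection scale):

```python
def head_channels(num_anchors, num_classes):
    """Output channels of the 1x1 prediction conv:
    per anchor, 4 box coords + 1 objectness + class scores."""
    return num_anchors * (4 + 1 + num_classes)

# Three anchors per scale, three classes (baseball bat, gun, knife)
channels = head_channels(num_anchors=3, num_classes=3)  # 3 * (5 + 3) = 24
```

For comparison, the same formula gives the familiar 255-channel head for the 80-class COCO setting.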
2.3. Hyperparameter of Model
2.4. Violent Pose Estimation Model
2.5. Violent Pose Classification Model
2.6. Edge Computing Device and Attack Alerting Method
3. Results
3.1. Detection of YOLO v7 Model and Pose Estimation Model
3.2. Performance Metrics of Model
4. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Baba, M.; Gui, V.; Cernazanu, C.; Pescaru, D. A Sensor Network Approach for Violence Detection in Smart Cities Using Deep Learning. Sensors 2019, 19, 1676. [Google Scholar] [CrossRef] [PubMed]
- Bai, T.; Fu, S.; Yang, Q. Privacy-Preserving Object Detection with Secure Convolutional Neural Networks for Vehicular Edge Computing. Future Internet 2022, 14, 316. [Google Scholar] [CrossRef]
- Ali, S.A.; Elsaid, S.A.; Ateya, A.A.; ElAffendi, M.; El-Latif, A.A.A. Enabling Technologies for Next-Generation Smart Cities: A Comprehensive Review and Research Directions. Future Internet 2023, 15, 398. [Google Scholar] [CrossRef]
- Ullah, F.U.M.; Ullah, A.; Muhammad, K.; Haq, I.U.; Baik, S.W. Violence Detection Using Spatiotemporal Features with 3D Convolutional Neural Network. Sensors 2019, 19, 2472. [Google Scholar] [CrossRef] [PubMed]
- Aremu, T.; Zhiyuan, L.; Alameeri, R.; Khan, M.; Saddik, A.E. SSIVD-Net: A novel salient super image classification & detection technique for weaponized violence. arXiv 2022, arXiv:2207.12850. [Google Scholar]
- Jebur, S.A.; Hussein, K.A.; Hoomod, H.K.; Alzubaidi, L. Novel Deep Feature Fusion Framework for Multi-Scenario Violence Detection. Computers 2023, 12, 175. [Google Scholar] [CrossRef]
- Vosta, S.; Yow, K.-C. A CNN-RNN Combined Structure for Real-World Violence Detection in Surveillance Cameras. Appl. Sci. 2022, 12, 1021. [Google Scholar] [CrossRef]
- Alrashedy, H.H.N.; Almansour, A.F.; Ibrahim, D.M.; Hammoudeh, M.A.A. BrainGAN: Brain MRI Image Generation and Classification Framework Using GAN Architectures and CNN Models. Sensors 2022, 22, 4297. [Google Scholar] [CrossRef]
- Nichol, A.Q.; Dhariwal, P.; Ramesh, A.; Shyam, P.; Mishkin, P.; Mcgrew, B.; Sutskever, I.; Chen, M. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. In Proceedings of the International Conference on Machine Learning, PMLR, Baltimore, MD, USA, 17–23 July 2022; pp. 16784–16804. [Google Scholar]
- Kawar, B.; Zada, S.; Lang, O.; Tov, O.; Chang, H.; Dekel, T.; Mosseri, I.; Irani, M. Imagic: Text-based real image editing with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 6007–6017. [Google Scholar]
- Avrahami, O.; Lischinski, D.; Fried, O. Blended diffusion for text-driven editing of natural images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 18208–18218. [Google Scholar]
- Borji, A. Generated faces in the wild: Quantitative comparison of Stable Diffusion, Midjourney and DALL-E 2. arXiv 2022, arXiv:2204.06125. [Google Scholar]
- Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695. [Google Scholar]
- Xin, Y.; Kong, L.; Liu, Z.; Chen, Y.; Li, Y.; Zhu, H.; Gao, M.; Hou, H.; Wang, C. Machine learning and deep learning methods for cybersecurity. IEEE Access 2018, 6, 35365–35381. [Google Scholar] [CrossRef]
- Khan, S.U.; Haq, I.U.; Rho, S.; Baik, S.W.; Lee, M.Y. Cover the Violence: A Novel Deep-Learning-Based Approach Towards Violence-Detection in Movies. Appl. Sci. 2019, 9, 4963. [Google Scholar] [CrossRef]
- Maity, M.; Banerjee, S.; Sinha, C.S. Faster R-CNN and YOLO based Vehicle detection: A Survey. In Proceedings of the 5th International Conference on Computing Methodologies and Communication (ICCMC), Erode, India, 8–10 April 2021; pp. 1442–1447. [Google Scholar]
- Liu, K.; Tang, H.; He, S.; Yu, Q.; Xiong, Y.; Wang, N. Performance validation of YOLO variants for object detection. In Proceedings of the 2021 International Conference on Bioinformatics and Intelligent Computing, Harbin, China, 22–24 January 2021; pp. 239–243. [Google Scholar]
- Hussain, M. YOLO-v1 to YOLO-v8: The Rise of YOLO and Its Complementary Nature toward Digital Manufacturing and Industrial Defect Detection. Machines 2023, 11, 677. [Google Scholar] [CrossRef]
- Chen, D.; Ju, Y. SAR ship detection based on improved YOLOv3. In Proceedings of the IET International Radar Conference (IET IRC 2020), Online, 4–6 November 2020; pp. 929–934. [Google Scholar]
- Li, Y.; Zhao, Z.; Luo, Y.; Qiu, Z. Real-Time Pattern-Recognition of GPR Images with YOLO v3 Implemented by Tensorflow. Sensors 2020, 20, 6476. [Google Scholar] [CrossRef] [PubMed]
- Wahyutama, A.B.; Hwang, M. YOLO-Based Object Detection for Separate Collection of Recyclables and Capacity Monitoring of Trash Bins. Electronics 2022, 11, 1323. [Google Scholar] [CrossRef]
- Zhou, F.; Deng, H.; Xu, Q.; Lan, X. CNTR-YOLO: Improved YOLOv5 Based on ConvNext and Transformer for Aircraft Detection in Remote Sensing Images. Electronics 2023, 12, 2671. [Google Scholar] [CrossRef]
- Xiao, Y.; Chang, A.; Wang, Y.; Huang, Y.; Yu, J.; Huo, L. Real-time Object Detection for Substation Security Early-warning with Deep Neural Network based on YOLO-V5. In Proceedings of the IEEE IAS Global Conference on Emerging Technologies (GlobConET), Arad, Romania, 20–22 May 2022; pp. 45–50. [Google Scholar]
- Fan, L.; Rao, H.; Yang, W. 3D Hand Pose Estimation Based on Five-Layer Ensemble CNN. Sensors 2021, 21, 649. [Google Scholar] [CrossRef]
- Luvizon, D.C.; Picard, D.; Tabia, H. 2D/3D Pose Estimation and Action Recognition Using Multitask Deep Learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 5137–5146. [Google Scholar]
- Fang, H.S.; Li, J.; Tang, H.; Xu, C.; Zhu, H.; Xiu, Y.; Li, Y.L.; Lu, C. AlphaPose: Whole-Body Regional Multi-Person Pose Estimation and Tracking in Real-Time. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 7157–7173. [Google Scholar] [CrossRef]
- Yu, C.; Xiao, B.; Gao, C.; Yuan, L.; Zhang, L.; Sang, N.; Wang, J. Lite-HRNet: A Lightweight High-Resolution Network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 10440–10450. [Google Scholar]
- Guns-Knives Object Detection Dataset. Available online: https://www.kaggle.com/datasets/iqmansingh/guns-knives-object-detection (accessed on 14 June 2023).
- Baseball Bat Dataset. Available online: https://images.cv/dataset/baseball-bat-image-classification-dataset (accessed on 14 June 2023).
- Narejo, S.; Pandey, B.; Esenarro Vargas, D.; Rodriguez, C.; Anjum, M.R. Weapon Detection Using YOLO V3 for Smart Surveillance System. Math. Probl. Eng. 2021, 2021, 9975700. [Google Scholar] [CrossRef]
- Song, J.; Meng, C.; Ermon, S. Denoising Diffusion Implicit Models. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 30 April 2022; pp. 26–30. [Google Scholar]
- Gu, S.; Chen, D.; Bao, J.; Wen, F.; Zhang, B.; Chen, D.; Yuan, L.; Guo, B. Vector quantized diffusion model for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10696–10706. [Google Scholar]
- Saharia, C.; Chan, W.; Saxena, S.; Li, L.; Whang, J.; Denton, E.L.; Ghasemipour, K.; Gontijo Lopes, R.; Karagol Ayan, B.; Salimans, T.; et al. Photorealistic text-to-image diffusion models with deep language understanding. In Advances in Neural Information Processing Systems (NeurIPS); Red Hook Inc.: Brooklyn, NY, USA, 2022; Volume 35, pp. 36479–36494. [Google Scholar]
- Hemmatirad, K.; Babaie, M.; Afshari, M.; Maleki, D.; Saiadi, M.; Tizhoosh, H.R. Quality Control of Whole Slide Images using the YOLO Concept. In Proceedings of the IEEE 10th International Conference on Healthcare Informatics (ICHI), Rochester, MN, USA, 11–14 June 2022; pp. 282–287. [Google Scholar]
- Wang, Y.; Wang, H.; Xin, Z. Efficient Detection Model of Steel Strip Surface Defects Based on YOLO-V7. IEEE Access 2022, 10, 133936–133944. [Google Scholar] [CrossRef]
- Liu, K.; Sun, Q.; Sun, D.; Peng, L.; Yang, M.; Wang, N. Underwater Target Detection Based on Improved YOLOv7. J. Mar. Sci. Eng. 2023, 11, 677. [Google Scholar]
- Kumar, P.; Shih, G.-L.; Yao, C.-K.; Hayle, S.T.; Manie, Y.C.; Peng, P.-C. Intelligent Vibration Monitoring System for Smart Industry Utilizing Optical Fiber Sensor Combined with Machine Learning. Electronics 2023, 12, 4302. [Google Scholar] [CrossRef]
- Chen, K.-Y.; Shin, J.; Hasan, M.A.M.; Liaw, J.-J.; Yuichi, O.; Tomioka, Y. Fitness Movement Types and Completeness Detection Using a Transfer-Learning-Based Deep Neural Network. Sensors 2022, 22, 5700. [Google Scholar] [CrossRef]
- MediaPipe: Pose Landmark Detection Guide. Available online: https://developers.google.com/mediapipe (accessed on 14 November 2023).
- Zeng, Y.; Ye, W.; Stutheit-Zhao, E.Y.; Han, M.; Bratman, S.V.; Pugh, T.J.; He, H.H. MEDIPIPE: An automated and comprehensive pipeline for cfMeDIP-seq data quality control and analysis. Bioinformatics 2023, 39, btad423. [Google Scholar] [CrossRef]
- Staudemeyer, R.C.; Morris, E.R. Understanding LSTM—A Tutorial into Long Short-Term Memory Recurrent Neural Networks. arXiv 2019, arXiv:1909.09586. [Google Scholar]
- Zhou, C.; Sun, C.; Liu, Z.; Lau, F.C.M. A C-LSTM Neural Network for Text Classification. arXiv 2015, arXiv:1511.08630. [Google Scholar]
- Ghourabi, A.; Mahmood, M.A.; Alzubi, Q.M. A Hybrid CNN-LSTM Model for SMS Spam Detection in Arabic and English Messages. Future Internet 2020, 12, 156. [Google Scholar] [CrossRef]
- Mittal, S. A Survey on Optimized Implementation of Deep Learning Models on the NVIDIA Jetson Platform. J. Syst. Archit. 2019, 97, 428–442. [Google Scholar] [CrossRef]
- Shi, Z. Optimized Yolov3 Deployment on Jetson TX2 with Pruning and Quantization. In Proceedings of the 2021 IEEE 3rd International Conference on Frontiers Technology of Information and Computer (ICFTIC), Greenville, SC, USA, 12–14 November 2021; pp. 62–65. [Google Scholar]
- Chumuang, N.; Hiranchan, S.; Ketcham, M.; Yimyam, W.; Pramkeaw, P.; Tangwannawit, S. Developed Credit Card Fraud Detection Alert Systems via the Notification of LINE Application. In Proceedings of the 2020 15th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP), Bangkok, Thailand, 18–20 November 2020; pp. 1–6. [Google Scholar]
- Kumar, P.; Li, C.-Y.; Guo, B.-L.; Manie, Y.C.; Yao, C.-K.; Peng, P.-C. Detection of Acrimonious Attacks using Deep Learning Techniques and Edge Computing Devices. In Proceedings of the 2023 International Conference on Consumer Electronics—Taiwan (ICCE-Taiwan), Pingtung, Taiwan, 9–11 July 2023; pp. 407–408. [Google Scholar]
- Tang, Y.; Chen, Y.; Sharifuzzaman, S.A.; Li, T. An automatic fine-grained violence detection system for animation based on modified faster R-CNN. Expert Syst. Appl. 2024, 237, 121691. [Google Scholar] [CrossRef]
- Tufail, H.; Nazeef, U.H.; Muhammad, F.; Muhammad, S. Application of Deep Learning for Weapons Detection in Surveillance Videos. In Proceedings of the 2021 International Conference on Digital Futures and Transformative Technologies (ICoDT2), Islamabad, Pakistan, 20–21 May 2021; pp. 1–6. [Google Scholar]
| Class Label | Number of Images (Real + Generated) |
|---|---|
| Baseball bat | 700 + 300 |
| Gun | 700 + 300 |
| Knife | 700 + 300 |
| Total | 3000 |
| Training size | 2100 (70%) |
| Validation size | 900 (30%) |
| Parameter | Value |
|---|---|
| Learning rate | 1 × 10^−5 |
| Momentum | 0.98 |
| Weight decay | 0.001 |
| Batch size | 16 |
| Optimizer | Adam |
| Input dimensions | 416 × 416 |
| Epochs | 200 |
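A minimal numpy sketch of one optimizer update with the table's learning rate, weight decay, and momentum (here mapped to Adam's beta1, which is an assumption; decoupled, AdamW-style weight decay is also a simplification, and the exact variant used by the training framework may differ):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-5, beta1=0.98, beta2=0.999,
              eps=1e-8, weight_decay=0.001):
    """One Adam update with decoupled weight decay (AdamW-style)."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps)
                          + weight_decay * theta)
    return theta, m, v

theta = np.array([1.0, -2.0])
m = v = np.zeros_like(theta)
grad = np.array([0.5, -0.5])
theta, m, v = adam_step(theta, grad, m, v, t=1)
```

On the first step the bias-corrected update is roughly lr times the sign of the gradient plus the weight-decay pull toward zero, so each parameter moves opposite its gradient.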
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Kumar, P.; Shih, G.-L.; Guo, B.-L.; Nagi, S.K.; Manie, Y.C.; Yao, C.-K.; Arockiyadoss, M.A.; Peng, P.-C. Enhancing Smart City Safety and Utilizing AI Expert Systems for Violence Detection. Future Internet 2024, 16, 50. https://doi.org/10.3390/fi16020050