RESEARCH ON THE EFFECTIVENESS OF DEEP LEARNING METHODS IN IMAGE AND VIDEO PROCESSING TASKS
26.11.2023 12:05
[1. Информационные системы и технологии]
Автор: Yurii Dmytrovych Ihnatchuk, Computer Science Department, Lviv Polytechnic National University, Lviv, Ukraine
Abstract — In recent decades, deep learning has become an important branch of artificial intelligence and computer vision, contributing significantly to the improvement of automatic object recognition, image segmentation, video analysis, and many other aspects of visual information processing. This paper reviews different neural network architectures, including convolutional networks (CNNs), recurrent neural networks (RNNs), and combinatorial models, and investigates their performance in image and video processing tasks. The authors offer conclusions based on the results of experiments and comparisons with existing methods. In addition, the paper considers the practical implementation of deep learning in large volumes of data and on different computing platforms, taking into account the requirements for speed and accuracy. The research presented in this article is aimed at developing the scientific foundations of deep learning in the context of image and video processing, as well as at the practical implementation of the obtained results in various fields, including medicine, autonomous transport, video surveillance and many others. The results of this research open up new opportunities for improving and automating visual data processing processes and ensuring their effective analytics.
Keywords— neural network, video and image processing, machine learning.
I.INTRODUCTION
In recent decades, deep learning has set a new standard in image and video processing, superseding traditional methods and providing significant improvements in accuracy and speed of information processing. This technology has become an important tool in areas where the analysis of large volumes of visual data is required to achieve high levels of performance and accuracy.
Image and video processing tasks seemed extremely challenging because of the variety of scenes, objects, and contexts that can be encountered in the real world. However, thanks to deep learning and in particular convolutional neural networks (CNNs), significant advances have been made in various aspects of visual information processing.
Research concerning the effectiveness of deep learning methods in the context of image and video processing is becoming more and more relevant, as these techniques open new opportunities for automating the processes of analysis and understanding of visual data. Simplifying the tasks of segmentation, classification, object detection, and video compression through deep learning has led to widespread applications in fields such as medicine, computer vision, video surveillance, robotics, and many others.
This paper is dedicated to studying the effectiveness and application of deep learning techniques in various aspects of image and video processing. We will discuss the key achievements, challenges and prospects of using deep learning in the above tasks and highlight its practical value in today's world, where visual information is becoming increasingly important for decision making and solving complex tasks.
II. LITERATURE REVIEW
Image and video processing tasks use different deep learning methods, mainly using RNN and CNN, or hybrid models. This literature review discusses the approaches described in the image and video processing articles and their effectiveness.
In publication [1], a comprehensive overview is presented, highlighting the recent advancements in clinical applications of deep learning techniques based on Convolutional Neural Networks (CNN). These techniques are applied to various tasks, including image classification, object detection, segmentation, and registration within the medical field. A more detailed examination is conducted on image analysis-based diagnostic applications across four major systems of the human body: the nervous system, the cardiovascular system, the digestive system, and the skeletal system.
Article [2] provides an overview of deep learning techniques used in video compression. It discusses the challenges associated with video compression and the potential benefits of using deep learning for this task. The paper also describes various deep learning architectures used for video compression, including autoencoders, convolutional neural networks (CNNs), and recurrent neural networks (RNNs). In addition, it covers various applications of deep learning in video compression, such as video encoding, video restoration, and video analysis.
In publication [3], a novel approach is presented for detecting COVID-19 using a stacking ensemble deep learning model that leverages both COVID-19 symptoms and chest X-ray images. This study introduces two distinct models, each tailored to specific datasets: one based on COVID-19 symptoms and another on chest X-ray images. The first model is a fusion of four pre-trained deep learning models, namely MLP, RNN, LSTM, and GRU, which are combined in a stacking configuration. This arrangement allows for the training and evaluation of a meta-learner to make the final predictions. Notably, when compared to other deep learning models utilizing two different COVID-19 symptom datasets, our proposed model demonstrated superior performance, with impressive metrics (A = 99.30, P = 99.30, R = 99.30, and F1 = 99.31). The second model integrates the outputs of pre-trained models including ResNet152V2, DenseNet201, VGG16, MobileNetV2, and inception_v3i, using a stacking approach to process chest X-ray datasets. A meta-learner (SVM) is employed to make the final predictions.
Article [4] provides an overview of deep learning methods and their application in medical image processing. The paper examines various deep learning models such as convolutional neural networks (CNNs), recurrent neural networks (RNNs) and generative adversarial networks (GANs) and discusses how they can be used for tasks such as classification, segmentation, registration and synthesis images. The paper also discusses the challenges and limitations of deep learning in medical image processing and provides recommendations for future research.
III. MODELS OVERVIEW
A. Overview of the CNN Model
Convolutional Neural Networks (CNNs) have revolutionized the fields of image and video processing. These neural networks are specifically designed to process visual data efficiently, making them a fundamental component in a wide range of applications, from image classification to video analysis. Here is an overview of CNN models for image and video processing:
1. Convolutional Layers: CNNs use convolutional layers to detect local patterns and features in images. Convolutional filters slide over the input image, performing element-wise multiplication and aggregating local information. Multiple convolutional layers are stacked to capture increasingly complex features.
2. Pooling Layers: After each set of convolutional layers, pooling layers are typically added. Pooling reduces the spatial dimensions of the feature maps, making the network more computationally efficient and robust to small variations in the input.
3. Activation Functions: Activation functions, such as ReLU (Rectified Linear Unit), introduce non-linearity into the network. They help CNNs model complex relationships in the data.
4. Preprocessing: CNNs often require preprocessing steps like resizing images to a consistent input size or normalizing pixel values to enhance their performance.
5. Image Classification: CNNs excel in image classification tasks. They can identify and classify objects within an image, making them valuable in applications like autonomous driving, medical imaging, and more.
6. Video Processing: CNNs are extended to video analysis by treating video frames as a sequence of images. Models like 3D CNNs or 2D CNNs with temporal processing can handle video data, enabling applications in action recognition, surveillance, and video summarization.
Convolutional Neural Networks have drastically advanced image and video processing capabilities, driving progress in computer vision applications across various domains. Their adaptability and the ongoing development of related techniques promise even greater potential in the future [7].
Fig. 1. Architecture of the CNN model.[7]
B. Overview of the RNN Model
Recurrent Neural Networks (RNNs) are a class of neural networks that are primarily associated with sequential data and have been widely used in natural language processing and time series analysis. While CNNs are more common for image and video processing, RNNs can also be applied to these domains, particularly in video analysis. Here's an overview of how RNNs are used for image and video processing:
1. Sequence Modeling: RNNs are designed to model sequential data, making them well-suited for video analysis where frames are processed sequentially. Each frame in a video is treated as a time step, allowing RNNs to capture temporal dependencies.
2. Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU): To address the vanishing gradient problem and improve the modeling of long-range dependencies, variants of RNNs like LSTMs and GRUs are commonly used in image and video processing. These units maintain a memory of previous time steps and selectively update and forget information.
3. Video Classification: RNNs can be used for video classification tasks, such as action recognition. By processing a sequence of video frames through an RNN, it's possible to classify the actions or activities occurring in the video.
4. Video Generation: RNNs can also be employed for video generation. By training an RNN on a dataset of video sequences, it's possible to generate new, similar video sequences. This has applications in video synthesis, animation, and more.
While CNNs dominate image processing, RNNs have found their place in video analysis by capturing temporal patterns and enabling tasks that require understanding the sequential nature of video data. Their effective use often involves combining them with other neural network architectures for more comprehensive solutions.[5]
C. Overview of the Hybrid Model (CNN and RNN) for Image and Video Processing
Hybrid models that combine Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) have become increasingly popular in the fields of image and video processing. These models leverage the strengths of both CNNs and RNNs to address a wide range of tasks. Here's an overview of the hybrid model for image and video processing:
1. CNN for Feature Extraction: CNNs are highly effective in extracting spatial features from images. In a hybrid model, the first part of the architecture typically consists of one or more CNN layers. These layers are responsible for capturing local patterns and high-level features from individual frames in the case of video processing. For image processing, they extract relevant visual features.
2. RNN for Temporal Modeling: The output from the CNN layers is passed to the RNN component of the model. RNNs, such as LSTMs or GRUs, are designed to handle sequential data, making them ideal for capturing temporal information in videos. The RNN processes the extracted features across time steps, modeling temporal dependencies in the data.
3. Video Analysis: In the context of video processing, hybrid models can perform various tasks, including action recognition, video captioning, and video generation. The CNN captures spatial features in individual frames, and the RNN processes the temporal dynamics, enabling the model to understand actions and events.
4. Image Understanding: For image processing, hybrid models can enhance image understanding. By combining CNNs for feature extraction with RNNs for context and sequential understanding, these models can be used in applications like image captioning, where the model generates descriptive text based on the content of an image.
Hybrid CNN-RNN models have proven to be versatile and powerful in addressing a wide range of image and video processing tasks, where both spatial and temporal information is critical. Their ability to combine the strengths of CNNs for feature extraction and RNNs for temporal modeling makes them valuable tools for understanding and analyzing visual data.[6]
Fig. 3. Architecture of the hybrid model. [6]
IV. EXPERIMENTAL RESEARCH
A. Data
In this work, the UCF101 Videos dataset was used to train the models. The UCF101 dataset comprises a collection of videos spanning 101 action categories, making it a valuable resource for training and evaluating action recognition algorithms. Each video in the dataset is labeled with one of the 101 action classes, such as "playing soccer," "brushing teeth," "swinging baseball bat," and many more. This dataset is well-suited for a variety of research and development tasks related to video action recognition.
The UCF101 dataset provides a rich source of video data, and its diversity in action categories allows for the testing and validation of action recognition models across a wide spectrum of activities. Researchers and practitioners in the field of computer vision and video analysis frequently use the UCF101 dataset as a benchmark for evaluating the performance of video action recognition algorithms.
B. Making Predictions
The Hybrid (CNN and RNN) system in image and video processing tasks combines the features extracted by both the CNN and RNN components. This fusion is typically performed to leverage the advantages of both architectures, enabling comprehensive image and video analysis. The approach used in this work is to balance the contributions of CNN (0.7) and RNN (0.3) features. Initially, the weights are adjusted to control the influence of CNN and RNN in the final feature representation (1).
0.7 * The CNN extracts image and video features, resulting in feature vectors.
0.3 * The RNN processes sequential information in videos, extracting temporal patterns and generating feature sequences.
These feature vectors and sequences are then concatenated or merged to create a unified feature representation for further analysis or classification tasks (1).
The combined features are used for various image and video processing tasks such as object detection, action recognition, video segmentation, and more. These hybrid features provide a rich representation that captures spatial and temporal information, making them valuable for a wide range of computer vision tasks. The unified features are processed and sorted based on the specific image or video processing task, leading to improved performance and accuracy in tasks that benefit from both spatial and temporal context.
C. Experiment Evaluation
The evaluation of the effectiveness of deep learning methods in image and video processing tasks was conducted using several metrics, with a focus on accuracy. The primary metrics considered were precision, recall, mean squared error, mean absolute error, and area under the ROC curve (AUC).
Results
The hybrid model, which combines CNN and RNN components, demonstrated impressive results:
TABLE I. Hybrid MODEL EVALUATION
In comparison, the standalone CNN and RNN models showed the following results:
TABLE II. CNN MODEL EVALUATION
TABLE III. RNN MODEL EVALUATION
In the context of image and video processing tasks, accuracy is of utmost importance. The hybrid model, leveraging both CNN and RNN features, achieved outstanding accuracy (1.0000) on the training set, emphasizing its capability to effectively process and classify complex patterns in images and videos. This superior accuracy is crucial for applications demanding high precision in tasks such as object detection, action recognition, and video segmentation.
While the CNN model demonstrated good accuracy (0.9400), and the RNN model showed limited accuracy (0.1175), the hybrid model outperforms both in terms of accuracy, making it a promising approach for comprehensive image and video processing tasks. The validation metrics further underscore the robustness of the hybrid model, indicating its potential for real-world applications.
VI.CONCLUSION
In this comprehensive exploration of deep learning methods for image and video processing tasks, we investigated the performance of individual and hybrid models. The findings underscore the critical role of hybrid models, combining Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN), in achieving superior accuracy and robustness in complex tasks.
Key Takeaways:
•Hybrid Model Dominance: The hybrid model, synergizing CNN and RNN features, emerged as the top performer. Its accuracy of 1.0000 on the training set highlights its proficiency in recognizing intricate patterns within images and videos. This makes it a promising solution for applications demanding high precision.
•CNN Model Strengths: The standalone CNN model exhibited commendable accuracy (0.9400) on the training set, demonstrating its efficacy in capturing spatial features. However, its performance on the validation set suggests a need for further optimization to enhance generalizability.
•Challenges with RNN: The RNN model, focused on temporal dependencies, faced challenges, reflected in its lower accuracy (0.1175). This indicates room for improvement in handling sequential information in image and video data.
Fine-Tuning Insights:
Fine-tuning experiments, including optimizer selection, revealed that the SGD optimizer provided the best results for the hybrid model. This optimizer, balancing precision and recall, contributed to the overall effectiveness of the model.
Implications and Future Directions:
•Real-World Applicability: The hybrid model's impressive accuracy positions it as a valuable tool for real-world applications such as object recognition, action detection, and video segmentation.
•Optimization Opportunities: Further optimization opportunities exist, including hyperparameter tuning (e.g., batch size, epochs), adversarial training, and exploring early stopping techniques to enhance model performance.
This study not only contributes valuable insights into the efficacy of deep learning methods for image and video processing but also emphasizes the potential of hybrid models. As technology advances and datasets evolve, ongoing research and refinement will be essential to unlock the full potential of deep learning in addressing the intricacies of visual data. The journey continues towards creating more robust, accurate, and versatile models for the dynamic landscape of image and video processing tasks.
REFERENCES
[1]Y. Zhang, S. Kwong, L. Xu, і T. Zhao, «Advances in Deep-Learning-Based Sensing, Imaging, and Video Processing», Sensors.
[2]W. Saideni, D. Helbert, F. Courreges, і J.-P. Cances, «An Overview on Deep Learning Techniques for Video Compressive Sensing», Applied Sciences (Switzerland).
[3]K. Kc, Z. Yin, M. Wu, і Z. Wu, «Evaluation of deep learning-based approaches for COVID-19 classification based on chest X-ray images», Signal, Image and Video Processing.
[4]A. Maier, C. Syben, T. Lasser, і C. Riess, «A gentle introduction to deep learning in medical image processing», Zeitschrift fur Medizinische Physik.
[5]S. Islam, A. Dash, A. Seum, A. H. Raj, T. Hossain, і F. M. Shah, «Exploring Video Captioning Techniques: A Comprehensive Survey on Deep Learning Methods».
[6]«MRI to MGMT: predicting methylation status in glioblastoma patients using convolutional recurrent neural networks | Biocomputing 2018»
[7]S. Shah, «Convolutional Neural Network: An Overview», Analytics Vidhya. Application date: November 14, 2023. [Online]. Available in: https://www.analyticsvidhya.com/blog/2022/01/convo