Intelligent Video Surveillance using Object Detection

January 8, 2025
15 Min
In-Video AI

Object detection plays a crucial role in transforming traditional video surveillance into intelligent systems capable of recognizing and tracking objects in real time. This blog post dives deep into the mechanics of object detection for video surveillance, focusing on the techniques, tools, frameworks, and practical applications.

What is intelligent video surveillance?

Intelligent video surveillance refers to the use of AI and machine learning technologies to analyze video streams from security cameras in real time. Unlike traditional systems that rely on human operators, intelligent systems can automatically detect and respond to events, improving accuracy and reducing response times.

Role of object detection in video surveillance

Object detection is the core technology behind intelligent video surveillance. It involves identifying and classifying objects in video frames, such as people, vehicles, or other entities, and tracking their movement. This technology is essential for various applications, including anomaly detection, traffic monitoring, and crowd management.

What is object detection?

Understanding the fundamentals of object detection

Before diving deep into the topic, it is important to understand a few fundamental concepts.

Classification: Image classification is the task of assigning a label or class to an entire image, typically by predicting a probability for each class. It's a key task in computer vision and is often performed with classification networks such as Convolutional Neural Networks (CNNs). Image classification is used in a variety of contexts, such as remote sensing.

Localization: Localization draws a bounding box around an object in an image to specify its location. Classification with localization both classifies the object in the image and determines its bounding box.

source: pexels

What is a bounding box?

A bounding box is a rectangular region around an object in an image that’s used to identify and locate the object in computer vision tasks.
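
As a minimal sketch of how a bounding box is usually handled in code: detectors often report a box as a normalized centre point plus width and height, while drawing routines expect pixel corner coordinates. The helper name `center_to_corners` and the clamping behaviour are illustrative assumptions, not part of any particular library.

```python
def center_to_corners(cx, cy, w, h, img_w, img_h):
    """Convert a normalized (cx, cy, w, h) box to pixel (x1, y1, x2, y2)."""
    box_w = w * img_w
    box_h = h * img_h
    x1 = cx * img_w - box_w / 2
    y1 = cy * img_h - box_h / 2
    x2 = x1 + box_w
    y2 = y1 + box_h
    # Clamp to the image so the box never extends past the frame.
    x1 = max(0, min(img_w, x1))
    y1 = max(0, min(img_h, y1))
    x2 = max(0, min(img_w, x2))
    y2 = max(0, min(img_h, y2))
    return int(round(x1)), int(round(y1)), int(round(x2)), int(round(y2))
```

For example, a box centred in a 640x480 frame that covers half of each dimension maps to the corners (160, 120) and (480, 360).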

Object detection: Object detection is a computer vision technique that identifies and locates objects in images or videos. It combines object localization and classification: the model determines where each object is and which category it belongs to. It is used in many applications, such as video surveillance, self-driving cars, and medical imaging.

Instance segmentation: It is a computer vision technique that identifies each object in an image and labels every pixel with the specific object instance it belongs to. It combines object detection and semantic segmentation and provides a more detailed output than either technique alone.

Object tracking: It is a computer vision technique that follows objects across the frames of a video. Tracking algorithms typically start by detecting objects in a frame and assigning a unique identifier to each one. The algorithm then tracks the objects as they move through the video, estimating their position and other relevant information.

Popular object detection algorithms:

Deep learning has significantly advanced the field of object detection. Here are some of the most widely used object detection algorithms based on deep learning:

1. Region-based Convolutional Neural Networks (R-CNN) Family

  • R-CNN (Region-based Convolutional Neural Network): One of the first approaches to use deep learning for object detection. It generates region proposals and classifies each one using a convolutional neural network (CNN). While effective, R-CNN is slow because it processes each region proposal independently.
  • Fast R-CNN: Improves the speed of R-CNN by feeding the entire image to a CNN to generate a feature map, then using region proposals to extract regions from this map.
  • Faster R-CNN: Further speeds up the process by introducing the Region Proposal Network (RPN), which directly predicts region proposals from the feature map.
  • Mask R-CNN: An extension of Faster R-CNN that adds a branch for predicting segmentation masks on each detected object, making it suitable for instance segmentation as well.

source: researchgate

2. Single Shot MultiBox Detector (SSD)

SSD performs object detection in a single pass through the network, making it much faster than R-CNN models. It predicts bounding boxes and class scores directly from a set of default boxes placed over feature maps at several scales. It's well-suited for real-time applications.

3. You only look once (YOLO) family

  • YOLO: Frames object detection as a single regression problem, predicting bounding boxes and class probabilities from the entire image in one evaluation. Known for its speed, it’s perfect for real-time applications.
  • YOLOv1 to YOLOv7 and beyond: Successive versions refine the network architecture and training procedure to improve accuracy and computational efficiency.

source: arxiv.org

source: researchgate.net

4. Vision transformers (ViTs) and DETR

  • DETR: Uses transformers for object detection, treating it as a set prediction problem. This model simplifies the detection process by eliminating region proposals and anchor boxes.
Source: https://arxiv.org/pdf/2005.12872v3


  • Vision transformers: Initially designed for image classification, ViTs have been adapted for object detection, where their attention mechanisms can improve accuracy.

5. Swin transformer

Swin Transformer is a hierarchical vision transformer that excels in dense prediction tasks, including object detection. It improves efficiency and scalability by computing self-attention within shifted local windows.

source: arxiv.org

6. CLIP (Contrastive language-image pretraining)

While not specifically designed for object detection, OpenAI's CLIP learns visual concepts from natural language descriptions, enabling zero-shot classification and generalized visual understanding. This capability can be integrated with detection frameworks for versatile, multi-modal tasks.

7. Faster models for edge devices

  • Tiny YOLO: A scaled-down version of YOLO designed for resource-constrained devices like mobile phones and embedded systems.
  • MobileNet SSD: Combines MobileNet (a lightweight CNN) with SSD for real-time object detection on devices with limited processing power.

Key considerations:

  • Speed vs. Accuracy: One-stage detectors like YOLO and SSD are faster and suitable for real-time applications, while two-stage detectors like Faster R-CNN are typically more accurate but slower.
  • Size of objects: Algorithms like RetinaNet and EfficientDet tend to handle small objects better because they combine features from multiple scales (a feature pyramid network in RetinaNet, BiFPN in EfficientDet).
  • Hardware requirements: Transformers like DETR are computationally intensive but can deliver highly competitive detection accuracy.

Step-by-step approach to implementing an intelligent video surveillance system using object detection

1. Setup and requirements

  • Hardware: Use a camera setup (IP cameras, CCTV, or webcams) for video input.
  • Software: Use object detection models together with libraries and frameworks like TensorFlow, PyTorch, or OpenCV.
  • Environment: You can run the setup on a local server, cloud server, or an edge device like NVIDIA Jetson for real-time processing.

2. Choose an object detection model

Select from pre-trained models based on your performance requirements and computational resources:

  • YOLO (You Only Look Once): Fast and accurate, suitable for real-time detection.
  • SSD (Single Shot MultiBox Detector): Good for speed and accuracy balance.
  • Faster R-CNN: Highly accurate but slower compared to YOLO and SSD.

3. Data collection and preprocessing

  • If you're not using a pre-trained model, gather data specific to your application.
  • Preprocess the data by labeling objects in images or video frames using tools like LabelImg.
  • Resize and normalize the data to match the input requirements of your chosen object detection model.
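
As a rough sketch of the resize-and-normalize step: one common option is letterboxing, which scales the frame to fit a square model input while keeping the aspect ratio and pads the remainder. This is an assumption about the preprocessing (the OpenCV snippet later in this post simply stretches the frame to 416x416), and the helper names below are hypothetical.

```python
def letterbox_params(src_w, src_h, dst=416):
    """Return (scale, new_w, new_h, pad_x, pad_y) for fitting a frame
    into a dst x dst input without changing its aspect ratio."""
    scale = min(dst / src_w, dst / src_h)
    new_w = int(round(src_w * scale))
    new_h = int(round(src_h * scale))
    # Padding is split evenly; these are the left and top offsets.
    pad_x = (dst - new_w) // 2
    pad_y = (dst - new_h) // 2
    return scale, new_w, new_h, pad_x, pad_y


def normalize(pixels):
    """Map 8-bit pixel values (0-255) to floats in [0, 1]."""
    return [p / 255.0 for p in pixels]
```

The same scale and offsets are needed later to map predicted boxes back to the original frame, and any labeled boxes must be transformed the same way as the image.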

4. Training the object detection model (optional)

  • Fine-tune a pre-trained model on your specific dataset if needed.
  • Use transfer learning techniques to speed up the training process.

5. Real-time object detection

  • Use OpenCV to capture live video feeds and process frames.
  • Load the trained object detection model and run it on each video frame.
  • Implement object detection by drawing bounding boxes and labels on detected objects.

Here's a Python code snippet using YOLO with OpenCV:

import cv2
import numpy as np

# Load the YOLOv3 model (the weights and config files must be downloaded separately)
net = cv2.dnn.readNet('yolov3.weights', 'yolov3.cfg')

# Names of the output layers; this call works across OpenCV versions
output_layers = net.getUnconnectedOutLayersNames()

# Capture the video feed from the default camera
cap = cv2.VideoCapture(0)

while True:
    ret, frame = cap.read()
    if not ret:  # Stop when no frame could be read
        break
    height, width, _ = frame.shape

    # Prepare the frame: scale pixels to [0, 1], resize to 416x416, swap BGR to RGB
    blob = cv2.dnn.blobFromImage(frame, 0.00392, (416, 416), (0, 0, 0), True, crop=False)
    net.setInput(blob)
    outs = net.forward(output_layers)

    # Process detections
    for out in outs:
        for detection in out:
            scores = detection[5:]  # Per-class scores follow the 4 box values and objectness
            class_id = np.argmax(scores)
            confidence = scores[class_id]
            if confidence > 0.5:
                # Box values are normalized centre/size; convert to pixel corners
                center_x = int(detection[0] * width)
                center_y = int(detection[1] * height)
                w = int(detection[2] * width)
                h = int(detection[3] * height)
                x = int(center_x - w / 2)
                y = int(center_y - h / 2)
                cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)

    cv2.imshow('Video Surveillance', frame)
    if cv2.waitKey(1) == 27:  # Press 'ESC' to exit
        break

cap.release()
cv2.destroyAllWindows()

6. Multi-object tracking with Deep SORT  

Now, we'll integrate Deep SORT to track these detected objects. Make sure you have the deep_sort_realtime library installed. If not, you can install it using:

pip install deep-sort-realtime

Below is the code to implement object tracking using Deep SORT with the detected bounding boxes from YOLO.

import cv2
import numpy as np

from deep_sort_realtime.deepsort_tracker import DeepSort

# Initialize Deep SORT tracker
tracker = DeepSort(max_age=30, n_init=3, nms_max_overlap=1.0, max_cosine_distance=0.4)

# Load YOLO model and class names
net = cv2.dnn.readNet('yolov3.weights', 'yolov3.cfg')
output_layers = net.getUnconnectedOutLayersNames()
with open('coco.names') as f:
    classes = f.read().strip().split('\n')

cap = cv2.VideoCapture(0)

while True:
    ret, frame = cap.read()
    if not ret:  # Stop when no frame could be read
        break
    height, width, _ = frame.shape

    blob = cv2.dnn.blobFromImage(frame, 0.00392, (416, 416), (0, 0, 0), True, crop=False)
    net.setInput(blob)
    outs = net.forward(output_layers)

    detections = []

    for out in outs:
        for detection in out:
            scores = detection[5:]
            class_id = np.argmax(scores)
            confidence = scores[class_id]

            if confidence > 0.5:
                center_x = int(detection[0] * width)
                center_y = int(detection[1] * height)
                w = int(detection[2] * width)
                h = int(detection[3] * height)
                x = int(center_x - w / 2)
                y = int(center_y - h / 2)
                # deep_sort_realtime expects ([left, top, w, h], confidence, class)
                detections.append(([x, y, w, h], float(confidence), classes[class_id]))

    # Update tracker with the new detections
    tracks = tracker.update_tracks(detections, frame=frame)

    # Loop over the tracks and draw them on the frame
    for track in tracks:
        if not track.is_confirmed():
            continue

        track_id = track.track_id
        ltrb = track.to_ltrb()
        x1, y1, x2, y2 = map(int, ltrb)
        cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.putText(frame, f'ID: {track_id}', (x1, y1 - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 255, 255), 2)

    cv2.imshow('Deep SORT Object Tracking', frame)

    if cv2.waitKey(1) == 27:  # Press 'ESC' to exit
        break

cap.release()
cv2.destroyAllWindows()

Explanation of the Deep SORT implementation

  • detections: A list of detections from the YOLO object detection step, each holding the bounding box (left, top, width, height), the confidence, and the class.
  • tracker.update_tracks(): This method updates the tracker with new detections, assigning consistent IDs to objects and predicting their next positions.
  • track.is_confirmed(): Ensures that only confirmed tracks are drawn to reduce noise and false detections.

7. Alert mechanism and event detection

  • Set up rules for generating alerts when specific events occur (e.g., detecting a person in a restricted area).
  • Integrate with messaging or alert systems like email, SMS, or IoT alarms.
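
A rule such as "person in a restricted area" can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the zone is an axis-aligned rectangle in pixel coordinates, a track counts as inside when the bottom-centre of its box falls in the zone, and a per-track cooldown avoids repeated alerts. The class name `ZoneAlert` and its parameters are hypothetical, and actually sending the email or SMS is left out.

```python
import time


class ZoneAlert:
    """Raise at most one alert per track every `cooldown_s` seconds."""

    def __init__(self, zone, cooldown_s=10.0):
        self.zone = zone              # (x1, y1, x2, y2) in pixels
        self.cooldown_s = cooldown_s
        self._last_alert = {}         # track_id -> time of last alert

    def check(self, track_id, box, now=None):
        """Return True if an alert should be sent for this track."""
        now = time.monotonic() if now is None else now
        x1, y1, x2, y2 = box
        # Bottom-centre of the box approximates where the person stands.
        px, py = (x1 + x2) / 2, y2
        zx1, zy1, zx2, zy2 = self.zone
        if not (zx1 <= px <= zx2 and zy1 <= py <= zy2):
            return False
        last = self._last_alert.get(track_id)
        if last is not None and now - last < self.cooldown_s:
            return False              # Still in cooldown for this track
        self._last_alert[track_id] = now
        return True
```

In a pipeline like the one above, `check` would be called once per confirmed track per frame, using the track ID and the box from the tracker.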

8. Data storage and management

  • Store the detected events in a database for future analysis.
  • Use cloud storage services like AWS S3 or databases like ScyllaDB for efficient data handling.
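
To illustrate what storing detected events might look like, here is a small sketch that uses Python's built-in SQLite module as a stand-in for a production database such as ScyllaDB. The table layout, column names, and helper functions are assumptions made for this example only.

```python
import sqlite3


def open_event_store(path=":memory:"):
    """Open (or create) the event database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS events (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               ts REAL NOT NULL,
               camera TEXT NOT NULL,
               label TEXT NOT NULL,
               confidence REAL NOT NULL,
               x INTEGER, y INTEGER, w INTEGER, h INTEGER
           )"""
    )
    return conn


def log_event(conn, ts, camera, label, confidence, box):
    """Insert one detection event; box is (x, y, w, h) in pixels."""
    x, y, w, h = box
    conn.execute(
        "INSERT INTO events (ts, camera, label, confidence, x, y, w, h) "
        "VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (ts, camera, label, confidence, x, y, w, h),
    )
    conn.commit()


def count_by_label(conn):
    """Return a {label: count} summary for later analysis."""
    rows = conn.execute(
        "SELECT label, COUNT(*) FROM events GROUP BY label ORDER BY label"
    )
    return dict(rows.fetchall())
```

Storing one row per event, rather than per frame, keeps the table small; the matching video clips could then live in object storage such as AWS S3.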

9. Integration with analytics and reporting tools

  • Create dashboards to visualize real-time data and historical trends.
  • Integrate with analytics tools to extract insights from the surveillance data.

10. Optimization and scaling

  • Optimize the detection model to reduce latency and improve accuracy.
  • Consider using edge computing for real-time processing at the device level.
  • Scale the system to handle multiple cameras or locations as needed.

This approach will help you create a comprehensive intelligent video surveillance system that leverages object detection to provide accurate and actionable insights in real time.

Now that we have implemented Deep SORT, let's explore object tracking techniques in more detail:

Object tracking techniques

Deep SORT (Simple Online and Realtime Tracking with a Deep Association Metric)

  • Deep SORT extends the traditional SORT algorithm with a deep learning component that improves object association across frames, making it suitable for scenarios with crowded environments.
  • It uses a combination of motion and appearance information to track objects, which reduces ID switches (when the identity of tracked objects changes unexpectedly).
  • Components of Deep SORT:
    • Kalman filter: Predicts the next position of the object based on its past positions.
    • Hungarian algorithm: Matches the detected objects with the tracked objects using a cost matrix.
    • Appearance descriptor: A deep neural network generates a feature vector for each detected object to assist in re-identification.
  • Libraries: You can use the deep_sort_realtime package, which simplifies integration with Python.
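
To make the matching step concrete, here is a minimal sketch of minimum-cost assignment between existing tracks and new detections. It is a simplification in two ways: the cost is only the distance between box centres (Deep SORT combines motion and appearance costs), and the best assignment is found by brute force, which is only practical for a handful of objects. The Hungarian algorithm solves the same problem efficiently. The names `match_tracks` and `max_distance` are illustrative assumptions.

```python
from itertools import permutations
from math import hypot


def centre(box):
    """Centre point of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2, (y1 + y2) / 2


def match_tracks(track_boxes, det_boxes, max_distance=50.0):
    """Return sorted (track_index, detection_index) pairs with minimum total cost."""
    # Cost matrix: rows are tracks, columns are detections.
    cost = []
    for t in track_boxes:
        tx, ty = centre(t)
        cost.append([hypot(tx - centre(d)[0], ty - centre(d)[1]) for d in det_boxes])

    n_t, n_d = len(track_boxes), len(det_boxes)
    if n_t <= n_d:
        # Every track gets a distinct detection.
        candidates = (list(enumerate(p)) for p in permutations(range(n_d), n_t))
    else:
        # Every detection gets a distinct track.
        candidates = ([(t, d) for d, t in enumerate(p)]
                      for p in permutations(range(n_t), n_d))

    best = min(candidates, key=lambda pairs: sum(cost[t][d] for t, d in pairs))
    # Gate the result: pairs that are too far apart are left unmatched.
    return sorted((t, d) for t, d in best if cost[t][d] <= max_distance)
```

Unmatched detections would start new tracks, and tracks that stay unmatched for too long would be dropped, which is what the `max_age` parameter controls in the Deep SORT code above.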

Kalman filters

  • Kalman filters are a mathematical tool for estimating the state of a dynamic system, which in this case is the position and velocity of the detected objects.
  • They predict an object's future location, adjust the prediction based on measurements (detections), and refine their estimates to handle noisy data effectively.
  • Kalman Filters work well when the motion of the objects is linear and predictable.
  • Applications: Suitable for basic object tracking tasks where complex re-identification is not required.
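
The predict-and-correct cycle can be sketched with a constant-velocity Kalman filter for a single coordinate. This is a simplified illustration: a real tracker estimates a multi-dimensional box state, and the noise values and initial uncertainty below are arbitrary assumptions. The covariance matrix is stored as three numbers (`a`, `b`, `c`) because it is a symmetric 2x2 matrix.

```python
class ConstantVelocityKalman1D:
    """Tracks position and velocity along one axis from noisy positions."""

    def __init__(self, pos=0.0, process_var=0.01, meas_var=1.0):
        self.pos, self.vel = pos, 0.0
        # Covariance [[a, b], [b, c]]; large values mean "very uncertain".
        self.a, self.b, self.c = 100.0, 0.0, 100.0
        self.q, self.r = process_var, meas_var

    def predict(self, dt=1.0):
        """Advance the state by dt and grow the uncertainty."""
        self.pos += dt * self.vel
        a, b, c = self.a, self.b, self.c
        self.a = a + 2 * dt * b + dt * dt * c + self.q
        self.b = b + dt * c
        self.c = c + self.q
        return self.pos

    def update(self, z):
        """Correct the prediction with a measured position z."""
        y = z - self.pos                   # Innovation (measurement residual)
        s = self.a + self.r                # Innovation variance
        k0, k1 = self.a / s, self.b / s    # Kalman gain for position, velocity
        self.pos += k0 * y
        self.vel += k1 * y
        a, b, c = self.a, self.b, self.c
        self.a = (1 - k0) * a
        self.b = (1 - k0) * b
        self.c = c - k1 * b
```

Calling `predict()` once per frame and `update()` whenever a detection is matched lets the tracker coast through short gaps, such as a brief occlusion, using the estimated velocity.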

CSRT tracker (Channel and spatial reliability tracker)

  • CSRT Tracker is a high-accuracy tracker available in OpenCV, which performs well on small and non-rigid objects.
  • It excels in handling scale changes, occlusions, and rotations.
  • It's computationally more expensive than some simpler trackers like KCF (Kernelized Correlation Filters) but provides better performance.
  • Applications: Useful when you need precision in object tracking and have sufficient computational power.

Most commonly used metrics in object detection:

1. Intersection over Union (IoU)

  • Definition: IoU measures the overlap between the predicted bounding box and the ground truth bounding box. It is calculated as:

    IoU = Area of Overlap / Area of Union
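
The overlap ratio described above can be computed in a few lines. This sketch assumes boxes are given as (x1, y1, x2, y2) corner coordinates; the function name is illustrative.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlapping region (zero if the boxes are disjoint).
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0
```

Two 10x10 boxes offset by 5 pixels in each direction share a 5x5 region, so their IoU is 25 / 175, or roughly 0.14, well below the 0.5 threshold commonly used to count a detection as correct.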

2. Precision and recall

  • Precision: Precision measures the accuracy of the positive predictions made by the model. It is defined as:

    Precision = TP / (TP + FP)

  • Recall: Recall measures the model's ability to find all relevant objects. It is defined as:

    Recall = TP / (TP + FN)

    Here TP, FP, and FN are the numbers of true positives, false positives, and false negatives.
  • Interpretation: High precision means that the model makes fewer mistakes in predicting objects, while high recall means it detects most of the objects present.
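
As a quick worked example, precision and recall follow directly from the counts of true positives (TP), false positives (FP), and false negatives (FN). The function name and the sample numbers are illustrative.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from detection counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```

With 80 correct detections, 20 false alarms, and 40 missed objects, precision is 80 / 100 = 0.8 and recall is 80 / 120, or about 0.67: the model is right most of the time when it fires, but it still misses a third of the objects.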

3. Average precision (AP)

  • AP calculation: Average Precision is the area under the Precision-Recall curve, which plots precision against recall as the confidence threshold varies, for a fixed IoU threshold.
  • AP per Class: It is often computed for each class separately, and the mean value across all classes is used as a summary metric.
  • Interpretation: AP is a commonly used metric for evaluating object detection models, as it provides a balanced measure of precision and recall.
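
A rough sketch of the AP calculation for one class: detections are ranked by confidence, precision and recall are computed after each one, and the area under the resulting curve is summed. It assumes each detection has already been marked as a true or false positive at a fixed IoU threshold, and it uses the common interpolation that makes precision non-increasing. Benchmark toolkits differ in the exact interpolation, so treat this as illustrative.

```python
def average_precision(scored_hits, num_ground_truth):
    """scored_hits: list of (confidence, is_true_positive) for one class."""
    if num_ground_truth == 0:
        return 0.0
    ranked = sorted(scored_hits, key=lambda s: s[0], reverse=True)

    tp = fp = 0
    precisions, recalls = [], []
    for _, is_hit in ranked:
        if is_hit:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_ground_truth)

    # Interpolate: precision at each point is the best precision to its right.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Sum rectangle areas under the precision-recall curve.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap
```

For three detections ranked hit, miss, hit against two ground truth objects, the curve covers recall 0 to 0.5 at precision 1.0 and recall 0.5 to 1.0 at precision 2/3, giving an AP of 5/6.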

4. Mean average precision (mAP)

  • Definition: mAP is the mean of the Average Precision (AP) values for all object classes in the dataset. It is the primary metric used to evaluate object detection models.
  • IoU Thresholds: mAP is often computed at multiple IoU thresholds (e.g., mAP@0.5 and mAP@[0.5:0.95]). For instance:
    • mAP@0.5 (PASCAL VOC-style metric): This measures mAP at an IoU threshold of 0.5, meaning that a detection is considered correct if its IoU with the ground truth is at least 0.5.
    • mAP@[0.5:0.95] (COCO average): This metric averages mAP across IoU thresholds ranging from 0.5 to 0.95 in increments of 0.05, providing a more stringent and robust evaluation.
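
The averaging itself is simple once per-class AP values are available. The sketch below assumes a hypothetical `ap_table` that maps each IoU threshold to a dictionary of per-class AP values; how those AP values are produced is outside this example.

```python
# IoU thresholds 0.50, 0.55, ..., 0.95 used for mAP@[0.5:0.95]
COCO_IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def mean_average_precision(ap_per_class):
    """Mean of the per-class AP values at one IoU threshold."""
    if not ap_per_class:
        return 0.0
    return sum(ap_per_class.values()) / len(ap_per_class)


def map_over_thresholds(ap_table):
    """Average mAP over all thresholds; ap_table[threshold][class] -> AP."""
    per_threshold = [mean_average_precision(ap_table[t]) for t in COCO_IOU_THRESHOLDS]
    return sum(per_threshold) / len(per_threshold)
```

Because AP usually drops as the IoU threshold rises, the averaged figure is typically noticeably lower than mAP@0.5 for the same model.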

5. F1 score

  • Definition: The F1 score is the harmonic mean of precision and recall, giving a balanced measure that considers both metrics:

    F1 = 2 × (Precision × Recall) / (Precision + Recall)
  • Interpretation: It provides a single score that balances the trade-off between precision and recall. A high F1 score indicates a good balance between both metrics.
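
A short sketch shows why the harmonic mean is used: it stays low unless both precision and recall are reasonably high. The function name and sample values are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

A model with precision 0.9 and recall 0.1 has an arithmetic mean of 0.5 but an F1 score of only 0.18, which reflects how many objects it misses.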

6. Precision-recall curve

  • Description: This curve plots precision against recall at various threshold settings. A model with a precision-recall curve that hugs the upper right corner of the graph is considered to have better performance.
  • Area under curve (AUC): The area under the precision-recall curve can be used as a metric to summarize the trade-off between precision and recall.

7. FPS (Frames Per Second)

  • Definition: FPS measures how many frames per second the object detection model can process. It is an important metric for real-time applications.
  • Interpretation: Higher FPS indicates faster inference time, making it suitable for applications where speed is critical, like video surveillance or autonomous driving.
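
FPS can be measured by timing the processing function over a batch of frames. In this sketch, `process_frame` is a placeholder for whatever detection call you use, and the warm-up runs are an assumption based on the fact that the first few inferences are often slower than the rest.

```python
import time


def measure_fps(process_frame, frames, warmup=2):
    """Return frames processed per second by `process_frame`."""
    frames = list(frames)
    # Warm-up runs are not timed.
    for frame in frames[:warmup]:
        process_frame(frame)

    start = time.perf_counter()
    for frame in frames:
        process_frame(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed if elapsed > 0 else float("inf")
```

For a fair number, time the whole per-frame pipeline (preprocessing, inference, and post-processing), since that is what determines whether the system keeps up with the camera.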

8. True Positives, False Positives, and False Negatives

  • True positives (TP): Correctly detected objects.
  • False positives (FP): Objects that the model incorrectly detects (i.e., there is no corresponding ground truth object).
  • False negatives (FN): Ground truth objects that the model fails to detect.
  • Interpretation: These values are used to compute precision, recall, and other metrics that give insight into the model's detection performance.
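
One way to obtain these counts for a single image is to match predictions to ground truth boxes greedily, highest confidence first, so that each ground truth box can be claimed only once. This is a simplified sketch: boxes are (x1, y1, x2, y2) tuples, classes are ignored, and the function names are illustrative.

```python
def box_iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def count_matches(predictions, ground_truth, iou_threshold=0.5):
    """predictions: list of (confidence, box). Returns (tp, fp, fn)."""
    unmatched = list(ground_truth)
    tp = fp = 0
    for _, box in sorted(predictions, key=lambda p: p[0], reverse=True):
        best = max(unmatched, key=lambda g: box_iou(box, g), default=None)
        if best is not None and box_iou(box, best) >= iou_threshold:
            unmatched.remove(best)   # Each ground truth box matches once
            tp += 1
        else:
            fp += 1                  # Duplicate or spurious detection
    return tp, fp, len(unmatched)    # Leftover ground truth boxes are misses
```

Note that a second detection of an already matched object counts as a false positive, which is why duplicate boxes hurt precision.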

Summary of Metrics

  • Precision: Measures how accurate the model's predictions are.
  • Recall: Measures how many of the actual objects are detected.
  • mAP: The most comprehensive metric for object detection evaluation, accounting for both precision and recall across multiple classes and IoU thresholds.
  • FPS: Indicates the speed of the model, important for real-time applications.

Which Metrics to Focus On

  • Accuracy-critical applications: Use mAP, AP, and F1 score to ensure the highest detection accuracy.
  • Real-time applications: Focus on FPS and inference time, in addition to precision and recall.
  • Imbalanced data: Prioritize models with a good balance of precision and recall, using F1 score and mAP as key indicators.

Integration with FastPix API

You can implement video surveillance using the FastPix live streaming API and run object detection inference on the stream in parallel, using the techniques above.

Live streaming with FastPix supports low-latency, long-duration streams, which makes it suitable for live events, video surveillance, and many other applications.

A step-by-step tutorial on live streaming with FastPix can be found here.

Applications of intelligent video surveillance

Intelligent video surveillance is transforming various sectors by enabling automated monitoring, real-time analytics, and actionable insights. Let’s explore each of these applications in detail, along with the latest developments and use cases:

Smart city traffic management

Description: Intelligent video surveillance plays a crucial role in managing traffic in modern smart cities. By detecting and analyzing traffic patterns, it helps in reducing congestion, enhancing road safety, and improving overall traffic flow.

How it works:

  • Vehicle detection and classification: Object detection algorithms identify different types of vehicles (e.g., cars, trucks, motorcycles) and analyze their movement in real-time.

source: link.springer.com

  • Traffic flow analysis: The system continuously monitors traffic density at intersections and identifies congestion patterns.

  • Dynamic traffic signal control: Based on real-time traffic data, the system can dynamically adjust traffic light durations to optimize vehicle flow and reduce wait times.

source: link.springer.com

Example: Cities like Singapore and Amsterdam are deploying AI-based video surveillance systems to monitor traffic and adjust signal timings in real-time, significantly reducing traffic congestion and enhancing road safety.

Retail analytics

Description: Intelligent video surveillance is revolutionizing the retail industry by offering insights into customer behavior, store traffic patterns, and product engagement. This data-driven approach helps retailers optimize product placement, improve store layout, and enhance the overall shopping experience.

How it works:

  • Customer footfall analysis: Cameras equipped with AI detect and count the number of customers entering and leaving the store.
  • Heatmaps: Video analytics generate heatmaps indicating which areas of the store attract the most customers, helping retailers optimize product placements.  

Source: Deloitte

  • Queue Management: Real-time monitoring of checkout lines enables dynamic adjustments, such as opening new registers to reduce wait times.
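
Customer footfall analysis is often built on top of tracking: each tracked person is counted when their centre point crosses a virtual line at the entrance. The sketch below assumes a horizontal line, that moving down the image means entering, and that track IDs come from a tracker such as Deep SORT. The class name and attributes are hypothetical.

```python
class LineCounter:
    """Counts tracks crossing a horizontal line at y = line_y."""

    def __init__(self, line_y):
        self.line_y = line_y
        self.entered = 0
        self.exited = 0
        self._last_y = {}   # track_id -> last seen centroid y

    def update(self, track_id, centroid_y):
        """Call once per frame for every confirmed track."""
        prev = self._last_y.get(track_id)
        self._last_y[track_id] = centroid_y
        if prev is None:
            return          # First sighting; no direction yet
        if prev < self.line_y <= centroid_y:
            self.entered += 1
        elif prev >= self.line_y > centroid_y:
            self.exited += 1
```

Counting per track ID rather than per detection prevents the same person from being counted in every frame, which is where reliable tracking pays off.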

Example: Retail giants like Walmart and Amazon are using intelligent video surveillance to study customer interactions with products, enabling them to refine store layouts and target promotions more effectively.

Perimeter Security

Description: In security-sensitive areas such as military bases, airports, industrial facilities, and data centers, intelligent video surveillance is used to protect the perimeter from unauthorized access. It enhances security by detecting intruders and triggering automated alerts in real-time.

How it works:

  • Intruder Detection: AI algorithms analyze video feeds to detect unauthorized individuals entering restricted areas.

  • Motion Tracking: The system tracks the movement of intruders to predict their next steps and assess the level of threat.
  • Integration with Alarm Systems: Automated alerts are triggered when an intruder is detected, and the video footage is sent to security personnel for immediate action.

Example: Airports like London Heathrow and military installations use AI-powered video surveillance to ensure tight perimeter security, reducing the need for human patrols and increasing the speed of threat detection.

Healthcare and Patient Monitoring

Description: In healthcare settings, intelligent video surveillance systems are used to monitor patients in real-time, providing critical support for improving patient safety, managing resources, and enhancing care delivery.

How it works:

  • Fall Detection: AI-based video surveillance systems are trained to recognize fall incidents in hospitals and care facilities, immediately notifying medical staff.


source: www.kaggle.com

  • Patient Activity Monitoring: Systems track patient movement to ensure they remain in their designated areas and do not attempt to leave without permission.
  • Vital Signs Monitoring: Integrating video surveillance with AI to monitor vital signs, such as detecting respiratory distress or irregular movements.

Example: Hospitals like the Mayo Clinic are adopting AI-driven video surveillance to enhance patient monitoring, ensuring quick response times during emergencies and improving patient safety in intensive care units.

Manufacturing and Production

Description: In manufacturing and production environments, intelligent video surveillance plays a vital role in ensuring the safety, efficiency, and quality of operations. It helps monitor production lines, enhance worker safety, and ensure adherence to quality control standards.

How it works:

  • Quality Inspection: AI-based vision systems are used to detect defects or inconsistencies in products on the assembly line. These systems can automatically identify defects like scratches, dents, or misaligned components, ensuring only high-quality products are shipped.

source: smarttek.solutions

  • Process Optimization: Video analytics monitor the production workflow to identify bottlenecks or inefficiencies in the manufacturing process, helping optimize production speed and reduce waste.
  • Worker Safety Compliance: Cameras equipped with AI detect whether workers are following safety protocols, such as wearing protective gear or staying within safe zones.

source: www.labellerr.com

Example: Companies like Siemens and General Electric have integrated AI-based video surveillance into their manufacturing processes to enhance quality control, streamline production workflows, and ensure a safer work environment for their employees.

Future trends in video surveillance and object detection

  • Integration with IoT: Combining object detection with IoT sensors for enhanced data accuracy and analytics.
  • AI-Powered Video Analytics: Leveraging AI to not just detect objects but also predict behaviors and potential threats.
  • 5G and Edge Computing: Ultra-fast 5G networks will enable real-time processing of video data, making intelligent surveillance more effective.

Conclusion

Object detection is transforming video surveillance by providing instant insights and smarter monitoring. It enables real-time detection of critical events, improving response times and security. With this feature, your system becomes faster, more efficient, and more accurate, giving you better control and peace of mind.

Adding this technology doesn’t have to be complicated. The FastPix object detection feature makes it easy, so you can focus on building better surveillance solutions without worrying about the technical details.

With just a few lines of code, you can quickly integrate object detection into your applications. Visit the feature page to learn how FastPix can enhance your video workflows.

Frequently Asked Questions (FAQs)

How does object detection work in video surveillance?

Object detection works by analyzing video frames to identify and locate objects such as people, vehicles, or other items of interest. The system uses machine learning algorithms to detect and track these objects in real time, making surveillance smarter.

What is YOLO and how is it used in object detection?

YOLO (You Only Look Once) is a deep learning algorithm used for real-time object detection. It scans images and videos to identify multiple objects at once, making it efficient for video surveillance and other real-time applications.

What are the best algorithms for video surveillance object detection?

Commonly used algorithms for video surveillance object detection include YOLO, Faster R-CNN, and SSD (Single Shot Multibox Detector). These algorithms are known for their speed and accuracy in identifying and tracking objects in real-time.

What is the difference between YOLO and Faster R-CNN for object detection?

YOLO (You Only Look Once) is a real-time object detection algorithm that processes images quickly, while Faster R-CNN is slower but provides higher accuracy. YOLO is more suitable for real-time surveillance applications, while Faster R-CNN may be better for detailed, high-accuracy detection in controlled environments.

How can I integrate object detection into my video surveillance system?

You can integrate object detection into your video surveillance system using APIs or tools that simplify the process. FastPix provides an easy way to add real-time object detection with just a few lines of code, improving your system’s efficiency.
