GSoc/2023/StatusReports/QuocHungTran

Add Automatic Tags Assignment Tools and Improve Face Recognition Engine for digiKam

digiKam is an advanced open-source digital photo management application that runs on Linux, Windows, and macOS. The application provides a comprehensive set of tools for importing, managing, editing, and sharing photos and raw files.

The goal of this project is to develop a deep learning model that can recognize various categories of objects, scenes, and events in digital photos, and generate corresponding keywords that can be stored in digiKam's database and assigned to each photo automatically. The model should be able to recognize objects such as animals, plants, and vehicles, and scenes such as beaches, mountains, and cities. It should also be able to handle photos taken in various lighting conditions and from different angles.

Mentors: Gilles Caulier, Maik Qualmann, Thanh Trung Dinh

Project Proposal

Automatic Tags Assignment Tools and Improve Face Recognition Engine for digiKam Proposal

GitLab development branch

gsoc23-autotags-assignment

Contacts

Email: [email protected]

Github: quochungtran

Invent KDE: quochungtran

LinkedIn: https://www.linkedin.com/in/tran-quoc-hung-6362821b3/

Project goals

Links to Blogs and other writing

Main merge request

KDE repository for object detection and face recognition researching

Issue tracker

My blog for GSoC

My entire blog:

https://quochungtran.github.io/

May 29 to June 11 (Week 1 - 2) - Experimentation on COCO dataset

In this phase, I focus mainly on offline analysis. This analysis aims to create a deep learning pipeline for an object detection model.

DONE

Constructing datasets (training, validation and testing), starting with some common kinds of objects such as person, bicycle and car.
Preprocessing the data and studying the structure of the COCO dataset, which is used for the training and validation datasets.
Researching and creating a model pipeline for all YOLO versions in Python.
Evaluating the performance of the YOLO method by considering several evaluation metrics.

TODO

Structure of the COCO dataset format

The Common Objects in Context (COCO) dataset is widely recognized as one of the most popular and extensively labeled large-scale image datasets available for public use. It covers a diverse range of objects that we encounter in our daily lives, with annotations for over 1.5 million object instances across 80 categories. To explore the COCO dataset, you can visit the dedicated dataset section on SuperAnnotate's platform.

The data in the COCO dataset is stored in a JSON file, which is organized into sections such as info, licenses, categories, images, and annotations. To acquire the COCO dataset, I specifically utilized the "instances_train2017.json" and "instances_val2017.json" files, which are readily available for download.

   "info": {
       "year": "2021",
       "version": "1.0",
       "description": "Exported from FiftyOne",
       "contributor": "Voxel51",
       "url": "https://fiftyone.ai",
       "date_created": "2021-01-19T09:48:27"
   },
   "licenses": [
       {
         "url": "http://creativecommons.org/licenses/by-nc-sa/2.0/",
         "id": 1,
         "name": "Attribution-NonCommercial-ShareAlike License"
       },
       ...   
   ],
   "categories": [
       ...
       {
           "id": 2,
           "name": "cat",
           "supercategory": "animal"
       },
       ...
   ],
   "images": [
       {
           "id": 0,
           "license": 1,
           "file_name": "<filename0>.<ext>",
           "height": 480,
           "width": 640,
           "date_captured": null
       },
       ...
   ],
   "annotations": [
       {
           "id": 0,
           "image_id": 0,
           "category_id": 2,
           "bbox": [260, 177, 231, 199],
           "segmentation": [...],
           "area": 45969,
           "iscrowd": 0
       },
       ...
   ]

To extract the necessary information from the COCO dataset, I utilized the COCO API, which greatly assists in loading, parsing, and visualizing annotations within the COCO format. This API provides support for multiple annotation formats.

The following table provides an overview of some useful COCO API functions:

APIs	Description
getImgIds	Get img ids that satisfy given filter conditions.
getCatIds	Get cat ids that satisfy given filter conditions.
getAnnIds	Get ann ids that satisfy given filter conditions.
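
As a rough illustration of what these lookups do, the sketch below reimplements simplified versions of them on a tiny in-memory dictionary that follows the JSON layout described above. The helper names (get_cat_ids, get_img_ids, get_ann_ids) and the sample data are hypothetical. This is not the COCO API implementation, only an approximation of its filtering behaviour.

```python
# Simplified stand-ins for the COCO API lookups, written in plain Python.
# The data below is made up for illustration and only mirrors the JSON layout.
coco = {
    "categories": [
        {"id": 1, "name": "person", "supercategory": "person"},
        {"id": 2, "name": "bicycle", "supercategory": "vehicle"},
        {"id": 3, "name": "car", "supercategory": "vehicle"},
    ],
    "images": [
        {"id": 10, "file_name": "a.jpg", "height": 480, "width": 640},
        {"id": 11, "file_name": "b.jpg", "height": 480, "width": 640},
    ],
    "annotations": [
        {"id": 100, "image_id": 10, "category_id": 1, "bbox": [260, 177, 231, 199]},
        {"id": 101, "image_id": 10, "category_id": 3, "bbox": [10, 20, 50, 40]},
        {"id": 102, "image_id": 11, "category_id": 2, "bbox": [5, 5, 100, 80]},
    ],
}


def get_cat_ids(data, names):
    """Return the ids of the categories whose name is in `names`."""
    return [c["id"] for c in data["categories"] if c["name"] in names]


def get_img_ids(data, cat_ids):
    """Return ids of images with at least one annotation in `cat_ids`."""
    wanted = set(cat_ids)
    return sorted({a["image_id"] for a in data["annotations"]
                   if a["category_id"] in wanted})


def get_ann_ids(data, img_ids, cat_ids):
    """Return ids of annotations matching both the image and category filters."""
    imgs, cats = set(img_ids), set(cat_ids)
    return [a["id"] for a in data["annotations"]
            if a["image_id"] in imgs and a["category_id"] in cats]


cat_ids = get_cat_ids(coco, {"person", "car"})
img_ids = get_img_ids(coco, cat_ids)
ann_ids = get_ann_ids(coco, img_ids, cat_ids)
print(cat_ids, img_ids, ann_ids)
```

The real API may treat a list of several category ids differently (for example by requiring all of them to be present in an image), so its documentation should be checked before relying on this behaviour.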

Initially, my focus was on benchmarking the model using common object categories such as person, bicycle, and car. For these specific subcategories, there are currently 1,101 training images and 45 validation images available.

For the testing dataset, I decided to manually label a custom dataset provided by the user. This approach reflects a real-world use case.

Additionally, here is a table listing some of the existing pre-trained labels in the YOLO format:

Categories	Sub Categories
People and animals	person, cat, dog, horse, elephant, bear, etc.
Vehicles	bicycle, car, motorcycle, airplane, bus, train, truck, boat, etc.
Traffic-related objects	traffic light, stop sign, parking meter, etc.
Furniture	chair, couch, bed, dining table, etc.
Food and drink	banana, apple, sandwich, pizza, wine glass, cup, etc.
Sports equipment	skis, snowboard, tennis racket, sports ball, etc.
Electronic devices	TV, laptop, cell phone, remote, etc.
Household items	umbrella, backpack, handbag, suitcase, etc.
Kitchenware	fork, knife, spoon, bowl, etc.
Plants and decoration	potted plant, vase, etc.

Below, you can find some samples from the training dataset. Each image contains multiple object annotations in the form of bounding boxes, denoted by their top-left corner coordinates (x, y) and their respective width and height (w, h).
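
To make the box convention concrete, here is a small sketch that converts a COCO-style box into corner coordinates (convenient for drawing) and into a normalized center-based box, which is the layout Darknet-style YOLO label files commonly use. The function names are hypothetical, and the sample values are taken from the annotation excerpt above, assuming it belongs to the 640x480 image listed there.

```python
def coco_to_corners(bbox):
    """Return the (x1, y1, x2, y2) corners of a COCO (x, y, w, h) box."""
    x, y, w, h = bbox
    return x, y, x + w, y + h


def coco_to_yolo(bbox, img_w, img_h):
    """Convert a COCO box (pixels, top-left origin) to a normalized
    (center_x, center_y, width, height) box with values in [0, 1]."""
    x, y, w, h = bbox
    cx = (x + w / 2) / img_w
    cy = (y + h / 2) / img_h
    return cx, cy, w / img_w, h / img_h


bbox = [260, 177, 231, 199]
corners = coco_to_corners(bbox)
normalized = coco_to_yolo(bbox, img_w=640, img_h=480)
print(corners)
print(normalized)
```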

YOLO model pipeline

Load the version YOLO network

YOLO (You Only Look Once) is an extremely fast multi-object detection algorithm that uses a convolutional neural network (CNN) to detect and identify objects.

I would like to build a pipeline for YOLO detection in Python first. In order to load the model I have to download the pre-trained YOLO weight file and also the YOLO configuration file. Here I use version v3 in the first step:

This YOLO neural network has 254 components, consisting of convolutional layers (conv), rectified linear units (relu), etc.

  import cv2 as cv
  net = cv.dnn.readNetFromDarknet('yolov3.cfg', 'yolov3.weights')

Create a blob

The input to the network is a so-called blob object. The function cv.dnn.blobFromImage(img, scale, size, mean) transforms the image into a blob. This step is a form of data preprocessing: to obtain correct predictions from a deep neural network, the input data must first be prepared in the form the network expects.

This function performs the following steps:

Resizing: It resizes the input image to a specific size required by the model. Deep learning models often have fixed input sizes, and the blobFromImage function ensures that the image is resized to match these requirements.

Normalizing image: Dividing the image by 255 scales the pixel values to the range 0 to 1. This helps keep gradients within a reasonable range during backpropagation, which can aid more stable and efficient convergence during training.

Mean Subtraction: It subtracts the mean values from the image. Mean subtraction helps in normalizing the pixel values and removing the average color intensity. The mean values used for subtraction are usually pre-defined based on the dataset used to train the model.

Channel Swapping: It reorders the color channels of the image. Deep learning models often expect images in a specific channel order, such as RGB (Red, Green, Blue). If the input image has a different channel order, the blobFromImage function swaps the channels accordingly.

Here I set the scale factor to 1/255. This factor only rescales the pixel values, so the relative lightness of the image stays the same as in the original.

After the conversion, the blob is a 4D NumPy array with dimensions (images, channels, height, width), where the image has been resized to (416, 416). The image below shows the three channels (red, green and blue) of the blob.
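
The scaling, mean subtraction, channel swap and layout change can be sketched in plain Python on a tiny nested-list image. This is only an illustration of the idea, not OpenCV's implementation: the to_blob helper is hypothetical, resizing is left out, and it assumes the mean is given in output channel order and subtracted before scaling.

```python
def to_blob(image, scale=1 / 255, mean=(0.0, 0.0, 0.0), swap_rb=True):
    """Mimic the preprocessing steps on a height x width x 3 nested list.

    `image` is assumed to be in BGR order (as OpenCV loads images).
    Returns a 1 x 3 x height x width nested list; no resizing is done.
    """
    # Source channel index for each output channel (BGR -> RGB when swapping).
    order = (2, 1, 0) if swap_rb else (0, 1, 2)
    planes = []
    for out_ch, src_ch in enumerate(order):
        plane = [[(pixel[src_ch] - mean[out_ch]) * scale for pixel in row]
                 for row in image]
        planes.append(plane)
    return [planes]  # leading batch dimension of size 1


# A 2-row by 3-column image with BGR pixels.
image = [
    [[255, 0, 0], [0, 255, 0], [0, 0, 255]],
    [[0, 0, 0], [255, 255, 255], [51, 102, 204]],
]
blob = to_blob(image)
shape = (len(blob), len(blob[0]), len(blob[0][0]), len(blob[0][0][0]))
print(shape)          # (images, channels, height, width)
print(blob[0][0][0])  # first row of the red plane
```

Each channel ends up as its own height x width plane, which is why the blob can be displayed as three separate grayscale images.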

The blob object is given as input to the network. I then retrieve all the layer names from the network, determine the output layers, and run forward propagation to obtain the outputs of those layers.

Each output is a set of vectors of length 85:

4x the predicted bounding box (centerx, centery, width, height)
1x box confidence: the confidence score or probability assigned to the predicted bounding box. It represents the model's estimate of how confident it is that the bounding box accurately encloses an object in the image.
80x class confidence: these scores indicate the probabilities or confidences that an object detected in the image belongs to a particular class among the 80 classes.
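
A minimal sketch of how one such vector could be decoded is shown below. It assumes that the four box values are normalized to the range 0 to 1 relative to the image size, so they are scaled back to pixels. The decode_detection helper and the sample vector are hypothetical.

```python
def decode_detection(vec, img_w, img_h):
    """Split one 85-value output row into a pixel box, class id and scores."""
    cx, cy, w, h = vec[0:4]
    objectness = vec[4]
    class_scores = vec[5:]
    # Index of the highest class score (a plain-Python argmax).
    class_id = max(range(len(class_scores)), key=class_scores.__getitem__)
    score = class_scores[class_id]
    # Convert the normalized center box to a top-left pixel box.
    box_w, box_h = w * img_w, h * img_h
    x = cx * img_w - box_w / 2
    y = cy * img_h - box_h / 2
    box = [int(x), int(y), int(box_w), int(box_h)]
    return box, class_id, score, objectness


vec = [0.5, 0.5, 0.25, 0.5, 0.9] + [0.0] * 80
vec[5 + 2] = 0.8  # pretend class index 2 scored highest
box, class_id, score, objectness = decode_detection(vec, img_w=640, img_h=480)
print(box, class_id, score, objectness)
```

The resulting (x, y, width, height) box uses the top-left convention, which is the form typically passed on to the post-processing step.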

Post processing

After obtaining the bounding boxes and their corresponding confidences from the output of an object detection model, Non-Maximum Suppression (NMS) is commonly applied to select the best bounding boxes.

NMS is a post-processing technique used to reduce redundant or overlapping bounding box detections. It aims to select the most accurate and representative bounding boxes while removing duplicates or highly overlapping detections.

The cv.dnn.NMSBoxes() function is a utility function provided by OpenCV that performs NMS. It takes several parameters:

boxes: This parameter represents the bounding boxes detected in the image. Each bounding box is typically represented as a list of four values (x, y, width, height) or as a tuple.

confidences: This parameter contains the confidence scores associated with each bounding box. The confidence scores indicate the likelihood that the corresponding bounding box contains an object of interest.

score_threshold: This parameter specifies the minimum confidence score threshold for considering a bounding box during NMS. Any bounding box with a confidence score below this threshold will be disregarded.

nms_threshold: This parameter determines the overlap threshold for suppressing redundant bounding boxes. If the overlap between two bounding boxes exceeds this threshold, the one with the lower confidence score is suppressed.

The cv.dnn.NMSBoxes() function returns the indices of the selected bounding boxes that passed the NMS process. These indices correspond to the original list of bounding boxes and confidences, allowing you to access the selected boxes and their associated information.
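
To show the idea behind these parameters, here is a simplified greedy NMS written in plain Python. It is a sketch of the general technique, not OpenCV's implementation; the iou and nms helpers and the sample boxes are hypothetical, and details such as how ties or threshold boundaries are handled may differ from cv.dnn.NMSBoxes().

```python
def iou(a, b):
    """Intersection over union of two (x, y, width, height) boxes."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    inter_w = max(0, min(ax2, bx2) - max(a[0], b[0]))
    inter_h = max(0, min(ay2, by2) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, confidences, score_threshold, nms_threshold):
    """Return the indices of the boxes kept by greedy NMS."""
    # Drop low-confidence boxes, then visit the rest from best to worst.
    order = sorted(
        (i for i, c in enumerate(confidences) if c >= score_threshold),
        key=lambda i: confidences[i],
        reverse=True,
    )
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a kept one.
        if all(iou(boxes[i], boxes[k]) <= nms_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [[100, 100, 50, 50], [105, 105, 50, 50], [300, 300, 40, 40], [0, 0, 10, 10]]
confidences = [0.9, 0.8, 0.7, 0.2]
kept = nms(boxes, confidences, score_threshold=0.5, nms_threshold=0.4)
print(kept)
```

In this example the second box overlaps heavily with the first and has a lower score, so it is suppressed, while the last box is dropped by the score threshold before overlap is even considered.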

In this project, the user can choose specific classes from the 80 predefined ones, and the input images are tagged automatically with those classes after prediction. For example, for a sample containing a person, a car and a bicycle, the relevant predicted bounding boxes can be shown:
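
One way this selection step could look is sketched below. The tags_for_image helper, the partial CLASS_NAMES table (assumed to follow the usual ordering of the 80 COCO class names) and the sample detections are all hypothetical, and the code is independent of how digiKam actually stores tags.

```python
# Only a handful of the 80 class names are listed here for brevity.
# The indices are assumed to follow the usual COCO class-name ordering.
CLASS_NAMES = {0: "person", 1: "bicycle", 2: "car", 16: "dog"}


def tags_for_image(detections, selected):
    """Return the sorted, de-duplicated tags among the user's selected classes.

    `detections` is a list of (class_id, confidence) pairs kept after NMS.
    """
    found = {CLASS_NAMES[class_id] for class_id, _score in detections
             if class_id in CLASS_NAMES}
    return sorted(found & set(selected))


detections = [(0, 0.97), (2, 0.88), (1, 0.75), (0, 0.64), (16, 0.55)]
tags = tags_for_image(detections, selected={"person", "car", "bicycle"})
print(tags)
```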

Here is the result after passing the images through YOLO; each picture shows the object boxes that match the specific classes chosen by the user.