YOLO-World - recognizing an arbitrary number of objects with high accuracy and speed

Just a few days ago, a new model in the YOLO family was presented. Its main trick is that, unlike its predecessors, it can recognize virtually any object a person asks for in an image without prior training for those classes, and it does so in real time! Sounds pretty good, doesn't it?

In this article we will try to understand what magic is hiding inside the new architecture. This is meant as an introductory overview, so readers who like rigorous math are encouraged to read the original paper afterwards. But before we start the review, let's recall the main types of object detection tasks.

Types of detection tasks

I think many readers who are already a little familiar with Computer Vision, when they hear "detection task", think of traditional object detection. In this approach, the model can only detect a strictly predefined list of objects on which it has been trained.

Another type, which has been gaining popularity in recent years due to its greater flexibility and often excellent performance without task-specific training, is so-called open-vocabulary object detection. The main idea of detectors that solve the "open vocabulary" recognition problem is to use not integer class labels but embeddings (vector representations) of the class names. This allows them to find classes that were not explicitly specified in advance and to work well with phrases. For example, we can make the model search not just for cats, but for a specific breed, even if it was not in the training data. Such detectors can find an almost unlimited number of object classes in an image.

It is worth noting here that by "unlimited" we mean that you still have to define the list of objects you want to recognize, but this list can be arbitrarily large (though finite). For example, one could hand such a detector the 21 thousand class names from the full ImageNet dataset, and it would actually try to recognize every single one of them!

As mentioned in the beginning, the hero of this article, YOLO-World, is exactly one of the detectors capable of solving the more interesting second problem. But the authors went further and proposed the so-called prompt-then-detect approach. Whereas earlier open-vocabulary detectors re-encoded the list of words of interest for every input (an online vocabulary), here the user's prompts are encoded once into an offline vocabulary, and these cached embeddings go further down the pipeline. This reduces the amount of computation per input and also provides flexibility, since the vocabulary can be adjusted as needed.
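To make the idea concrete, here is a minimal sketch of the prompt-then-detect workflow, assuming a frozen CLIP text encoder from the Hugging Face transformers library as a stand-in for YOLO-World's text branch (the prompt list and variable names are illustrative, not from the paper):

```python
import torch
import torch.nn.functional as F
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")

# The user's prompts are encoded once, up front, into an "offline vocabulary".
prompts = ["person", "red backpack", "golden retriever"]
with torch.no_grad():
    tokens = tokenizer(prompts, padding=True, return_tensors="pt")
    offline_vocab = F.normalize(text_encoder(**tokens).text_embeds, dim=-1)

# At inference time only the vision branch runs per image; the cached
# `offline_vocab` embeddings are reused for every frame and compared with
# the region embeddings the detector produces (see the matching step below).
torch.save(offline_vocab, "offline_vocab.pt")
```

The point is simply that no text encoding happens inside the per-image loop, which is where the claimed efficiency gain for deployment comes from.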

So, having realized the potential of the new model, many readers are probably eager to get to the architecture itself and understand its main features. Let's get started!

Architecture

Like many modern neural network architectures, YOLO-World can be broken down into several separate blocks. Let's take a closer look at some of them.

YOLO Detector

Here everything is simple: the relatively new YOLOv8 is used to extract image features. It, in turn, contains a Darknet backbone as the image encoder and a PAN (Path Aggregation Network) to produce multi-scale features.

Text Encoder

To get text embeddings, the already well-established CLIP is used, namely its transformer for text encoding. The CLIP encoder generates text vectors so that they can be well matched to their corresponding image vector representations (high cosine similarity).
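As a quick illustration of that alignment (the file name and prompts below are made up), standard CLIP usage via the transformers library looks like this; the text whose embedding is closest to the image embedding gets the highest score:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")  # any local image
texts = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# logits_per_image holds the scaled cosine similarities between the image
# embedding and each text embedding; softmax turns them into probabilities.
print(out.logits_per_image.softmax(dim=-1))
```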

Re-parameterizable Vision-Language PAN

This is perhaps the main building block of the entire architecture. It consists of top-down and bottom-up paths, within which the previously extracted text embeddings and multi-scale image features are fused.

Inside, the block includes two main components: the Text-guided CSPLayer and Image-Pooling Attention. In layman's terms, the first injects language information into the image features, and the second puts information from the image into the text embeddings.
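The paper gives the exact formulation; below is only a simplified PyTorch sketch of the two ideas (module and dimension names are mine, not the authors'): the text-guided gate re-weights each spatial location by its maximum similarity to any prompt embedding, while image-pooling attention lets the text embeddings attend to a small set of pooled image tokens.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaxSigmoidGate(nn.Module):
    """Simplified text-guided gating in the spirit of the T-CSPLayer:
    each pixel is re-weighted by its max similarity to any text embedding."""
    def forward(self, img_feats, text_embeds):
        # img_feats: (B, D, H, W), text_embeds: (B, C, D)
        B, D, H, W = img_feats.shape
        pixels = img_feats.flatten(2).transpose(1, 2)       # (B, H*W, D)
        sim = pixels @ text_embeds.transpose(1, 2)          # (B, H*W, C)
        gate = sim.max(dim=-1).values.sigmoid()             # (B, H*W)
        return img_feats * gate.view(B, 1, H, W)

class ImagePoolingAttention(nn.Module):
    """Simplified image-pooling attention: text embeddings attend to a few
    max-pooled image tokens (3x3 per scale in the paper)."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text_embeds, img_feats):
        # img_feats: (B, D, H, W) -> 9 pooled tokens of shape (B, 9, D)
        pooled = F.adaptive_max_pool2d(img_feats, 3).flatten(2).transpose(1, 2)
        updated, _ = self.attn(text_embeds, pooled, pooled)
        return text_embeds + updated
```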

RepVL-PAN is followed by the Box Head and Text Contrastive Head blocks. The former predicts the bounding boxes of objects, while the latter produces their embeddings (trained to reflect object-text similarity).

Thus, at the end of the pipeline we have the bounding boxes and embeddings of the objects in the image, as well as the text vectors of the classes we want to detect. By matching them, i.e. comparing the pairwise similarities between object embeddings and text embeddings, the output becomes a list of detected classes with corresponding scores (above a given similarity threshold).
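A minimal sketch of that final matching step might look as follows (the 0.3 threshold and the function name are illustrative, not from the paper):

```python
import torch
import torch.nn.functional as F

def assign_labels(object_embeds, text_embeds, class_names, threshold=0.3):
    """object_embeds: (num_boxes, D) region embeddings from the detector,
    text_embeds: (num_classes, D) embeddings of the class names/prompts."""
    obj = F.normalize(object_embeds, dim=-1)
    txt = F.normalize(text_embeds, dim=-1)
    sim = obj @ txt.T                         # pairwise cosine similarities
    scores, idx = sim.max(dim=-1)             # best-matching class per box
    keep = scores > threshold                 # drop low-confidence matches
    return [(class_names[int(i)], float(s)) for i, s in zip(idx[keep], scores[keep])]
```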

To a first approximation, that is all there is to the model itself! To keep the review simple, I have not included formulas or rigorous mathematical derivations. Those who wish to understand all the details and subtleties are referred to the original paper. There you can also read how all of this is trained with the region-text contrastive loss and find descriptions of many experiments on fine-tuning for specific tasks on different datasets, with comparisons against previous solutions.

Accuracy and speed of performance

Of course, architecture is good, but what metrics does the new SOTA show in object recognition? The authors did not leave us without a speed-accuracy graph, which without excessive modesty declares a 20-fold speed improvement with at least comparable accuracy (mAP), measured on the LVIS dataset.

This significant speed improvement is primarily due to the lightweight YOLO backbone used for image feature extraction, whereas previous architectures used heavier transformers (e.g., Swin) for this purpose.

The recognition quality, in turn, is largely attributable to the main block (RepVL-PAN), which performs multi-level cross-modal fusion of features (text and image).

Conclusion

To wrap up, let us highlight the key features of the new YOLO-World detector:

Able to recognize an unlimited number of object classes (including phrases) out of the box

Even the largest model runs at real-time speed at inference (the first network to do so for the open-vocabulary detection (OVD) task)

Uses both established architectures (YOLOv8, CLIP) and promising new ones (RepVL-PAN).

I can add that if you need a good solution for recognizing objects of arbitrary kinds and you do not have much data for training traditional detectors (or no data at all), you can safely consider using this model!

And of course, those who have read to the end may still have one question: is there open-source code? Fortunately, the answer is yes! However, at the time of writing the inference code is not yet available, and the code that has been released may still be rough. But the model weights have already been published, so the particularly curious can experiment.

Translated from here