YOLO-World - recognizing an arbitrary number of objects with high accuracy and speed
Just a few days ago, a new model in the YOLO family was presented. Its main trick is that, unlike its older siblings, it can recognize virtually any object of interest in an image without task-specific training, and it does so in real time! Sounds pretty good, doesn't it?
In this article we will try to understand what magic is hiding inside the new architecture. This is meant as an introductory overview, so I recommend that readers who enjoy rigorous math read the original paper afterwards. But before we start the review, let's learn (or recall) the main types of object detection tasks.
![yolo world](/_next/image?url=%2Fstatic%2Fimages%2Fyolow%2F1.png&w=1920&q=75)
![yolo world](/_next/image?url=%2Fstatic%2Fimages%2Fyolow%2F2.png&w=1920&q=75)
It is worth noting here that by "unlimited" we mean that you still have to define a list of objects you want to recognize, but this list can be arbitrarily large (though finite). For example, you could hand such a detector the 21 thousand class names from the full ImageNet dataset, and it would actually try to recognize every single one of them!
As mentioned at the beginning, the hero of this article, YOLO-World, is one of the detectors capable of solving the more interesting second problem. But the authors of the paper went further and proposed the so-called prompt-then-detect approach. Previously, open-vocabulary detectors used an online vocabulary: the user's list of words was re-encoded by the text model at every inference. Now we instead form prompts that are encoded once into an offline vocabulary, and these cached embeddings flow down the rest of the pipeline. This reduces the amount of computation for each input and also provides flexibility, since the vocabulary can be re-encoded whenever it needs to change.
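The prompt-then-detect idea boils down to caching. Here is a minimal Python sketch, where `encode_text` is a stand-in for a real text encoder such as CLIP's (the seeded random projection is purely an assumption to keep the sketch self-contained):

```python
import numpy as np

def encode_text(prompts, dim=512):
    """Stand-in for a real text encoder (e.g. CLIP's text transformer):
    returns one L2-normalized vector per prompt. A deterministic seeded
    random projection keeps the sketch self-contained."""
    vecs = []
    for p in prompts:
        rng = np.random.default_rng(sum(map(ord, p)))
        v = rng.standard_normal(dim)
        vecs.append(v / np.linalg.norm(v))
    return np.stack(vecs)

class OfflineVocabulary:
    """Prompt-then-detect: encode the user's prompts once, then reuse
    the cached embeddings for every frame, instead of re-running the
    text encoder per inference (the 'online vocabulary' approach)."""
    def __init__(self, prompts):
        self.classes = list(prompts)
        self.embeddings = encode_text(prompts)  # computed once, offline

vocab = OfflineVocabulary(["person", "dog", "red backpack"])
vocab.embeddings.shape  # (3, 512)
```

At inference time, only `vocab.embeddings` is handed to the detector, so the text encoder's cost is paid once per vocabulary, not once per frame.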
![yolo world](/_next/image?url=%2Fstatic%2Fimages%2Fyolow%2F3.png&w=2048&q=75)
![yolo world](/_next/image?url=%2Fstatic%2Fimages%2Fyolow%2F4.png&w=2048&q=75)
YOLO Detector
Here everything is simple: the relatively new YOLOv8 is used to extract image features. It in turn contains a Darknet backbone as the encoder and a PAN (Path Aggregation Network) to generate multi-scale features.
Text Encoder
To get text embeddings, the already well-established CLIP is used, namely its text-encoding transformer. The CLIP encoder produces text vectors that match their corresponding image representations well (high cosine similarity).
Re-parameterizable Vision-Language PAN
![yolo world](/_next/image?url=%2Fstatic%2Fimages%2Fyolow%2F5.png&w=1920&q=75)
The block includes two main components: the CSPLayer and Image-Pooling Attention. In layman's terms, the first injects language information into the image features, and the second puts information from the image into the text embeddings.
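The second component can be sketched in numpy. This is a greatly simplified illustration of the Image-Pooling Attention idea, not the paper's exact formulation: single head, no learned projections, and the 3x3 pooling grid and residual update are assumptions made for brevity:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def image_pooling_attention(text_emb, feature_maps, pool=3):
    """Sketch of Image-Pooling Attention: each multi-scale feature map
    is max-pooled down to a pool x pool grid, the pooled tokens are
    concatenated, and the text embeddings attend over them to absorb
    image information (learned projections omitted for brevity)."""
    tokens = []
    for fmap in feature_maps:                 # fmap: (H, W, C)
        h, w, c = fmap.shape
        hs, ws = h // pool, w // pool
        cropped = fmap[:hs * pool, :ws * pool]
        pooled = cropped.reshape(pool, hs, pool, ws, c).max(axis=(1, 3))
        tokens.append(pooled.reshape(pool * pool, c))
    tokens = np.concatenate(tokens)           # (27, C) for three scales
    scores = text_emb @ tokens.T / np.sqrt(text_emb.shape[-1])
    attn = softmax(scores)                    # (num_texts, num_tokens)
    return text_emb + attn @ tokens           # residual update of text vectors
```

With three feature scales this yields 27 image tokens, and each text embedding is updated by a weighted mix of them while keeping its original shape.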
RepVL-PAN is followed by the Box Head and Text Contrastive Head blocks. The former predicts the bounding boxes of objects, while the latter predicts their embeddings (trained for object-text proximity).
Thus, at the end of the pipeline we have the bounding boxes and embeddings of the objects in the image, as well as the text vectors of the classes we want to detect. A matching step compares the pairwise similarities between the box embeddings and the text vectors, and the output is a list of detected classes with corresponding probabilities (above a given similarity threshold).
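This matching step can be sketched in a few lines of numpy. Plain cosine similarity with a hard threshold is an illustrative assumption here; the actual model applies learned scaling to the scores, and the threshold value is arbitrary:

```python
import numpy as np

def match_boxes_to_classes(box_emb, text_emb, class_names, thresh=0.6):
    """Assign each detected box the class whose text embedding is
    closest by cosine similarity, keeping only matches above the
    threshold. Embeddings are L2-normalized first, so the dot
    product equals cosine similarity."""
    b = box_emb / np.linalg.norm(box_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sim = b @ t.T                        # (num_boxes, num_classes)
    best = sim.argmax(axis=1)            # best class per box
    results = []
    for i, j in enumerate(best):
        if sim[i, j] >= thresh:
            results.append((i, class_names[j], float(sim[i, j])))
    return results

# Toy example: box 0 clearly matches class 0, box 1 matches nothing well.
text = np.eye(3)
boxes = np.array([[0.9, 0.1, 0.0],
                  [0.3, 0.3, 0.3]])
match_boxes_to_classes(boxes, text, ["person", "dog", "cat"])
```

In the toy example, the first box scores about 0.99 against "person" and is kept, while the second box scores roughly 0.58 against every class and falls below the threshold.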
To a first approximation, that is all there is to the model itself! To keep the review accessible, I have not included the formulas and rigorous mathematical derivations. Those who wish to understand all the details and subtleties are referred to the original paper. There you can also read how all of this is trained with the Region-Text Contrastive Loss, and find descriptions of many fine-tuning experiments on different datasets comparing against previous solutions.
![yolo world](/_next/image?url=%2Fstatic%2Fimages%2Fyolow%2F6.png&w=1200&q=75)
This significant improvement in performance is primarily due to the lightweight YOLO backbone used for image feature extraction, whereas previous architectures used heavier transformers (e.g., Swin) for this purpose.
The quality of recognition is largely attributable to the main block (RepVL-PAN), which performs multi-level cross-modal fusion of features (text and image).
Conclusion
To wrap up, let's highlight the features of the new YOLO-World detector:
- Able to recognize an unlimited number of objects (including phrases) out of the box
- Even the largest model runs at real-time speed at inference (the first network to do so for the open-vocabulary detection task)
- Uses both established architectures (YOLOv8, CLIP) and a potentially promising new one (RepVL-PAN)
I would add that if you need a good solution for recognizing objects of any kind, and you do not have much data for training traditional detectors (or no data at all), you can safely consider this model!
And of course, those who have read to the end may still be wondering: is it open source? Fortunately, the answer is yes! However, at the time of writing, the inference code is not yet fully available, and what is available may still be rough around the edges. But the model weights have already been released, so the particularly curious can start experimenting.