Open-vocabulary object detection is an important part of many real-world computer vision tasks. However, the limited availability of detection training data and the limitations of pre-trained models often lead to subpar performance and scalability problems.
To address this challenge, the DeepMind research team introduced the OWLv2 model in their recent paper, “Scaling Open-Vocabulary Object Detection.” The optimized architecture improves training efficiency and, combined with the OWL-ST self-training recipe, substantially improves detection performance and achieves state-of-the-art results in open-vocabulary detection.
The main objective of this work is to optimize the label space, annotation filtering, and training efficiency of a self-training approach to open-vocabulary detection, ultimately achieving robust and scalable open-vocabulary performance with limited labeled data.
The proposed self-training method consists of three key steps.
- The team uses an existing open-vocabulary detector to run open-vocabulary box detection on WebLI, a large-scale dataset of web image-text pairs.
- They use OWL-ViT CLIP-L/14 to annotate all WebLI images with pseudo-box annotations.
- They fine-tune the trained model on human-annotated detection data, further refining its performance.
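The pseudo-annotation step above can be pictured with a small sketch. It is not the authors' code: the `Box` type, the `Detector` interface, the `toy_detector` stand-in, and the `min_score` threshold of 0.3 are all assumptions made for illustration. It only shows the general idea of keeping the annotator's confident boxes as training labels.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Box:
    xyxy: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
    query: str    # text query the box was matched to
    score: float  # annotator confidence in [0, 1]


# Hypothetical detector interface: (image, text queries) -> scored boxes.
Detector = Callable[[object, Sequence[str]], List[Box]]


def pseudo_annotate(image, queries: Sequence[str], detector: Detector,
                    min_score: float = 0.3) -> List[Box]:
    """Keep only confident detections as pseudo-box annotations.

    min_score is an illustrative threshold, not a value taken from the article.
    """
    return [box for box in detector(image, queries) if box.score >= min_score]


def toy_detector(image, queries: Sequence[str]) -> List[Box]:
    """Stand-in for a real open-vocabulary detector (made-up scores)."""
    made_up_scores = {"cat": 0.9, "sofa": 0.2}
    return [Box((0.0, 0.0, 10.0, 10.0), q, made_up_scores.get(q, 0.0))
            for q in queries]


pseudo_labels = pseudo_annotate(None, ["cat", "sofa"], toy_detector)
print([box.query for box in pseudo_labels])
```

In a real pipeline the toy detector would be replaced by the annotator model, and the filtered boxes would become the training targets for the new detector before the final fine-tuning step.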
Specifically, the researchers use the OWL-ViT architecture to train more efficient detectors. This architecture uses contrastively trained image-text models to initialize the image and text encoders, while the detection heads are initialized randomly.
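To see why contrastively trained encoders suit open-vocabulary detection, consider a simplified stand-in for the classification step: each predicted box has an embedding, each text query has an embedding, and a box is labeled with the query it is most similar to. The two-dimensional vectors and the function names below are made up for illustration, and a real model would apply further learned transformations to the scores.

```python
import math
from typing import Dict, List, Tuple


def normalize(vector: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def classify_box(box_embedding: List[float],
                 query_embeddings: Dict[str, List[float]]
                 ) -> Tuple[str, Dict[str, float]]:
    """Score one box embedding against every text query embedding."""
    scores = {name: cosine(box_embedding, emb)
              for name, emb in query_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Made-up embeddings; in practice these come from the image and text encoders.
queries = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
best_query, all_scores = classify_box([0.9, 0.1], queries)
print(best_query, all_scores)
```

Because the label set is just a dictionary of text embeddings, new categories can be added at inference time by embedding new query strings, with no change to the detector itself.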
During training, the team uses the same losses as OWL-ViT and, as in OWL-ViT, augments the queries with “pseudo-negatives.”
To further improve training efficiency, they incorporate techniques previously proposed for large Transformer training. As a result, the OWLv2 model cuts training FLOPs by 50% and doubles training throughput compared with the original OWL-ViT model.
In their empirical study, the team compares the proposed approach with previous state-of-the-art open-vocabulary detectors. The OWL-ST technique improves average precision (AP) on LVIS rare classes from 31.2% to 44.6%. Moreover, combining the OWL-ST recipe with the OWLv2 architecture yields new state-of-the-art performance.
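The article does not say how pseudo-negatives are chosen. One plausible reading, sketched below under that assumption, is that extra query strings are sampled at random from a vocabulary of terms that are not among the image's own labels, so the detector also learns what is absent. The function name, the vocabulary, and the sample size are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def add_pseudo_negatives(positive_queries: Sequence[str],
                         vocabulary: Sequence[str],
                         num_negatives: int,
                         seed: int = 0) -> Tuple[List[str], List[str]]:
    """Append randomly sampled non-matching terms to an image's query list.

    Assumption: negatives are drawn uniformly from vocabulary entries that
    are not already positive queries for this image.
    """
    rng = random.Random(seed)
    positives = set(positive_queries)
    candidates = [term for term in vocabulary if term not in positives]
    count = min(num_negatives, len(candidates))
    negatives = rng.sample(candidates, count)
    return list(positive_queries) + negatives, negatives


vocabulary = ["cat", "dog", "car", "tree", "lamp"]
queries, negatives = add_pseudo_negatives(["cat"], vocabulary, num_negatives=2)
print(queries, negatives)
```

The sampled terms are treated as queries with no matching box in the image, which gives the classification loss explicit negative examples.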
Overall, the OWL-ST recipe proposed in this paper can substantially improve detection performance by leveraging weak supervision from large-scale web data, enabling web-scale training for open-world settings. This approach addresses the limitations caused by the scarcity of labeled detection data and shows that robust open-vocabulary detection can be obtained in a cost-effective way.
Niharika is a Technical Consulting Intern at Marktechpost. She is currently a third-year undergraduate pursuing a B.Tech at the Indian Institute of Technology (IIT), Kharagpur. She has a keen interest in machine learning, data science, and AI, and is an avid reader of the latest developments in these fields.