Data is the new soil, and in this fertile new ground, MIT researchers are planting more than just pixels. By using synthetic images to train machine learning models, a team of scientists recently surpassed results obtained from traditional “real-image” training methods.
At the heart of the approach is a system called StableRep, which doesn’t just use any computer-generated images; it generates them itself through ultra-popular text-to-image models like Stable Diffusion. It’s like creating worlds with words.
So what’s in StableRep’s secret sauce? A strategy called “multi-positive contrastive learning.”
“Rather than just feeding the model data, we’re teaching it to learn high-level concepts through context and variance,” says Lijie Fan, an MIT doctoral student in electrical engineering affiliated with the Computer Science and Artificial Intelligence Laboratory (CSAIL) and lead researcher on the work. “When multiple images, all generated from the same text, are treated as depictions of the same underlying thing, the model digs deeper into the concepts behind the images, for example the object itself, not just their pixels.”
The approach treats multiple images generated from identical text prompts as positive pairs, providing additional information during training: it not only adds diversity but also tells the vision system which images are alike and which are different. Remarkably, StableRep outperformed top-tier models trained on real images, such as SimCLR and CLIP, on large-scale datasets.
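To make the idea concrete, here is a minimal sketch of a multi-positive contrastive loss in PyTorch, in the spirit of what the researchers describe: every image generated from the same caption is treated as a positive for every other image from that caption, while images from different captions serve as negatives. The function name, temperature, and batch layout are illustrative assumptions, not the team’s actual implementation.

```python
import torch
import torch.nn.functional as F

def multi_positive_contrastive_loss(embeddings, caption_ids, temperature=0.1):
    """Illustrative sketch: images sharing a caption id are mutual positives."""
    z = F.normalize(embeddings, dim=1)            # unit-length image features
    logits = z @ z.t() / temperature              # pairwise similarity scores
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    logits = logits.masked_fill(self_mask, -1e9)  # an image is never its own positive

    # Positives: images generated from the same text prompt (same caption id).
    pos_mask = (caption_ids.unsqueeze(0) == caption_ids.unsqueeze(1)) & ~self_mask

    # Cross-entropy against a uniform distribution over each image's positives.
    log_prob = F.log_softmax(logits, dim=1)
    target = pos_mask.float() / pos_mask.sum(dim=1, keepdim=True).clamp(min=1)
    return -(target * log_prob).sum(dim=1).mean()

# Example: a batch of 8 images drawn from 2 captions, 4 images per caption.
features = torch.randn(8, 128)
captions = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])
loss = multi_positive_contrastive_loss(features, captions)
```

In standard SimCLR-style training each image has only one positive, an augmented view of itself; treating several generations of the same caption as mutual positives is what supplies the extra supervisory signal described above.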
“While StableRep helps alleviate the challenges of data acquisition in machine learning, it also ushers in a new era of AI training techniques. The ability to produce diverse, high-caliber synthetic images on demand could help cut cumbersome expenses and resources,” says Fan.
The data collection process has never been simple. In the 1990s, researchers had to manually capture photographs to assemble datasets of objects and faces. The 2000s saw individuals scouring the internet for data. However, this raw, uncurated data often contained deviations from real-world scenarios and reflected societal biases, presenting a distorted view of reality. The task of cleaning datasets through human intervention is not only expensive, but also extremely difficult. Imagine, however, if this arduous data collection could be reduced to something as simple as issuing a command in natural language.
A key aspect of StableRep’s triumph is the adjustment of the “guidance scale” in the generative model, which strikes a delicate balance between the diversity and fidelity of the synthetic images. Once tuned, the synthetic images used to train these self-supervised models proved as effective as, if not more effective than, real images.
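For a sense of the knob being tuned, the sketch below uses the open-source Hugging Face diffusers library to draw several images from one caption at a chosen classifier-free guidance scale; lower values tend to yield more varied samples, while higher values hew closer to the prompt. The checkpoint name and guidance value are placeholders for illustration, not the paper’s reported settings.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a Stable Diffusion checkpoint (placeholder model id).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# One caption, several samples: the guidance scale trades prompt fidelity
# against diversity among the generated images.
out = pipe(
    "a photo of a golden retriever catching a frisbee",
    num_images_per_prompt=4,   # multiple candidate "positives" from one text
    guidance_scale=2.0,        # illustrative value; tuned in practice
)
synthetic_images = out.images  # list of PIL images usable as training data
```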
Going further, language supervision has been added to the mix, creating an improved variant: StableRep+. When trained with 20 million synthetic images, StableRep+ not only achieved superior accuracy, but also displayed remarkable efficiency compared to CLIP models trained with a staggering 50 million real images.
Yet the road ahead is not without its potholes. The researchers candidly address several limitations, including the current slowness of image generation, semantic mismatches between text prompts and the resulting images, potential amplification of biases, and the difficulty of image attribution, all of which must be addressed for future progress. Another issue is that StableRep requires first training the generative model on large-scale real data. The team acknowledges that starting with real data remains a necessity; however, once you have a good generative model, you can reuse it for new tasks, like training recognition models and visual representations.
Although StableRep offers a good solution by reducing the dependence on large collections of real images, it also highlights concerns about hidden biases in the uncurated data behind these text-to-image models. The choice of text prompts, an integral part of the image-synthesis process, is not entirely free of bias, “indicating the essential role of meticulous text selection or possible human curation,” says Fan.
“With the latest text-to-image models, we have gained unprecedented control over image generation, enabling a diverse range of visuals from a single text input. This surpasses real-world image collection in efficiency and versatility. It is particularly useful in specialized tasks, such as balancing the variety of images in long-tail recognition, and is a practical complement to using real images for training,” says Fan. “Our work represents a step forward in visual learning, toward the goal of providing cost-effective training alternatives while highlighting the need to continually improve data quality and synthesis.”
“One of the dreams of generative model learning has long been to be able to generate data that is useful for training discriminative models,” says David Fleet, a researcher at Google DeepMind and professor of computer science at the University of Toronto, who was not involved in the paper. “Although we have seen some signs of life, the dream has remained elusive, especially in complex, large-scale domains like high-resolution images. This paper provides, for the first time to my knowledge, compelling evidence that the dream is becoming a reality. They show that contrastive learning from massive amounts of synthetic image data can produce representations that outperform those learned from large-scale real data, with the potential to improve myriad downstream vision tasks.”
Fan is joined by Yonglong Tian PhD ’22 as lead authors of the paper, as well as Phillip Isola, associate professor of electrical engineering and computer science at MIT and CSAIL principal investigator; Huiwen Chang, Google researcher and OpenAI technical staff member; and Dilip Krishnan, research scientist at Google. The team will present StableRep at the 2023 Neural Information Processing Systems (NeurIPS) conference in New Orleans.