As reported by the Financial Times, AI companies are exploring a new way to obtain training data for powerful generative models: synthetic data, i.e., computer-generated information. Microsoft, OpenAI, and Cohere are among those using synthetic data to train their large language models (LLMs) as the supply of human-made data runs up against its limits.
The launch of Microsoft-backed OpenAI's ChatGPT has spawned a wave of products that generate plausible text, images, or code from simple prompts. Generative AI has attracted significant interest, with tech giants such as Google, Microsoft, and Meta competing for a lead in the field.
LLMs powering chatbots like ChatGPT and Google’s Bard primarily rely on web scraping techniques to accumulate data from books, articles, social media, videos, and more.
However, as generative AI software becomes increasingly sophisticated, AI companies face growing challenges around data access and privacy. Synthetic data offers a cost-effective alternative.
Cohere and competitors use synthetic data generated by AI models and fine-tuned by humans. For example, Cohere might use two AI models simulating a conversation between a math tutor and a student to train a model on advanced mathematics.
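The tutor–student setup described above can be sketched as a loop that alternates between two model roles and records the transcript as training data. This is a minimal illustration, not Cohere's actual pipeline: `mock_generate` is a hypothetical stand-in for real LLM API calls, and the canned replies are invented for demonstration.

```python
# Sketch of generating a synthetic tutor-student transcript.
# Assumption: mock_generate stands in for calls to two real LLM endpoints.

TUTOR_LINES = [
    "What is the derivative of x**2?",
    "Good. Now differentiate (3*x + 1)**2 using the chain rule.",
]
STUDENT_LINES = [
    "The derivative of x**2 is 2*x.",
    "That gives 2*(3*x + 1)*3, which simplifies to 18*x + 6.",
]

def mock_generate(role, turn):
    """Hypothetical stand-in for an LLM completion call."""
    lines = TUTOR_LINES if role == "tutor" else STUDENT_LINES
    return lines[turn % len(lines)]

def simulate_dialogue(n_turns=4):
    """Alternate a 'tutor' model and a 'student' model, collecting
    the exchange as a synthetic fine-tuning transcript."""
    transcript = []
    for turn in range(n_turns):
        role = "tutor" if turn % 2 == 0 else "student"
        utterance = mock_generate(role, turn // 2)
        transcript.append({"role": role, "text": utterance})
    return transcript

dialogue = simulate_dialogue()
for message in dialogue:
    print(f"{message['role']}: {message['text']}")
```

In practice the transcript would then be filtered and fine-tuned by human reviewers before being added to the training mix, as the article notes.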
Recent research from Microsoft shows synthetic data can effectively train smaller, simpler models. One instance involved a synthetic dataset of short stories generated by GPT-4, which trained a simple LLM to produce coherent and grammatically correct stories.
Startups like Scale AI and Gretel.ai offer synthetic data services, preserving privacy and removing biases. Synthetic data helps financial institutions examine fraud scenarios and other applications.
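One common approach behind such services is to fit a statistical model to a sensitive column and sample new records from it, so that synthetic rows preserve aggregate patterns without reproducing any real individual's data. The sketch below is a deliberately simple illustration of that idea (a per-column Gaussian fit using only the standard library); the transaction amounts are invented, and real products use far more sophisticated generative models.

```python
import random
import statistics

# Toy "real" transaction amounts (assumption: invented for illustration).
real_amounts = [12.5, 99.0, 43.2, 250.0, 18.7, 77.3, 5.0, 130.4]

def synthesize(values, n, seed=0):
    """Sample n synthetic values from a normal distribution fitted to
    the real column, clipped at zero so amounts stay non-negative.
    No real record is copied into the output."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    rng = random.Random(seed)
    return [max(0.0, rng.gauss(mu, sigma)) for _ in range(n)]

fake_amounts = synthesize(real_amounts, 100)
```

A fraud team could then stress-test detection rules against `fake_amounts` (or a richer multi-column equivalent) without ever exposing customer records.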
Critics warn that training on AI-generated data as raw material could degrade the technology over time by compounding falsehoods. Nevertheless, some AI researchers see synthetic data as a path toward superintelligent systems that can create new knowledge and pose their own questions.