As a follow-up to “
Synthetic data is not a replacement for high-quality original data, especially because of the risk of model collapse. This issue was highlighted in a conversation with
The paper points out that “When models are trained on data that has been generated recursively, they can become increasingly biased towards the synthetic data, leading to a degradation in performance when exposed to real-world data.”
The “
Understanding this susceptibility is crucial since transformers are widely used in machine learning applications. More research is needed to determine whether the architecture itself contributes to model collapse or whether the issue lies primarily with the quality of the synthetic data.
Both sides of the argument make compelling points about using synthetic data for model training, how transformers work, and why they might be impacted by recursive training on synthetic data. I also highlight the importance of data quality, lineage, observability, and monitoring as essential components for avoiding these pitfalls.
Model collapse is usually measured by evaluating the model’s performance on real data after training on synthetic data. For example:
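A minimal, self-contained sketch of such an evaluation loop (the toy data, the GaussianMixture model, and the generation count are my own illustration, not taken from any of the papers): fit a generative model to real data, retrain it repeatedly on its own samples, and track the held-out real-data log-likelihood at each generation.

```python
# Sketch: measure collapse as the change in held-out real-data log-likelihood when
# each new generation is trained only on the previous model's samples.
# Toy setup; results vary with seed, sample size, and model choice.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
real_train = np.concatenate([rng.normal(-2, 1, 250), rng.normal(3, 0.5, 250)]).reshape(-1, 1)
real_test = np.concatenate([rng.normal(-2, 1, 250), rng.normal(3, 0.5, 250)]).reshape(-1, 1)

data = real_train
for generation in range(10):
    model = GaussianMixture(n_components=2, random_state=0).fit(data)
    print(f"gen {generation}: held-out log-likelihood = {model.score(real_test):.3f}")
    data, _ = model.sample(len(real_train))  # next generation trains only on synthetic samples
```

A score on the fixed real test set that drifts downward across generations is the usual symptom; the effect is stochastic and becomes more pronounced with smaller samples and more generations.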
The Nature article also highlights that model collapse worsens over successive generations of recursive training. As models are repeatedly trained on synthetic data generated by other models, the biases and inaccuracies compound, leading to significant performance degradation.
From the Nature paper: a model can be trained to generate synthetic images of handwritten digits, such as those from the MNIST dataset. Initially, the model performs well, creating images that are indistinguishable from real ones. However, if this model is then used to generate a new training dataset, and subsequent models are trained on this recursively generated data, the quality of the images deteriorates. Over multiple generations, the images become increasingly distorted, losing the original characteristics of the handwritten digits. This recursive training amplifies the errors, leading to a collapse in the model’s ability to produce realistic images.
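To make the recursive-image idea concrete without reproducing the paper’s MNIST experiment, here is a rough analogue of the same loop on scikit-learn’s small 8x8 digits dataset; the PCA dimensionality, mixture size, and generation count are arbitrary choices for illustration.

```python
# Toy analogue (not the paper's experiment): fit a simple generative model to real
# digit images, then retrain it on its own samples and watch the log-likelihood of
# held-out real digits erode across generations.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

X = load_digits().data                     # 1797 real 8x8 digit images, flattened to 64 features
X_train, X_test = X[:1500], X[1500:]

pca = PCA(n_components=20, whiten=True, random_state=0).fit(X_train)
data, test = pca.transform(X_train), pca.transform(X_test)

for generation in range(5):
    gmm = GaussianMixture(n_components=10, random_state=0).fit(data)
    print(f"gen {generation}: log-likelihood of real digits = {gmm.score(test):.2f}")
    data, _ = gmm.sample(len(data))        # the next generation sees only synthetic digits
```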
The crux of the issue with synthetic data and model collapse is that synthetic data is not a substitute for high-quality data. The papers discussed above repeatedly highlight that the quality of data used in training is critical to maintaining model performance and avoiding collapse. This is why data quality tooling around lineage, observability, and monitoring is so important.
Despite its limitations in training, synthetic data has valuable applications, especially when combined with Privacy Enhancing Technologies (PETs). For example:
I plan to follow up on the topic of synthetic data with a balanced view on leveraging synthetic data and PETs, exploring their best uses and offering practical ideas for integrating these technologies into a comprehensive data strategy.
This paper examines whether model collapse, where models trained on synthetic data degrade in performance over time, is inevitable. The authors suggest a strategy to mitigate this risk by mixing real and synthetic data during training. Their approach, which involves preserving some real data while adding synthetic data, helps maintain model performance over successive generations. The paper emphasizes the mathematical basis of this method, showing how the inclusion of real data prevents the drift away from the original data distribution that typically leads to collapse.
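A rough sketch of that mitigation, reusing the same toy setup as the evaluation loop above (again my own illustration, not the authors’ code): each generation trains on the preserved real data plus freshly generated synthetic data rather than on synthetic data alone.

```python
# Sketch of the mitigation: keep the original real data in every generation's
# training set instead of replacing it entirely with synthetic samples.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
real_train = np.concatenate([rng.normal(-2, 1, 250), rng.normal(3, 0.5, 250)]).reshape(-1, 1)
real_test = np.concatenate([rng.normal(-2, 1, 250), rng.normal(3, 0.5, 250)]).reshape(-1, 1)

data = real_train
for generation in range(10):
    model = GaussianMixture(n_components=2, random_state=0).fit(data)
    print(f"gen {generation}: held-out log-likelihood = {model.score(real_test):.3f}")
    synthetic, _ = model.sample(len(real_train))
    data = np.vstack([real_train, synthetic])  # anchor each generation to the real data
```

In this toy setting, re-anchoring each generation to the original data keeps the fit close to the original distribution, which is the intuition behind the authors’ approach.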
This paper analyzes model collapse when models are trained recursively on data generated by previous models. The authors identify several factors contributing to collapse:
This seminal paper introduced the transformer architecture, which relies on self-attention mechanisms to capture long-range dependencies in data. While transformers are powerful, this strength can also lead to problems when trained on synthetic data. The self-attention mechanism tends to focus on patterns that may be artifacts of synthetic data rather than true features of the original data distribution. This can result in overfitting to non-representative patterns, leading to model collapse.
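For reference, the core operation the paper introduces is scaled dot-product attention, Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V; a minimal NumPy sketch (shapes and values are illustrative only):

```python
# Scaled dot-product attention from "Attention Is All You Need": each output position
# is a weighted sum of the values V, with weights given by the softmax of
# query-key similarity scores.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # query-key similarity, scaled
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over the keys
    return weights @ V                                   # weighted sum of values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))    # 4 tokens, model dimension 8
print(scaled_dot_product_attention(Q, K, V).shape)       # (4, 8)
```

Because the attention weights are learned from whatever statistical regularities appear in the training data, spurious patterns in synthetic data can be picked up just as readily as genuine ones, which is the concern raised above.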
This Nature article highlights the long-term risks of training models on recursively generated data. The study finds that models progressively lose information about the true data distribution, particularly at the distribution’s tails, eventually converging to a distribution with reduced variance. The paper presents a theoretical framework explaining this collapse, showing it as a universal phenomenon across generative models. Even without estimation errors, the compounding of small inaccuracies over generations leads to collapse, emphasizing the need for access to original, human-generated data to prevent this outcome.
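A toy simulation of that tail-loss effect (my own illustration, not the paper’s experiment): repeatedly fit a Gaussian to a finite sample drawn from the previous generation’s fit, and the fitted standard deviation tends to drift downward as small estimation errors compound.

```python
# Toy illustration of variance shrinking over generations: each generation refits a
# Gaussian to a finite synthetic sample from the previous fit. The outcome is
# stochastic (it varies by seed), but the variance tends to erode, i.e. the tails
# of the original distribution are progressively lost.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0                              # generation 0 "model" = the true distribution
for generation in range(1, 31):
    sample = rng.normal(mu, sigma, size=100)      # finite synthetic sample from the current model
    mu, sigma = sample.mean(), sample.std()       # refit on synthetic data only
    if generation % 5 == 0:
        print(f"gen {generation}: fitted sigma = {sigma:.3f}")
```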
Alexander Watson’s article in Towards Data Science presents a counterargument to the concerns about model collapse. He acknowledges the risks but argues that these can be mitigated by strategically combining synthetic and real data during training. Watson suggests using differential privacy techniques and carefully curating synthetic datasets to ensure they reflect real-world data diversity. While synthetic data alone might lead to collapse, thoughtful integration with real data can preserve model performance and reduce the risk of degradation.
This paper examines Low-Rank Adaptation (LoRA) as a parameter-efficient finetuning method for large language models. LoRA trains only low-rank perturbations to selected weight matrices, saving memory and computational resources. The study finds that while LoRA underperforms compared to full finetuning, it offers a desirable form of regularization by maintaining the base model’s performance on tasks outside the target domain. LoRA helps mitigate the “forgetting” of the source domain, a key issue in model collapse. The authors provide a detailed analysis of LoRA’s performance across different domains and propose best practices for finetuning with LoRA.
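For intuition, here is a minimal sketch of the LoRA idea in PyTorch (layer sizes, rank, and scaling are illustrative placeholders, not the paper’s configuration): the base weights are frozen and only a low-rank update is trained, so the effective weight is W + (alpha / r) * B A.

```python
# Minimal LoRA-style linear layer: the pretrained weight stays frozen and only the
# low-rank factors A and B are trained. Sizes and hyperparameters are illustrative.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)                       # freeze the base weights
        self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)    # low-rank down-projection
        self.B = nn.Parameter(torch.zeros(out_features, r))          # zero init: start at the base model
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(768, 768, r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")  # only A and B, a small fraction of the full layer
```

Because the low-rank update starts at zero and touches only a small number of parameters, the adapted model stays close to the base model, which is consistent with the paper’s observation that LoRA forgets less than full finetuning.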