Machine Learning for Synthetic Data Generation:
A Review

Yingzhou Lu1, Minjie Shen2, Huazheng Wang3, Xiao Wang4, Capucine van Rechem1, Wenqi Wei5,6
1 Department of Pathology, Stanford University, Stanford, CA, 94305.
2 The Bradley Department of Electrical and Computer Engineering, Virginia Tech.
3 School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, 97331.
4 School of Computer Science & Engineering, University of Washington, Seattle, WA, 98105.
5 Corresponding author.
6 Computer and Information Science Department, Fordham University, New York City, NY, 10023.
Corresponding e-mails: lyz66@stanford.edu, wenqiwei@fordham.edu.

Abstract

Machine learning heavily relies on data, but real-world applications often encounter various data-related issues. These include data of poor quality, insufficient data points leading to under-fitting of machine learning models, and difficulties in data access due to concerns surrounding privacy, safety, and regulations. In light of these challenges, the concept of synthetic data generation emerges as a promising alternative that allows for data sharing and utilization in ways that real-world data cannot facilitate. This paper presents a comprehensive systematic review of existing studies that employ machine learning models for the purpose of generating synthetic data. The review encompasses various perspectives, starting with the applications of synthetic data generation, spanning computer vision, speech, natural language processing, healthcare, and business domains. Additionally, it explores different machine learning methods, with particular emphasis on neural network architectures and deep generative models. The paper also addresses the crucial aspects of privacy and fairness concerns related to synthetic data generation. Furthermore, this study identifies the challenges and opportunities prevalent in this emerging field, shedding light on the potential avenues for future research. By delving into the intricacies of synthetic data generation, this paper aims to contribute to the advancement of knowledge and inspire further exploration in this emerging field.

Index Terms:
data synthesis, machine learning, data trustworthiness

I Introduction

Machine learning endows intelligent computer systems with the capacity to autonomously tackle tasks, pushing the envelope of industrial innovation [1] . By integrating high-performance computing, contemporary modeling, and simulations, machine learning has evolved into an indispensable instrument for managing and analyzing massive volumes of data [2, 3] .

Nonetheless, it is important to recognize that machine learning does not invariably resolve problems or yield the optimal solution. Although artificial intelligence is currently experiencing a golden age, numerous challenges persist in the development and application of machine learning technology [4]. As the field continues to advance, addressing these obstacles will be essential for unlocking the full potential of machine learning and its transformative impact on various industries.

The process of collecting and annotating data is both time-consuming and expensive [5] , giving rise to numerous issues. As machine learning is heavily dependent on data, some of the key hurdles and challenges it faces include:

Data quality. Ensuring data quality is one of the most significant challenges confronting machine learning professionals. When data is of subpar quality, models may generate incorrect or imprecise predictions due to confusion and misinterpretation [6, 7].

Data scarcity . A considerable portion of the contemporary AI dilemma stems from inadequate data availability: either the number of accessible datasets is insufficient, or manual labeling is excessively costly [8] .

Data privacy and fairness. There are many areas in which datasets cannot be publicly released due to privacy and fairness concerns. In these cases, generating synthetic data can be very useful, and we will investigate ways of creating anonymized datasets with differential privacy protections.


Addressing these challenges will be essential for unlocking the full potential of machine learning and its transformative impact on various industries [9, 10, 11]. Synthetic data is generally defined as artificially annotated information generated by computer algorithms or simulations [12, 4]. In many cases, synthetic data is necessary when real data is either unavailable or must be kept private due to privacy or compliance risks [10, 13, 14]. This technology is extensively utilized in various sectors, such as healthcare, business, manufacturing, and agriculture, with demand growing at an exponential rate [15].

The objective of this paper is to offer a high-level overview of several state-of-the-art approaches currently being investigated by machine learning researchers for synthetic data generation. For the reader’s convenience, we summarize the paper’s main contributions as follows:

We present pertinent ideas and background information on synthetic data, serving as a guide for researchers interested in this domain.

We explore different real-world application domains and emphasize the range of opportunities that GANs and synthetic data generation can provide in bridging gaps (Section II ).

We examine a diverse array of deep neural network architectures and deep generative models dedicated to generating high-quality synthetic data, present advanced generative models, and outline potential avenues for future research (Section III and IV ).

We address privacy and fairness concerns, as sensitive information can be inferred from synthesized data, and biases embedded in real-world data can be inherited. We review current technological advancements and their limitations in safeguarding data privacy and ensuring the fairness of synthesized data (Section V and VI ).

We outline several general evaluation strategies to assess the quality of synthetic data (Section VIII ).

We identify challenges faced in generating synthetic data and during the deployment process, highlighting potential future work that could further enhance functionality (Section IX ).


II-B Voice

The field of synthetic voice is at the forefront of technological advancement, and its evolution is happening at a breakneck pace. With the advent of machine learning and deep learning, creating synthetic voices for various applications such as video production, digital assistants, and video games [58] has become easier and more accurate. This field is an intersection of diverse disciplines, including acoustics, linguistics, and signal processing. Researchers in this area continuously strive to improve synthetic voices’ accuracy and naturalness. As technology advances, we can expect to see synthetic voices become even more prevalent in our daily lives, assisting us in various ways and enriching our experiences in many fields [59] .

Earlier studies include spectral modeling for statistical parametric speech synthesis, in which low-level, untransformed spectral envelope parameters are used for voice synthesis. The low-level spectral envelopes are represented by graphical models incorporating multiple hidden variables, such as restricted Boltzmann machines and deep belief networks (DBNs) [60]. Compared with conventional hidden Markov model (HMM)-based speech synthesis systems, these approaches significantly improve naturalness and reduce over-smoothing [61].

Synthetic data can also be applied to Text-to-Speech (TTS) to achieve near-human naturalness [62, 63]. As an alternative to sparse or limited training data, SynthASR was developed, which uses synthetic speech for automatic speech recognition training. A combination of weighted multi-style training, data augmentation, encoder freezing, and parameter regularization is also employed to address catastrophic forgetting. Using this novel model, the researchers were able to apply state-of-the-art techniques to train a wide range of end-to-end (E2E) automatic speech recognition (ASR) models while reducing the need for production data and the associated costs [62].

II-C Natural Language Processing (NLP)

The increasing interest in synthetic data has spurred the development of a wide array of deep generative models in the field of natural language processing (NLP) [51] . In recent years, a multitude of methods and models have illustrated the capabilities of machine learning in categorizing, routing, filtering, and searching for relevant information across various domains [64] .

Despite these advancements, challenges remain. For example, the meaning of words and phrases can change depending on their context, and homonyms with distinct definitions can pose additional difficulties [65]. To tackle these challenges, BLEURT, an evaluation metric built on BERT, was proposed to model human judgments from a limited number of potentially biased training examples. The researchers employed millions of synthetic examples in an innovative pre-training scheme, bolstering the model’s ability to generalize [66]. Experimental results indicate that BLEURT surpasses its counterparts on both the WebNLG Competition dataset and the WMT Metrics shared task, highlighting its efficacy in NLP tasks [39].

Another significant breakthrough in text generation using GANs is RelGAN, developed at Rice University. The model comprises three main components: a relational memory-based generator, a Gumbel-Softmax relaxation algorithm, and multiple embedded representations within the discriminator. When benchmarked against several cutting-edge models, RelGAN demonstrates superior performance in terms of sample quality and diversity, showcasing its potential for further investigation and application in a wide range of NLP tasks and challenges [42, 67].

II-D Healthcare

In order to protect health information and improve reproducibility in research, synthetic data has drawn mainstream attention in the healthcare industry [68, 69]. Many labs and companies have harnessed big data and advanced computational tools to produce large quantities of synthetic data [70]. Synthetic data, modeled after patient data, is essential to understanding diseases while simultaneously maintaining patient confidentiality and privacy [71]. In theory, synthetic data can reflect the original distribution of the data without revealing actual patient data [71, 72, 73].

Synthetic data generation can also be utilized to discover new scientific principles by grounding it in biological priors [68]. A number of models and software tools have been developed, such as SynSys, which uses hidden Markov models and regression models initially trained on real datasets to generate synthetic time series data consisting of nested sequences [69], and corGAN, in which synthetic data is generated by capturing correlations between adjacent medical features in the data representation space [26].

Synthetic data generation has also been widely used in drug discovery, especially de novo drug molecular design. Drugs are essentially molecular structures with desirable pharmaceutical properties. The goal of de novo drug design is to produce novel and desirable molecular structures from scratch; the term “de novo” means “from the beginning.” The whole molecule space is around $10^{60}$ [74, 25, 75]. Most existing methods rely heavily on brute-force enumeration and are computationally prohibitive. Generative models are able to learn the distribution of drug molecules from an existing drug database and then draw novel samples (i.e., drug molecules) from the learned molecule distribution; such models include the variational autoencoder (VAE) [76, 77, 21], generative adversarial network (GAN) [78], energy-based model (EBM) [79, 80], diffusion model [81], reinforcement learning (RL) [22, 82, 24], genetic algorithms [83], sampling-based methods [84, 85], etc.

In healthcare, patient information is often stored in electronic health record (EHR) format [86, 87, 88]. Research in medicine has been greatly facilitated by the availability of information from electronic health records [89, 90]. MedGAN, an adversarial network model for generating realistic synthetic patient records, has been proposed by Edward Choi and colleagues. With the help of an autoencoder and generative adversarial networks, medGAN can generate high-dimensional discrete variables (e.g., binary and count features) based on real patient records [16]. Based on evaluations of medGAN’s performance across a diverse set of tasks, including distribution statistics, classification performance [91], and expert review, medGAN exhibits performance close to that of real data [92, 93, 94, 95, 16]. Using synthetic data can help reduce the regulatory barriers that have prevented the widespread sharing and integration of patient data across multiple organizations [96, 97]. Researchers across the globe would be able to request access to synthetic data from an institution to conduct their own research. Such capabilities can increase both the efficiency and scope of a study as well as reduce the likelihood of biases being introduced into the results [69, 98, 99].

II-E Business

The inherent risk of compromising or exposing original data persists as long as it remains in use, particularly in the business sector, where data sharing is heavily constrained both within and outside the organization [100]. Consequently, it is crucial to explore methods for generating financial datasets that emulate the properties of “real data” while maintaining the privacy of the involved parties [100].

Efforts have been made to secure original data using technologies like encryption, anonymization, and cutting-edge privacy preservation [101]. However, information gleaned from the data may still be employed to trace individuals, thereby still posing a privacy risk [102]. A notable advantage of synthetic data lies in its ability to eliminate the exposure of critical data, thus ensuring privacy and security for both companies and their customers [103].

Moreover, synthetic data enables organizations to access data more rapidly, as it bypasses privacy and security protocols [104] . In the past, institutions possessing extensive data repositories could potentially assist decision-makers in resolving a broad spectrum of issues. However, accessing such data, even for internal purposes, was hindered by confidentiality concerns. Presently, companies are harnessing synthetic data to refresh and model original data, generating continuous insights that contribute to enhancing the organization’s performance [4] .

II-F Education

Synthetic data is gaining increasing attention in the field of education due to its vast potential for research and teaching. Synthetic data refers to computer-generated information that mimics the properties of real-world data without disclosing any personally identifiable information [105] . This approach proves instrumental for educational settings, where ethical constraints often limit the use of real-world student data. Therefore, synthetic data offers a robust solution for privacy-concerned data sharing and analysis, enabling the creation of accurate models and strategies to improve the teaching-learning process.

A detailed example of synthetic data usage in education is the simulation of student performance data to aid in designing teaching strategies. Suppose an educational researcher wants to investigate the impact of teaching styles on student performance across different backgrounds and learning abilities. However, obtaining real student data for such studies can be ethically complex and potentially intrusive. In such a situation, synthetic data can be generated that mirrors the demographic distributions, learning patterns, and likely performance of a typical student population. This data can then be used to model the effects of various teaching strategies without compromising student privacy [106] .

Furthermore, synthetic data can be a powerful tool in teacher training programs. For example, teacher candidates can use synthetic student data to practice data-driven instructional strategies, including differentiated instruction and personalized learning plans. They can analyze this synthetic data, identify patterns, determine student needs, and adjust their instructional plans accordingly. By using synthetic data, teacher candidates gain practical experience in analyzing student data and adapting their teaching without infringing on the privacy of actual students [107]. Thus, synthetic data serves as a valuable bridge between theory and practice in education, driving innovation while safeguarding privacy.

II-G Location and Trajectory Generation

Location and trajectory data are a particular form of data that can highly reflect users’ daily lives, habits, home addresses, workplaces, etc. To protect location privacy, synthetic location generation has been introduced as an alternative to location perturbation [108]. The main challenge of generating synthetic location and trajectory data is to resemble genuine user-produced data while simultaneously offering practical privacy protection. One approach to generating location and trajectory data is to inject a synthetic point-based site within a user’s trajectory [109, 110, 111]. Synthetic trajectory generation is frequently combined with privacy-enhancing techniques to further prevent sensitive inference from the synthesized data. For example, Chen et al. [112] introduce an N-gram-based method for trajectory publishing that predicts the next position based on previous positions. They exploit a prefix tree to describe the n-gram model while combining it with differential privacy [113]. [114] proposes a synthetic trajectory strategy based on the discretization of raw trajectories using hierarchical reference systems to capture individual movements at differing speeds. Their method adaptively selects a small set of reference systems and constructs prefix tree counts with differential privacy; by applying direction-weighted sampling, the reduced number of tree nodes lowers the amount of added noise and improves the utility of the synthetic data. The authors of [115] extract multiple differentially private distributions with redundant information and generate a new trajectory by sampling from these distributions. In addition to differential privacy, Bindschaedler and Shokri [116] enforce plausible deniability to generate privacy-preserving synthetic traces. Their method first introduces trace similarity and intersection functions that map a fake trace to a real trace under similarity and intersection constraints. It then generates a fake trace by clustering the locations and replacing the trajectory locations with those from the same group. If the fake trace satisfies plausible deniability, i.e., there exist k other real traces that can map to the fake trace, then it preserves the privacy of the seed trace. While existing studies mainly use the Markov chain model, [117] proposes to choose adaptively between first-order and second-order Markov models. The proposed PrivTrace controls the space and time overhead with the first-order Markov chain model and achieves good accuracy for next-step prediction with the second-order Markov chain model.
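To make the Markov chain modeling that many of these trajectory generators build on concrete, the following is a simplified, illustrative sketch (ours, not the method of any specific paper cited above) of a first-order chain over discretized locations, without the differential privacy machinery: transition counts are estimated from real traces and synthetic traces are sampled from the fitted chain.

```python
import numpy as np

def fit_first_order_chain(trajectories, n_locations):
    """Estimate a first-order Markov transition matrix over discretized locations."""
    counts = np.ones((n_locations, n_locations))  # add-one smoothing
    for traj in trajectories:
        for a, b in zip(traj[:-1], traj[1:]):
            counts[a, b] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def sample_trajectory(transition, start, length, rng):
    """Sample a synthetic trajectory of the given length from the fitted chain."""
    traj = [start]
    for _ in range(length - 1):
        traj.append(int(rng.choice(len(transition), p=transition[traj[-1]])))
    return traj

# Toy example: real trajectories over 4 discretized locations (e.g., grid cells).
real_trajs = [[0, 1, 2, 2, 3], [0, 1, 1, 2, 3], [3, 2, 1, 0, 0]]
rng = np.random.default_rng(0)
P = fit_first_order_chain(real_trajs, n_locations=4)
print(sample_trajectory(P, start=0, length=6, rng=rng))
```

In practice, the cited approaches add differential privacy by perturbing the prefix-tree or transition counts before sampling.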

II-H AI-Generated Content (AIGC)

AI-Generated Content (AIGC) stands at the forefront of the technology and content creation industry, changing the dynamics of content production. A typical example of AIGC is OpenAI’s ChatGPT, an AI-driven platform generating human-like text in response to prompts or questions. It leverages a vast corpus of internet text to generate detailed responses, often indistinguishable from those a human writer would produce. This capacity extends beyond simple question-answer pairs to crafting whole articles, stories, or technical explanations on a wide range of topics, thus creating a novel way of producing blog posts, articles, social media content, and more [118, 119] .

Google’s Project Bard focuses more on the creative aspects of text generation. It is designed to generate interactive fiction and assist in storytelling. Users can engage in an interactive dialogue with the model, directing the course of a narrative by providing prompts that the AI responds to, thus co-creating a story. This opens up fascinating possibilities for interactive entertainment and digital storytelling [120] .

An innovative application of AIGC is in the field of news reporting. News agencies are increasingly using AI systems, such as the GPT series, to generate news content. For instance, the Associated Press uses AI to generate news articles about corporate earnings automatically. The AI takes structured data about company earnings and transforms it into a brief, coherent, and accurate news report. This automation allows the agency to cover a much larger number of companies than would otherwise be possible with human journalists alone [121] .

Additionally, AIGC has found its place in the creative domain, with AI systems being used to generate book descriptions, plot outlines, and even full chapters of novels. For instance, a novelist could use ChatGPT to generate a synopsis for their upcoming book based on a few keywords or prompts related to the story. Similarly, marketing teams utilize AI to create compelling product descriptions for online marketplaces [122] . This not only increases efficiency but also provides a level of uniformity and scalability that would be challenging to achieve with human writers alone. Through these examples, it is clear that AIGC is profoundly impacting the landscape of content creation and will continue to shape it in the future [120] .

III Deep Neural Network

It is no secret that deep neural networks have become increasingly prominent in the field of computer vision and other areas. Nevertheless, they require large amounts of annotated data for supervised training, which limits their effectiveness [123]. In this section, we review and compare various commonly used deep neural network architectures as background knowledge, including the multilayer perceptron (MLP) in Section III-A, the convolutional neural network (CNN) in Section III-B, the recurrent neural network (RNN) in Section III-C, the graph neural network (GNN) in Section III-D, and the transformer in Section III-E.

III-A Multilayer Perceptron (MLP)

The multilayer perceptron (MLP) is a classical (or vanilla) feedforward artificial neural network in which consecutive layers are fully connected, so it models all possible interactions between the features. It is also the most widely used neural network. In the $i$-th layer, the propagation can be written as

$$h^{i+1} = \sigma\left(\mathbf{W} h^{i} + b\right),$$

where $h^{i}$ is the input and $h^{i+1}$ is the output (which is also the input of the $(i+1)$-th layer), $\mathbf{W}$ is the weight matrix, $b$ is the bias vector, and both $\mathbf{W}$ and $b$ are parameters of the MLP. $\sigma(\cdot)$ denotes the activation function; popular activation functions include the sigmoid, ReLU, and tanh. The goal of the activation function is to provide a nonlinear transformation. The MLP is the basis of many neural network architectures. Please refer to [124] for more details about MLP.
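The propagation rule above can be sketched in a few lines of NumPy; the two-layer shape, random weights, and ReLU activation below are illustrative assumptions rather than a prescribed architecture.

```python
import numpy as np

def relu(x):
    # ReLU activation: provides the nonlinear transformation sigma(.)
    return np.maximum(0.0, x)

def mlp_forward(x, weights, biases):
    """Apply h^{i+1} = sigma(W h^i + b) layer by layer."""
    h = x
    for W, b in zip(weights, biases):
        h = relu(W @ h + b)
    return h

# Illustrative shapes: 4 input features -> 8 hidden units -> 2 outputs.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(8, 4)), rng.normal(size=(2, 8))]
biases = [np.zeros(8), np.zeros(2)]
x = rng.normal(size=4)
print(mlp_forward(x, weights, biases))
```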

III-B Convolutional Neural Network (CNN)

The convolutional neural network (CNN) was proposed to learn better representations of images [124]. The core idea of the CNN is a two-dimensional convolutional layer that slides over the image horizontally and vertically to model small-sized patches. Convolutional neural networks with one-dimensional convolutional layers can also model sequence data. Please refer to [124] for more details about CNN.
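As a hedged illustration of the sliding two-dimensional convolution described above, the following sketch applies a single PyTorch Conv2d layer to a random image-shaped tensor; the channel counts, kernel size, and input resolution are arbitrary choices for demonstration.

```python
import torch
import torch.nn as nn

# A single 2D convolutional layer: 3 input channels (e.g., RGB), 16 learned filters,
# each a 3x3 kernel that slides horizontally and vertically over the image.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

image = torch.randn(1, 3, 32, 32)   # (batch, channels, height, width)
feature_map = conv(image)
print(feature_map.shape)            # torch.Size([1, 16, 32, 32])
```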

III-C Recurrent Neural Network (RNN)

The recurrent neural network (RNN) was proposed to model sequential data such as text, speech, and time series. The core idea is to maintain a hidden state that is updated at each time step as the network consumes the sequence, so that earlier inputs can influence later outputs. Gated variants such as the long short-term memory (LSTM) and the gated recurrent unit (GRU) mitigate the vanishing-gradient problem when modeling long sequences.

III-D Graph Neural Network (GNN)

There are many graph-structured data in downstream applications, such as social networks, brain networks, chemical molecules, knowledge graphs, etc. The graph neural network was proposed to model graph-structured data and learn the topological structure of the graph [128, 129, 130]. Specifically, graph neural networks build interactions between connected nodes and edges to model the topological structure of the graph [131, 132, 133, 134]. The feedforward rule of a graph neural network can be formulated as

$$h_{i}^{(l+1)} = \mathrm{UPDATE}\Big( h_{i}^{(l)},\ \mathrm{AGGREGATE}\big( \{ (h_{j}^{(l)},\, m_{j},\, m_{ji}) : j \in \mathcal{N}(i) \} \big) \Big),$$

where $h_{i}^{(l)}$ denotes the embedding of the $i$-th node at the $l$-th layer, $m_{j}$ denotes the node feature of node $j$, $m_{ji}$ denotes the edge feature of the edge that connects nodes $j$ and $i$, and $\mathcal{N}(i)$ is the set of nodes connected to $i$. Within each layer, we update the current representation by aggregating information from (1) the representation from the previous layer, (2) the node features, and (3) the edge features. For example, for chemical compounds, each node corresponds to an atom, and the node feature is the category of the atom, e.g., carbon, nitrogen, or oxygen; each edge is a chemical bond, and the edge feature is the type of the bond, including single, double, triple, and aromatic bonds.
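The generic update above can be sketched as follows; this is our illustrative NumPy implementation of one message-passing layer (sum aggregation over neighbors, with neighbor embedding, node feature, and edge feature concatenated), not the formulation of any specific paper cited here.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def gnn_layer(h, node_feat, edge_feat, neighbors, W_self, W_msg):
    """One message-passing layer.

    h:         (n, d)  node embeddings at layer l
    node_feat: (n, f)  raw node features m_j
    edge_feat: dict (j, i) -> (e,) edge features m_ji
    neighbors: dict i -> list of neighbor indices N(i)
    """
    n = h.shape[0]
    h_next = np.zeros((n, W_self.shape[0]))
    for i in range(n):
        # Each message combines the neighbor embedding, node feature, and edge feature.
        msg = sum(
            W_msg @ np.concatenate([h[j], node_feat[j], edge_feat[(j, i)]])
            for j in neighbors.get(i, [])
        )
        h_next[i] = relu(W_self @ h[i] + msg)
    return h_next

# Tiny example: 3 nodes in a triangle, 2-dim embeddings, 1-dim node/edge features.
rng = np.random.default_rng(0)
h = rng.normal(size=(3, 2))
node_feat = rng.normal(size=(3, 1))
edges = [(0, 1), (1, 0), (1, 2), (2, 1), (0, 2), (2, 0)]
edge_feat = {e: rng.normal(size=1) for e in edges}
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
W_self, W_msg = rng.normal(size=(2, 2)), rng.normal(size=(2, 4))
print(gnn_layer(h, node_feat, edge_feat, neighbors, W_self, W_msg))
```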

III-E Transformer

The Transformer architecture, introduced by Vaswani et al. in the groundbreaking paper “Attention is All You Need” in 2017, revolutionized the field of natural language processing and machine learning. Unlike traditional sequential models like RNNs (Recurrent Neural Networks) and LSTMs (Long Short-Term Memory), the Transformer model leverages a unique mechanism called self-attention, which allows it to capture long-range dependencies and relationships within the input data more effectively. This architecture consists of a stack of identical layers, each containing a multi-head self-attention mechanism followed by position-wise fully connected feed-forward networks. By eschewing recurrent or convolutional layers, the Transformer model is highly parallelizable and computationally efficient, leading to faster training times and improved performance on various NLP tasks.
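To ground the self-attention mechanism described above, here is a hedged NumPy sketch of single-head scaled dot-product attention in the form defined by Vaswani et al.; the sequence length and dimensions are arbitrary, and a full Transformer would additionally use multiple heads, projections, and feed-forward layers.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ V                                    # weighted sum of values

# A toy sequence of 5 tokens with 8-dimensional query/key/value vectors.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)        # (5, 8)
```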

The Generative Pre-trained Transformer (GPT) is a cutting-edge deep learning model that has revolutionized natural language processing (NLP) tasks [135] . Developed by OpenAI, GPT is an autoregressive transformer-based model that has displayed unparalleled performance in tasks such as text generation, translation, text summarization, and question-answering.

The model’s architecture consists of multiple self-attention mechanisms and position-wise feedforward layers, enabling it to capture long-range dependencies and generate highly coherent and contextually relevant text. The key to GPT’s success lies in its unsupervised pre-training on vast amounts of textual data, followed by fine-tuning on specific tasks. As the GPT model series progresses, with GPT-3 being the latest version at the time of this writing, the size and capabilities of the model continue to grow, paving the way for increasingly sophisticated NLP applications and opening up new possibilities for the generation of synthetic data.

By leveraging its pre-training on massive datasets and fine-tuning for specific tasks, GPT can produce artificial data samples that closely resemble real-world data. This capability is particularly valuable in scenarios where access to real data is limited due to privacy, regulatory, or resource constraints. GPT-generated synthetic data can be used to augment existing datasets, enabling researchers and practitioners to build more robust and accurate machine learning models while mitigating the risks associated with using sensitive or private data. Additionally, the synthetic data generated by GPT models can help address challenges related to data scarcity, class imbalance, or the need for domain-specific data, ultimately contributing to developing and deploying more effective AI solutions across various domains and applications.
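As a hedged illustration of generating synthetic text with a pretrained autoregressive Transformer (using the publicly available GPT-2 rather than GPT-3, and assuming the Hugging Face transformers package is installed), one might sample several continuations of a seed prompt; the prompt and sampling parameters below are illustrative, not taken from the cited works.

```python
from transformers import pipeline

# Load a small, publicly available autoregressive Transformer (GPT-2).
generator = pipeline("text-generation", model="gpt2")

# Generate three synthetic text snippets from an illustrative seed prompt.
samples = generator(
    "Patient presents with",
    max_length=40,
    num_return_sequences=3,
    do_sample=True,
    top_p=0.95,
)
for s in samples:
    print(s["generated_text"])
```

Such synthetic samples can then be filtered and used to augment scarce or sensitive training corpora, as discussed above.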

IV Generative AI

Generative AI models, also known as deep generative models or distribution learning methods, refer to a wide class of AI methods that learn the data distribution from existing data objects and sample from the learned distribution to produce novel structured data objects; they fall into the category of unsupervised learning. In this section, we investigate several generative AI models that are frequently used in synthetic data generation, including the language model in Section IV-A, the variational autoencoder (VAE) in Section IV-C, the generative adversarial network (GAN) in Section IV-D, reinforcement learning (RL) in Section IV-E, and the diffusion model in Section IV-F. Table II compares various generative AI methods from several aspects.


Other strategies include:

1. Data Anonymization: This process removes personally identifiable information from data sets, ensuring that the individuals the data describe remain anonymous. This is crucial in industries such as healthcare, where patient data privacy is a legal requirement.

2. Data Masking: This technique involves replacing sensitive data with fictitious yet realistic data. It is often used to protect the data while maintaining its usability for testing or development purposes. For example, a developer might need to use customer data to test a new feature, but they don’t need to know the actual personal details of the customer to do so.

3. Data Perturbation: This involves adding noise to the data to prevent the identification of individuals in the dataset while preserving the statistical properties of the data. This is particularly useful in research scenarios where data needs to be shared but individual privacy must be maintained.

4. Differential Privacy: This system publicly shares information about a dataset by describing the patterns of groups within the dataset while withholding information about individuals in the dataset. It provides a mathematical guarantee of privacy and is becoming an increasingly popular method for enhancing privacy in machine learning; a minimal sketch of its basic Laplace mechanism is given after this list.

5. Federated Learning: This is a machine learning approach where the model is trained across multiple decentralized edge devices or servers holding local data samples, without exchanging them. This approach allows for the utilization of a wide array of data sources, while also ensuring that sensitive data does not leave its original device.

These techniques are all part of the broader field of synthetic data generation, which aims to create data that can be used for a variety of purposes (such as training machine learning models) without compromising privacy.
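As a concrete, hedged example of the perturbation and differential privacy strategies above, the following sketch adds Laplace noise calibrated to the sensitivity of a counting query, which yields ε-differential privacy for that single query; the dataset, predicate, and ε value are made up for illustration.

```python
import numpy as np

def dp_count(values, predicate, epsilon, sensitivity=1.0, rng=None):
    """Release a counting query under epsilon-differential privacy.

    Adding or removing one individual changes a count by at most 1,
    so Laplace noise with scale sensitivity/epsilon suffices.
    """
    rng = rng or np.random.default_rng()
    true_count = sum(predicate(v) for v in values)
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

# Illustrative data: ages of individuals in a private dataset.
ages = [23, 35, 47, 52, 29, 61, 44, 38]
noisy = dp_count(ages, lambda a: a >= 40, epsilon=0.5)
print(f"Noisy count of people aged 40+: {noisy:.2f}")
```

Smaller ε values add more noise and give stronger privacy, at the cost of utility; composing many such queries consumes a cumulative privacy budget.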

VI Fairness

Generating synthetic data that reflects the important underlying statistical properties of real-world data may also inherit bias from data preprocessing, collection, and algorithms [194]. The fairness problem is currently addressed by three types of methods [195]: (i) preprocessing, which revises input data to remove information correlated with sensitive attributes, usually via techniques like massaging, reweighting, and sampling; (ii) in-processing, which adds fairness constraints to the model learning process; and (iii) post-processing, which adjusts model predictions after the model is trained.
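To illustrate the reweighting technique mentioned in the preprocessing category, here is a hedged sketch in the spirit of Kamiran and Calders' reweighing (our example, not the method of any paper cited in this section): instance weights are chosen so that the sensitive attribute and the label become statistically independent under the weighted distribution.

```python
import numpy as np

def reweighing_weights(sensitive, labels):
    """Weight w(a, y) = P(A=a) * P(Y=y) / P(A=a, Y=y) for each instance,
    so that A and Y are independent in the reweighted data."""
    sensitive, labels = np.asarray(sensitive), np.asarray(labels)
    weights = np.empty(len(labels))
    for a in np.unique(sensitive):
        for y in np.unique(labels):
            mask = (sensitive == a) & (labels == y)
            if mask.any():
                p_joint = mask.mean()
                weights[mask] = (sensitive == a).mean() * (labels == y).mean() / p_joint
    return weights

# Toy data: binary sensitive attribute and binary label with group imbalance.
sensitive = np.array([0, 0, 0, 0, 1, 1, 1, 1])
labels    = np.array([1, 1, 1, 0, 1, 0, 0, 0])
print(reweighing_weights(sensitive, labels))
```

The resulting weights can be passed to a downstream learner or used as sampling probabilities when constructing a balanced training (or synthetic) dataset.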

Most existing fairness-aware data synthesis methods leverage preprocessing techniques. The use of balanced synthetic datasets created by GANs to augment classification training has demonstrated benefits for reducing disparate impact due to minoritized subgroup imbalance [196, 197, 198]. [199] models bias with a probabilistic network exploiting structural equation modeling as a preprocessing step to generate a fairness-aware synthetic dataset. The authors of [200] leverage a GAN as the preprocessing step for fair data generation, ensuring that the generated data is discrimination-free while maintaining high data utility. By comparison, [201] is geared towards high-dimensional image data and proposes a novel auxiliary classifier GAN that strives for demographic parity or equality of opportunity. However, preprocessing requires the synthesized data provider to know all correlations, biases, and distributions of variables in the existing datasets a priori. Compared to preprocessing, the latter two categories are less developed for fair data synthesis.

In the meantime, differential privacy can amplify fairness issues in the original data [202]. [203] demonstrate that differential privacy does not introduce unfairness into the data generation process or into standard group fairness measures of the downstream classification models, but it does unfairly increase the influence of majority subgroups. Differential privacy also significantly reduces the quality of the images generated from GANs, decreasing the synthetic data’s utility in downstream tasks. To measure fairness in synthesized data, [92] develops two covariate-level disparity fairness metrics for synthetic data. The authors examine all subgroups defined by protected attributes to analyze the bias.

In the emerging field of AIGC using foundation models, the generated images and texts may also inherit the stereotypes, exclusion, and marginalization of certain groups, as well as toxic and offensive information present in real-world data. This can lead to discrimination and harm to certain social groups, and the misuse of such data synthesis approaches for misinformation and manipulation would cause further negative social impact [204]. Given that the quality of the data generated by foundation models is inextricably linked to the quality of the training corpora, it is essential to regulate the real-world data used to form the data synthesis distribution. While reducing bias in data is important, the remaining bias in the data may also be amplified by the models [195] or by privacy-enhancing components [202]. Frequent inspection and removal of sensitive and toxic information from both data and models will help govern the information generated by these foundation models and, we hope, ensure that the models do no harm.

VII Trustworthiness

As data-driven decision making proliferates across various industries, the creation and use of synthetic data has become increasingly prevalent. Synthetic data, artificially generated data that simulates real-world scenarios, offers a way to bypass several problems associated with real data, such as privacy concerns, scarcity, or data collection difficulty. Nevertheless, the trustworthiness of synthetic data is a subject of ongoing debate, hinging on aspects such as data representativeness, privacy preservation, and potential biases.

For synthetic data to be trustworthy, it must offer a faithful statistical representation of the original data, while maintaining the inherent variability and structure. The risk lies in creating data that oversimplifies or misrepresents the complexities of real-world data, potentially leading to inaccurate conclusions or ineffective solutions when used in analysis or modelling.

Privacy preservation is another critical factor in synthetic data generation. Synthetic data is often utilized in situations where the use of real data may breach privacy regulations or ethical boundaries. While synthetic data promises a level of anonymity, there is an ongoing debate about the extent to which this data can be de-anonymized. If synthetic data could be traced back to the original contributors, it would undermine its trustworthiness and the privacy it promises to uphold.

Potential biases in synthetic data are a significant concern. Even though synthetic data is artificially generated, it often relies on real-world data to inform its creation. Thus, if the real-world data is biased, these biases could be unwittingly replicated in the synthetic data, perpetuating the same flawed patterns and undermining its trustworthiness.

Moreover, assessing the trustworthiness of synthetic data involves the evaluation of the synthetic data generation methods themselves. Transparency in the generation process, including a clear understanding of the underlying algorithms and parameters used, is crucial in judging the trustworthiness of the resultant synthetic data.

In conclusion, while synthetic data presents compelling benefits, the trustworthiness of such data depends on its representativeness, privacy preservation, and absence of bias. By recognizing and addressing these concerns, researchers and practitioners can make informed decisions about synthetic data’s validity and ethical implications. Transparent and robust synthetic data generation methods are paramount in fostering this trust.

VIII Evaluation Strategy

In this section, we discuss various approaches to evaluating the quality of synthesized data, which is essential for determining the effectiveness and applicability of synthetic data generation methods in real-world scenarios. We categorize these evaluation strategies as follows:

Human evaluation . This method is the most direct way to assess the quality of synthesized data. Human evaluation involves soliciting opinions from domain experts or non-expert users to judge the synthesized data’s quality, similarity to real data, or usability in specific applications. For example, in speech synthesis, the human evaluator rates the synthesized speech and real human speech in a blind manner [205, 44] . However, human evaluation has several drawbacks, including being expensive, time-consuming, error-prone, and not scalable. Additionally, it struggles with high-dimensional data that cannot be easily visualized and evaluated by humans.

Statistical difference evaluation . This strategy involves calculating various statistical metrics on both the synthesized and real datasets and comparing the results. For example, [53, 206] use first-moment statistics of individual features (e.g., medical concept frequency/correlation, patient-level clinical feature) to evaluate the quality of generated electronic health record (EHR) data. The smaller the differences between the statistical properties of synthetic and real data, the better the quality of the synthesized data.
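As a hedged sketch of such first-moment comparisons (our illustration, not the exact metrics of the cited works), one can compare feature-wise means and pairwise correlation matrices between real and synthetic tables; the Gaussian toy data below stands in for real and synthetic feature matrices.

```python
import numpy as np

def statistical_difference(real, synthetic):
    """Compare feature-wise means and pairwise correlations of two datasets.

    real, synthetic: arrays of shape (n_samples, n_features).
    Smaller values indicate synthetic data that better matches the real data.
    """
    mean_gap = np.abs(real.mean(axis=0) - synthetic.mean(axis=0)).mean()
    corr_gap = np.abs(np.corrcoef(real, rowvar=False)
                      - np.corrcoef(synthetic, rowvar=False)).mean()
    return {"mean_abs_diff": mean_gap, "corr_abs_diff": corr_gap}

# Toy example with made-up Gaussian data standing in for real and synthetic features.
rng = np.random.default_rng(0)
real = rng.multivariate_normal([0, 1], [[1.0, 0.6], [0.6, 1.0]], size=1000)
synthetic = rng.multivariate_normal([0.1, 0.9], [[1.1, 0.5], [0.5, 0.9]], size=1000)
print(statistical_difference(real, synthetic))
```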

Evaluation using a pre-trained machine learning model . As mentioned in Section IV-D , in the generative adversarial network (GAN), the discriminator differentiates fake data (synthesized data) from real ones. Consequently, the output of the discriminator can measure how closely synthetic data resembles real data. The performance of the discriminator on the synthesized data can be used as an indicator of how well the generator produces realistic data. This strategy can be applied not only to GANs but also to other generative models where a pre-trained machine learning model is used for evaluation.
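A common instantiation of this idea, sketched below under the assumption that scikit-learn is available, is to train a fresh classifier to distinguish real from synthetic samples; a held-out AUC close to 0.5 suggests the two are hard to tell apart, while an AUC near 1.0 indicates easily detectable synthetic data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def discriminative_score(real, synthetic, seed=0):
    """Train a classifier to separate real (label 1) from synthetic (label 0).
    An AUC near 0.5 means the synthetic data is nearly indistinguishable."""
    X = np.vstack([real, synthetic])
    y = np.concatenate([np.ones(len(real)), np.zeros(len(synthetic))])
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

# Toy data: the synthetic distribution is slightly shifted from the real one.
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(500, 5))
synthetic = rng.normal(0.2, 1.0, size=(500, 5))
print(f"Discriminative AUC: {discriminative_score(real, synthetic):.3f}")
```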

Training on synthetic dataset and testing on the real dataset (TSTR) . This strategy involves using synthetic data to train machine learning models and assessing their prediction performance on real test data in downstream applications. High performance on real test data indicates that the synthetic data has successfully captured essential characteristics of the real data, making it a useful proxy for training. For example, [207] employ synthetic data to train machine learning models and assess their prediction performance on real test data in downstream applications. TSTR can provide insights into the effectiveness of synthetic data for training machine learning models in a wide range of tasks and domains.
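The TSTR protocol can be sketched as follows; this is our illustration using scikit-learn, and the classifier choice and toy data generators are placeholders for a real generative model and real held-out data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def tstr_score(X_syn, y_syn, X_real_test, y_real_test):
    """Train on Synthetic, Test on Real: fit a model on synthetic data only
    and report its accuracy on a held-out real test set."""
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_syn, y_syn)
    return accuracy_score(y_real_test, model.predict(X_real_test))

# Toy stand-ins: in practice X_syn/y_syn come from a generative model and
# X_real/y_real from the real held-out data.
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(800, 10))
y_syn = (X_syn[:, 0] > 0).astype(int)
X_real = rng.normal(size=(200, 10))
y_real = (X_real[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)
print(f"TSTR accuracy: {tstr_score(X_syn, y_syn, X_real, y_real):.3f}")
```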

Application-specific evaluation . Depending on the specific use case or domain, tailored evaluation methods may be employed to assess the quality of synthesized data. These evaluation methods can consider the unique requirements or constraints of the application, such as regulatory compliance, privacy concerns, or specific performance metrics. By evaluating the synthesized data in the context of its intended use, a more accurate assessment of its quality and applicability can be obtained.

These evaluation strategies offer various ways to gauge the quality of synthesized data, helping researchers and practitioners determine the effectiveness of synthetic data generation methods and their applicability in real-world scenarios. Employing a combination of these strategies can provide a more comprehensive understanding of the strengths and weaknesses of the synthesized data, facilitating further improvements in synthetic data generation techniques [208] .

IX Challenges and Opportunities

The aim of this research is to present a comprehensive survey of synthetic data generation—a promising and emerging technique in contemporary deep learning. This survey outlines current real-world applications and identifies potential avenues for future research in this field. The utilization of synthetic data has been proven effective across a diverse array of tasks and domains [9] . In this section, we delve into the challenges and opportunities presented by this rapidly evolving area.

First and foremost, evaluation metrics for synthetic data are essential to determine the reasonableness of the generated data. In industries like healthcare, where data quality is of paramount importance, clinical quality measures and evaluation metrics are not always readily available for synthetic data. Clinicians often struggle to interpret existing criteria such as probability likelihood and divergence scores when assessing generative models [68] . Concurrently, there is a pressing need to develop and adopt specific regulations for the use of synthetic data in medicine and healthcare, ensuring that the generated data meets the required quality standards while minimizing potential risks.

Secondly, due to limited attention and the challenges associated with covering various domains using synthetic data, current methods might not account for all outliers and corner cases present in the original data. Investigating outliers and regular instances and their impact on the parameterization of existing methods could be a valuable research direction [209] . To enhance future detection methods, it may be beneficial to examine the gap between the performance of detection methods and a well-designed evaluation matrix, which could provide insights into areas that require improvement.

Thirdly, synthetic data generation may involve underlying models with inherent biases, which might not be immediately evident [92] . Factors such as sample selection biases and class imbalances can contribute to these issues. Typically, algorithms trained with biases in sample selection may underperform when deployed in settings that deviate significantly from the conditions in which the data was collected [68] . Thus, it is crucial to develop methods and strategies that address these biases, ensuring that synthetic data generation leads to more accurate and reliable results across diverse applications and domains.

In general, the use of synthetic data is becoming a viable alternative to training models with real data due to advances in simulations and generative models. However, a number of open challenges need to be overcome to achieve high performance. These include the lack of standard tools, the difference between synthetic and real data, and how much machine learning algorithms can do to exploit imperfect synthetic data effectively. Though this emerging approach is not perfect now, with models, metrics, and technologies maturing, we believe synthetic data generation will make a bigger impact in the future.

X Conclusion

In conclusion, machine learning has revolutionized various industries by enabling intelligent computer systems to autonomously tackle tasks and to manage and analyze massive volumes of data. However, machine learning faces several challenges, including data quality, data scarcity, and data privacy and fairness. These challenges can be addressed through synthetic data generation, which produces artificially annotated information via computer algorithms or simulations. Synthetic data has been extensively utilized in various sectors due to its ability to bridge gaps, especially when real data is either unavailable or must be kept private due to privacy or compliance risks.

This paper has provided a high-level overview of several state-of-the-art approaches currently being investigated by machine learning researchers for synthetic data generation. We have explored different real-world application domains, and examined a diverse array of deep neural network architectures and deep generative models dedicated to generating high-quality synthetic data.

To sum up, synthetic data generation has enormous potential for unlocking the full potential of machine learning and its impact on various industries. While challenges persist in the development and application of machine learning technology, synthetic data generation provides a promising solution that can help address these obstacles. Future research can further enhance the functionality of synthetic data generation.

Acknowledgement. The authors acknowledge partial support by the xxx

References