Synthetic Data: The Generative AI Cure for Medical Advancements

The modern-day party trick: name any image you want to see, any at all… A painting resembling Starry Night with a hamster as the centrepiece? Voilà, DALL-E 2 can produce it (Figure 1). Or perhaps, rather than an image, a song about how much a possum loves a banana? Easy, ChatGPT can do it within seconds.

These programmes are examples of generative AI. In the last blog we discussed how machine learning could revolutionize the healthcare system; in this blog we focus on the contribution that generative AI can make to the field of medicine, specifically the generation of synthetic data as a form of privacy-enhancing technology.

Privacy and data governance pose one of the greatest barriers to the implementation of digital technologies in healthcare. Balancing the need to maintain patients’ privacy with the demand for data to train machine learning models appears to be an impossible challenge.

There are serious concerns involved in giving tech companies and external organizations access to sensitive patient data. While anonymizing the data can to some extent secure a patient’s privacy, it is not a foolproof method, because the remaining data could itself lead to individuals being reidentified.[1] The threat posed by reidentification is heightened if the data concerns rare diseases or small sample sizes. A number of controversial reidentification cases have eroded public trust in the process, despite the fact that, to date, none have taken place within practice or clinical trial settings. Furthermore, anonymizing data can involve removing values, which reduces the utility of the data.[2]

Synthetic data may be the technological solution to these privacy issues. A generative AI model is trained on real data and then creates new data based on the patterns and complexities it has learnt. The synthetic data is then provided to third parties to train machine learning models or to use for research, without the fear of reidentification. Because only the generative AI learns from the real data, secondary researchers and machine learning models only ever have access to synthetic data and not confidential patient data. Beaulieu-Jones et al. (2019) found that “synthetic data can be shared with others, enabling them to perform hypothesis-generating analyses as though they had the original trial data”.[3] A number of trials have shown that the synthetic data produced is similar enough to the trial data that statistical and/or machine learning analyses of the two produce the same results, making synthetic data an effective sample on which to train machine learning models.
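To make the idea concrete, here is a minimal sketch in Python. It is a toy illustration, not the method used in the studies cited: a two-variable Gaussian stands in for the far more powerful generative models discussed here, the "real" records (age and systolic blood pressure) are made up for the example, and the function names are hypothetical. A model this simple also offers no privacy guarantee on its own. The point is only to show the workflow: learn the patterns from real data, generate new records, and check that an analysis gives a similar answer on both.

```python
import math
import random
import statistics


def fit_gaussian(rows):
    """Learn means, standard deviations and correlation from (x, y) records."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.pstdev(xs), statistics.pstdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / len(rows)
    return {"mx": mx, "my": my, "sx": sx, "sy": sy, "rho": cov / (sx * sy)}


def sample_synthetic(model, n, rng):
    """Draw n new (x, y) records from the fitted model; no real record is copied."""
    rho = model["rho"]
    records = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = model["mx"] + model["sx"] * z1
        # Mix z1 into y so that the learnt correlation is reproduced.
        y = model["my"] + model["sy"] * (rho * z1 + math.sqrt(1 - rho ** 2) * z2)
        records.append((x, y))
    return records


rng = random.Random(0)

# Made-up stand-in for confidential records: (age, systolic blood pressure).
real = []
for _ in range(2000):
    age = rng.gauss(55, 12)
    real.append((age, 90 + 0.6 * age + rng.gauss(0, 8)))

model = fit_gaussian(real)                     # the only step that sees real data
synthetic = sample_synthetic(model, 2000, rng)  # what a third party would receive
synthetic_model = fit_gaussian(synthetic)       # the same analysis, run on synthetic data

print("correlation in real data:     ", round(model["rho"], 3))
print("correlation in synthetic data:", round(synthetic_model["rho"], 3))
```

In this sketch the third party only ever handles `synthetic`, yet the relationship between age and blood pressure that they measure should be close to the one in the original records.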

There are a number of concerns involved in using synthetic data. A primary one is that the synthetic data may not pick up on the complexities of the real data, making it different in important ways. Even if it succeeds in picking up on the primary patterns in the data, one must be sure that no biases or errors exist in the system that would lead to important variables being overlooked, particularly because those analysing the synthetic data would not have access to the real data to compare it to. There is therefore a risk that we won’t even be able to identify these inaccuracies. Research into how to improve synthetic data models is already being conducted but must continue.[4] A generic quality assessment metric for synthetic data should also be developed.[5]
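A very simple version of such a check can be sketched in Python. This is an illustrative assumption rather than an established metric: the function names and the made-up data are hypothetical, and a real assessment would look at far more than averages and one correlation. The sketch builds a deliberately poor generator that reproduces each variable on its own but loses the relationship between them, then shows how a comparison run by the data holder (who does have the real data) would flag the problem.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation between two equally long sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def fidelity_gaps(real, synthetic):
    """Compare simple summary statistics of two lists of (x, y) records."""
    rx, ry = zip(*real)
    sx, sy = zip(*synthetic)
    return {
        "mean_x_gap": abs(statistics.fmean(rx) - statistics.fmean(sx)),
        "mean_y_gap": abs(statistics.fmean(ry) - statistics.fmean(sy)),
        "correlation_gap": abs(pearson(rx, ry) - pearson(sx, sy)),
    }


rng = random.Random(1)

# Made-up "real" data in which y depends strongly on x.
real = []
for _ in range(2000):
    x = rng.gauss(0, 1)
    real.append((x, 0.8 * x + rng.gauss(0, 0.6)))

# A deliberately poor generator: it matches each column's average and spread
# but samples the columns independently, so the relationship is lost.
xs, ys = zip(*real)
poor = [
    (rng.gauss(statistics.fmean(xs), statistics.pstdev(xs)),
     rng.gauss(statistics.fmean(ys), statistics.pstdev(ys)))
    for _ in range(2000)
]

gaps = fidelity_gaps(real, poor)
print(gaps)
```

The averages look fine, so a check on those alone would pass, but the large correlation gap reveals that an important pattern in the real data has been overlooked.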

While the importance of rigorous testing must be emphasized, this should not discourage the development of synthetic data. Its potential to ensure privacy and anonymity, as well as to enable technological advancements through machine learning, is of immense value. We are often taught to fear the unknown, particularly when it comes to AI, but there is no halting development. Our only hope is to shape it into developments which are rigorously tested and regulated, and which reflect the future that we want for ourselves.


[1] Lubarsky, B., 2010. Re-identification of “anonymized” data. Georgetown Law Technology Review. Available online: https://www.georgetownlawtechreview.org/re-identification-of-anonymized-data/GLTR-04-2017.

[2] Park, N., Mohammadi, M., Gorde, K., Jajodia, S., Park, H. and Kim, Y., 2018. Data synthesis based on generative adversarial networks. arXiv preprint arXiv:1806.03384.

[3] Beaulieu-Jones, B.K., Wu, Z.S., Williams, C., Lee, R., Bhavnani, S.P., Byrd, J.B. and Greene, C.S., 2019. Privacy-preserving generative deep neural networks support clinical data sharing. Circulation: Cardiovascular Quality and Outcomes, 12(7), p.e005122.

[4] Dankar, F.K. and Ibrahim, M., 2021. Fake it till you make it: Guidelines for effective synthetic data generation. Applied Sciences, 11(5), p.2158.

[5] Murtaza, H., Ahmed, M., Khan, N.F., Murtaza, G., Zafar, S. and Bano, A., 2023. Synthetic data generation: State of the art in health care domain. Computer Science Review, 48, p.100546.

Written by Celene Sandiford, smartR AI
