What is synthetic data?
Synthetic data consists of faux, computer-generated random observations that mimic real data. Synthetic data is generated through a computational model, and the data links to real data only through its statistical properties.
Synthetic data aims to be an easy and secure drop-in replacement for real observations with similar details, distribution, and even occasional errors. Most notably, the computational model is designed not to reveal any sensitive details of the original data it simulates.
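The idea that synthetic data links to real data only through its statistical properties can be illustrated with a minimal sketch. It assumes the simplest possible computational model, a multivariate normal distribution described by a mean vector and a covariance matrix. The "real" table, its two columns (age and systolic blood pressure), and all numbers are hypothetical stand-ins simulated for this example. A model this simple gives no formal privacy guarantee and is not the method used by the authors.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical stand-in for a confidential table with two columns:
# age (years) and systolic blood pressure (mmHg). The values are simulated
# for this example and do not come from any real register.
real = rng.multivariate_normal(
    mean=[55.0, 130.0],
    cov=[[120.0, 60.0], [60.0, 225.0]],
    size=2000,
)

# The "computational model": here only the sample mean and covariance.
model_mean = real.mean(axis=0)
model_cov = np.cov(real, rowvar=False)

# Synthetic observations are drawn from the model, not copied from `real`.
synthetic = rng.multivariate_normal(model_mean, model_cov, size=2000)

# Compare summary statistics of the two data sets.
print("real mean:     ", real.mean(axis=0).round(1))
print("synthetic mean:", synthetic.mean(axis=0).round(1))
print("real corr:     ", np.corrcoef(real, rowvar=False)[0, 1].round(2))
print("synthetic corr:", np.corrcoef(synthetic, rowvar=False)[0, 1].round(2))
```

The synthetic rows share the means and the correlation of the original table up to sampling noise, but no individual row of the original is carried over.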
Why synthetic data?
Especially in the health sector, real data can be unavailable or hard to use because of legal restrictions. These include limits on the intended use and prohibitions on merging the data with other data sets or disclosing any portion of it for software testing.
Even if real data is available, the security of personal information is a concern. In academic research, confidential data can be protected to some degree with pseudonymization, where direct identifiers, such as names and social security numbers, are replaced with codes that are known only to the data provider. Another powerful option is anonymization, where the data is scrambled beyond recognition, even for the original register keeper who performs the anonymization. The EU General Data Protection Regulation (GDPR) requires this strong property of anonymized data, which makes anonymizing multidimensional health data very challenging.
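Pseudonymization as described above can be sketched in a few lines. The example assumes that the codes are derived with a keyed hash (HMAC-SHA256) and that the key is held only by the data provider. This is one common way to produce such codes, not necessarily the one used in practice at any given register. The field names, the key, and the sample record are all hypothetical.

```python
import hashlib
import hmac

# Assumption: a secret key held only by the data provider.
SECRET_KEY = b"replace-with-a-key-kept-by-the-data-provider"

# Hypothetical names of the columns that hold direct identifiers.
DIRECT_IDENTIFIERS = ("name", "ssn")


def pseudonym(value: str, key: bytes = SECRET_KEY) -> str:
    """Return a stable code for `value` that cannot be recomputed without `key`."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # shortened for readability


def pseudonymize(record: dict) -> dict:
    """Replace direct identifiers with codes and keep all other fields as-is."""
    return {
        field: pseudonym(str(value)) if field in DIRECT_IDENTIFIERS else value
        for field, value in record.items()
    }


# Fabricated example record, not a real person.
record = {"name": "Jane Doe", "ssn": "000000-0000", "diagnosis": "I10"}
print(pseudonymize(record))
```

Because the same input always maps to the same code, records of one person can still be linked across tables. Note that the remaining fields are untouched, which is why pseudonymized data protects privacy only to some degree.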
In novel use cases, real data might not even exist. For such cases one can fabricate data with ad hoc computational models, or better, with proper simulation models. Fabricated data makes sense in software development where, for example, user interfaces need to be tested with realistic-looking prototype data.
Synthetic data has great potential in the education of domain experts such as statisticians, computer scientists, and medical doctors. It can also be used for entry-level scientific research, hypothesis testing, software testing, commercial research and development activities, and as a test case for measuring algorithm performance. The advent of privacy regulation has made the need to protect confidential data both unavoidable and generally understood. For these reasons, generating high-quality synthetic data can boost innovation in the health sector and, at the same time, help ensure that public opinion remains favorable toward the responsible use of national health registers.
Turku University Hospital (TYKS)
The Hospital District of Southwest Finland is a public joint municipal authority that produces specialised health care services in several hospitals, the largest of which is Turku University Hospital (TYKS) in Turku. The hospital district offers extensive specialised health care services to the residents of the region, which consists of 28 member municipalities with a total of over 470 000 residents. More than 200 000 people use the services of Turku University Hospital each year.
The hospitals and the other units of the hospital district admit patients from health centres, occupational health services, private practitioners, and the wards of the hospital district, as well as patients referred by the government or insurance companies. TYKS is also used for education, internships, and scientific research, with the motto: "Getting better, every day".
Dr. Arho Virkki is the Chief Data Officer of Turku University Hospital and an Adjunct Professor at the Department of Mathematics and Statistics at the University of Turku. His research interests include machine learning, data mining, and visual analytics for decision support. At Turku University Hospital, his key goals include providing researchers with a safe and secure high-performance computing platform with real-life patient data in pseudonymized or anonymized format to facilitate the development of data-driven precision medicine.
Katariina Perkonoja is a candidate of natural sciences at the Department of Mathematics and Statistics at the University of Turku. She is pursuing her graduate studies while working as a data scientist at Turku University Hospital. Katariina is especially interested in data security and statistical modelling. Her special talent is combining rigorous mathematical statistics with software engineering to design, implement, test, and document scientific algorithms.