Generative AI: A New Milestone in Genetic Data Simulation

INRAE researchers and their partners have published a study in the journal GigaScience showing that generative artificial intelligence (gAI) models can accurately reproduce complex genomic data structures, thereby avoiding the risks associated with sharing sensitive real-world data. This work opens up new avenues for research in human and animal genetics, particularly for studying the links between the genome and biological traits, by reconciling data openness with data confidentiality.

Generating Artificial Genomes from Real-World Data

Modern genetics relies on vast databases containing information derived from DNA sequencing and genotyping. These resources are essential for understanding the links between the genome and biological traits (phenotypes), but their use remains limited by several constraints: the cost of production and storage, difficulties in accessing the data, and associated privacy concerns.

To address these challenges, the researchers explored AI models capable of learning from real-world data to generate synthetic data that is statistically similar to the original. The study focuses on the simulation of genotypes, that is, the genetic variations occurring at different locations in the genome.
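To make "statistically similar" concrete, here is a minimal sketch of one basic check: comparing per-marker allele frequencies between a real and a synthetic genotype table. It rests on several assumptions that do not come from the study. Genotypes are coded as 0, 1 or 2 copies of an allele (a common convention), the "real" cohort is toy data, and the "synthetic" cohort is drawn by a naive independent sampler that merely stands in for a generative model. The function names are hypothetical.

```python
import random

def allele_frequencies(genotypes):
    """Per-marker frequency of the counted allele.

    `genotypes` is a list of individuals, each a list of 0/1/2 allele counts
    (assumed coding, not taken from the study).
    """
    n = len(genotypes)
    m = len(genotypes[0])
    return [sum(row[j] for row in genotypes) / (2 * n) for j in range(m)]

def sample_genotypes(freqs, n_individuals, rng):
    """Toy sampler: each genotype is the sum of two independent allele draws.

    This is only a stand-in for a generative model; it ignores any
    correlation between markers.
    """
    return [
        [(rng.random() < p) + (rng.random() < p) for p in freqs]
        for _ in range(n_individuals)
    ]

rng = random.Random(0)
toy_freqs = [0.1, 0.3, 0.5, 0.7, 0.9]            # hypothetical marker frequencies
real = sample_genotypes(toy_freqs, 2000, rng)     # toy "real" cohort

# Fit the stand-in generator on the "real" cohort, then sample from it.
synthetic = sample_genotypes(allele_frequencies(real), 2000, rng)

# Largest per-marker difference in allele frequency between the two cohorts.
max_gap = max(
    abs(a - b)
    for a, b in zip(allele_frequencies(real), allele_frequencies(synthetic))
)
print("largest allele-frequency gap:", round(max_gap, 4))
```

Matching allele frequencies is only a first-order check. A sampler like this one would fail any test involving correlations between markers, which is precisely the kind of structure a generative model has to learn from the data.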

A Shift in Scale in Genomics

Until now, AI approaches applied to genomics have been limited to partial data, such as gene expression levels or the structure of small segments of the genome. Here, the researchers have taken a new step forward by simulating genotypes across multiple chromosomes, at a scale approaching the entire genome.

They used several families of AI models already widely used in other fields, such as image or text generation: variational autoencoders (VAE), diffusion models (DM), generative adversarial networks (GAN), and an improved version of the latter, called WGAN. Unlike traditional simulation approaches, which require defining complex biological hypotheses or statistical parameters in advance, these gAI models automatically learn the genetic structures present in the data.

The models were trained and tested on large-scale datasets, including all bovine chromosomes (excluding sex and mitochondrial chromosomes) and several human chromosomes. The researchers evaluated these approaches using two large datasets: one comprising more than 93,000 Holstein cows genotyped at over 50,000 genetic markers, and another comprising more than 291,000 individuals from the UK Biobank database.

For cattle, the researchers assessed whether the simulated data could identify known links between genomic variations and milk fat content, an important trait for the dairy industry. For the human data, they focused in particular on links to height.

Preserving Essential Biological Relationships

The results show that certain models, particularly WGANs, accurately reproduce several major characteristics of real genetic data and preserve biologically relevant relationships between the genome and the phenotype. This ability to preserve complex biological relationships is key to the scientific use of these data.

The researchers also show that a genome-wide association study (GWAS), which tests for associations between genetic variants and a phenotype (here, a dairy trait), yields very similar results whether it is conducted on real or artificial data.
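The comparison described above can be sketched as follows: run the same per-marker association scan on two cohorts and check that the signals agree. This is an illustration under loud assumptions, not the authors' pipeline. Both cohorts are toy data, and the "synthetic" one is simply a second independent draw from the same toy process. The phenotype depends on a single hypothetical causal marker, and the score used is n times the squared genotype-phenotype correlation rather than a full GWAS model.

```python
import random

CAUSAL = 7  # hypothetical index of the only marker that affects the toy phenotype

def simulate(n, m, rng):
    """Toy cohort: 0/1/2 genotypes and a phenotype driven by one causal marker."""
    geno = [
        [(rng.random() < 0.3) + (rng.random() < 0.3) for _ in range(m)]
        for _ in range(n)
    ]
    pheno = [1.0 * row[CAUSAL] + rng.gauss(0.0, 1.0) for row in geno]
    return geno, pheno

def association_scores(geno, pheno):
    """Per-marker score n * r^2, with r the genotype-phenotype Pearson correlation."""
    n = len(pheno)
    mean_y = sum(pheno) / n
    var_y = sum((y - mean_y) ** 2 for y in pheno)
    scores = []
    for j in range(len(geno[0])):
        col = [row[j] for row in geno]
        mean_x = sum(col) / n
        var_x = sum((x - mean_x) ** 2 for x in col)
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(col, pheno))
        r2 = cov * cov / (var_x * var_y) if var_x > 0 and var_y > 0 else 0.0
        scores.append(n * r2)
    return scores

def pearson(a, b):
    """Pearson correlation between two equally long score vectors."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

rng = random.Random(1)
scores_real = association_scores(*simulate(1000, 20, rng))       # toy "real" cohort
scores_synthetic = association_scores(*simulate(1000, 20, rng))  # stand-in for artificial data

top_real = max(range(20), key=scores_real.__getitem__)
top_synthetic = max(range(20), key=scores_synthetic.__getitem__)
concordance = pearson(scores_real, scores_synthetic)
print("top marker (real, synthetic):", top_real, top_synthetic)
print("correlation of association scores:", round(concordance, 3))
```

In a real analysis the scan would typically use a mixed model with covariates and a genome-wide significance threshold. The point here is only the shape of the comparison: the same scan, run twice, with the resulting association profiles compared marker by marker.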

Outlook: Toward Genomics Enhanced by Synthetic Data

The use of synthetic data opens up concrete possibilities:

  • overcoming barriers to accessing genetic data that is sensitive or costly to produce;
  • ensuring the confidentiality of individual data while preserving useful statistical properties;
  • gaining access to massive volumes of data to train and test new models.

The authors emphasize, however, that several challenges remain, particularly in better replicating rare genetic variants and accounting for population diversity.

By combining methodological robustness with ethical considerations, this breakthrough illustrates the potential of generative artificial intelligence as a transformative force for genetic research.

Reference: Sihan Xie, Thierry Tribout, Didier Boichard, Blaise Hanczar, Julien Chiquet, Eric Barrey, Learning inherent genetic patterns and trait associations with deep generative models for discrete genotype simulation, GigaScience, Volume 15, 2026, giag044, https://doi.org/10.1093/gigascience/giag044

All of the code developed as part of this study has been released as open source in the spirit of open science, promoting reproducibility and future developments in this field.

Contacts:

  • Eric Barrey, Research Director, UMR GABI
  • Julien Chiquet, Research Director, UMR MIA Paris-Saclay
  • Blaise Hanczar, University Professor, IBISC Laboratory, University of Évry Paris-Saclay