Arabidopsis Dataset

A Multimodal Dataset on the Genome of the Model Plant Arabidopsis thaliana with Application Towards Hypergraph Learning

Abstract

The genome of an organism offers a wealth of information vital to understanding how it reacts to outside stimuli. The model plant Arabidopsis thaliana has been critical to developing biology researchers' understanding of other plants, including crop plants. Meanwhile, hypergraphs and deep learning have in recent years seen substantial progress toward understanding datasets with higher-order relationships. Gene data and relationships within the genome fit well into the hypergraph structure, whose hyperedges naturally represent the biological functions that genes are known to contribute to. This is more intuitive than the pairwise edges of simple graphs, where it is less clear how genes should be connected. To the best of our knowledge, no prior work applies hypergraphs and deep learning to a complete set of genes as the nodes. To connect the biology and deep learning communities, we bring together different sources of gene and function information into one data package. We describe what we call the Arabidopsis dataset and the transcriptomic data we have assembled from Arabidopsis plants. We then provide baseline experimental results to showcase how hypergraph models learn correlations among gene features to predict up- or down-regulation of gene expression. We further justify using hypergraphs over graphs for this dataset. Finally, to address the challenges of our dataset, we discuss the experimental results and offer advice for future directions.

Highlights

We organize a dataset of gene information and relevant biological function information concerning Arabidopsis thaliana from multiple public sources, and release it as a single package.
On top of the gene information, we have assembled transcriptomic data from 24 Arabidopsis plant specimens, comprising mutant and control plants.
We conduct experiments using simple-graph and hypergraph backbones and show that hypergraphs, with their more compact edge representation, are able to learn across connections of genes, while simple graphs struggle to do so.

Arabidopsis Dataset Statistics

The hypergraph data structure allows for a more compact representation than the simple graph.

38,286 genes with text descriptions, gene nucleotide sequences, and transcriptomics features.
7,079 biological functions in Arabidopsis that genes contribute to. These naturally translate into hyperedges.
149,864 total connections between all genes and all hyperedges.
861,888 total edges between all genes for the simple graph when connecting each gene to its top-10 most similar nodes. We run into memory and storage issues when adding more simple edges than this.
We measured gene expression for 24 Arabidopsis plant specimens.
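
The two edge counts above can be illustrated on a toy example. The sketch below is only an assumption-laden illustration: the gene IDs, function names, and feature vectors are made up, and cosine similarity is an assumed similarity measure (the text does not state which one is used to pick the top-10 most similar nodes). It counts gene-to-hyperedge connections for a hypergraph and the edges produced by a top-k similarity graph.

```python
from math import sqrt

# Hypothetical toy data: gene IDs and function names are invented for illustration.
hyperedges = {
    "photosynthesis": {"g1", "g2", "g3"},
    "stress_response": {"g2", "g4"},
    "lipid_metabolism": {"g3", "g4", "g5"},
}

# Hypothetical 2-dimensional gene features (the real dataset has richer features).
features = {
    "g1": [1.0, 0.0],
    "g2": [0.9, 0.1],
    "g3": [0.5, 0.5],
    "g4": [0.1, 0.9],
    "g5": [0.0, 1.0],
}


def count_connections(hyperedges):
    """Total gene-to-hyperedge memberships (nonzeros of the incidence matrix)."""
    return sum(len(members) for members in hyperedges.values())


def cosine(u, v):
    """Cosine similarity between two non-zero vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def top_k_edges(features, k):
    """Directed edges from each gene to its k most similar other genes."""
    edges = []
    for gene, vec in features.items():
        others = [h for h in features if h != gene]
        ranked = sorted(others, key=lambda h: cosine(vec, features[h]), reverse=True)
        edges.extend((gene, h) for h in ranked[:k])
    return edges


n_connections = count_connections(hyperedges)
knn_edges = top_k_edges(features, k=2)
print(n_connections, len(knn_edges))
```

In this toy case the hypergraph stores one entry per gene-function membership, while the top-k graph stores k edges per gene regardless of how many functions the genes share, which is the kind of growth that the simple-graph edge count above reflects.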

Transcriptomics Data

For the transcriptomic data, we divide the 24 plants into four groups of specimens (one control group and three mutant groups), described here. Each of the four groups contains six plants. We focus on three individual genes: FAD7, EX1, and EX2.

Col: The control group; no mutations.
FAD7: Mutant in the FAD7 gene only.
EX1EX2: Mutant in the EX1 and EX2 genes.
FAD7EX1EX2: Mutant in all three focus genes.

We compare these four groups in six comparison scenarios. For each scenario, we measure how gene expression changes (upregulated, downregulated, or no change) and train graph neural networks on these scenarios.

EX1EX2 vs Col
EX1EX2 vs FAD7
FAD7EX1EX2 vs Col
FAD7EX1EX2 vs EX1EX2
FAD7EX1EX2 vs FAD7
FAD7 vs Col
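
The six scenarios above are exactly the unordered pairs of the four groups. The following sketch enumerates them and shows one possible way to turn an expression change into the three labels. The log2 fold-change threshold and the `regulation_label` helper are hypothetical; the text does not state how the up/down/no-change labels are derived.

```python
from itertools import combinations

GROUPS = ["Col", "FAD7", "EX1EX2", "FAD7EX1EX2"]


def comparison_pairs(groups):
    """All unordered pairs of groups; four groups give six comparisons."""
    return list(combinations(groups, 2))


def regulation_label(log2_fold_change, threshold=1.0):
    """Map a log2 fold change to a label.

    The threshold of 1.0 is an assumed cutoff for illustration only.
    """
    if log2_fold_change >= threshold:
        return "up"
    if log2_fold_change <= -threshold:
        return "down"
    return "no change"


pairs = comparison_pairs(GROUPS)
for a, b in pairs:
    print(f"{b} vs {a}")
```

Note that `combinations` yields each pair once in list order, so the printed direction of a comparison (which group is the baseline) may differ from the listing above.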

A visualization of how the biological functions that genes in Arabidopsis thaliana contribute to translate into the hypergraph structure: each group of genes sharing a function becomes one hyperedge. Shapes are for illustrative purposes.

The training procedure. Some nodes are indexed for training and others are held out to evaluate the model, while all nodes aggregate information from their edge neighborhoods.
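
The procedure in this caption can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the model used in the experiments: the split fraction, the `split_nodes` helper, and the mean-based node-to-hyperedge-to-node aggregation are generic stand-ins chosen for clarity. The point it shows is that every node takes part in aggregation, while only the training subset would contribute to the loss.

```python
import random


def split_nodes(nodes, train_fraction=0.8, seed=0):
    """Randomly index nodes into training and evaluation sets (assumed 80/20 split)."""
    rng = random.Random(seed)
    shuffled = sorted(nodes)  # sort first so the split is reproducible
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return set(shuffled[:cut]), set(shuffled[cut:])


def hypergraph_mean_pass(node_values, hyperedges):
    """One round of node -> hyperedge -> node mean aggregation over ALL nodes."""
    edge_values = {
        name: sum(node_values[n] for n in members) / len(members)
        for name, members in hyperedges.items()
    }
    updated = {}
    for node, value in node_values.items():
        incident = [edge_values[e] for e, members in hyperedges.items() if node in members]
        # Nodes that belong to no hyperedge keep their own value.
        updated[node] = sum(incident) / len(incident) if incident else value
    return updated


# Hypothetical toy hypergraph and scalar node features.
hyperedges = {"f1": {"a", "b"}, "f2": {"b", "c"}}
node_values = {"a": 1.0, "b": 3.0, "c": 5.0}

train_nodes, eval_nodes = split_nodes(node_values, train_fraction=0.67)
updated = hypergraph_mean_pass(node_values, hyperedges)

# Only training nodes would feed the loss; evaluation nodes are scored separately.
train_outputs = {n: updated[n] for n in train_nodes}
eval_outputs = {n: updated[n] for n in eval_nodes}
```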
