A Multimodal Dataset on the Genome of the Model Plant Arabidopsis thaliana with Application Towards Hypergraph Learning


An illustration of representing a network of genes using simple graphs and hypergraphs. (A) Nodes representing individual genes contain information specific to the corresponding gene and are initially unconnected. (B) A simple graph, where edges can join any two nodes, is one method of connecting all the genes in Arabidopsis thaliana. (C) A hypergraph joins gene nodes based on each gene's known contribution to several biological functions. A hyperedge can join any number of nodes, and any node can be a member of several hyperedges at once, since a gene may contribute to different functions of a biological system; hence the overlap of some nodes shown.

Abstract

The genome of an organism offers a wealth of information vital to understanding how it reacts to outside stimuli. The model plant Arabidopsis thaliana has been critical to developing biology researchers' understanding of other plants, including crop plants. Meanwhile, in recent years, hypergraphs and deep learning have seen substantial progress toward understanding datasets with higher-order relationships. Gene data and relationships within the genome fit well into the hypergraph structure, whose hyperedges naturally represent the biological functions that genes are known to contribute to. This is more intuitive than the pairwise edges of simple graphs, where connecting genes is not as straightforward. To the best of our knowledge, no prior work applies hypergraphs and deep learning to a complete set of genes as the nodes. To connect the biology and deep learning communities, we bring together different sources of gene and function information into one data package. We describe what we call the Arabidopsis dataset and the transcriptomic data we have assembled from Arabidopsis plants. We then provide baseline experimental results to showcase how hypergraph models learn correlations among gene features to predict the up- or down-regulation of gene expression. We further provide justification for using hypergraphs over graphs for this dataset. Finally, to address the challenges of our dataset, we discuss the experimental results and offer advice for future directions.

Highlights

  • We organize gene information and relevant biological function information on Arabidopsis thaliana from multiple public sources and release it as one package.
  • On top of the gene information, we have assembled transcriptomic data from 24 Arabidopsis plant specimens.
  • We conduct experiments using simple-graph and hypergraph backbones and show that hypergraphs, with their more compact edge representation, are able to learn across connections of genes, while simple graphs struggle to do so.

A visualization of how the biological functions that genes in Arabidopsis thaliana contribute to translate into the hypergraph structure: each group of genes sharing a function becomes one hyperedge. Shapes are for illustrative purposes.


An illustration of the multimodal data afforded by our Arabidopsis dataset. Genes have corresponding simple textual descriptions that indicate what their function is (along with identifying information, not shown here). Genes also have associated nucleotide sequences, which can be encoded for deep learning models. (There are also more sequences available, not shown here.) Finally, we add our transcriptomics data, where we measured gene expression levels for all genes across different plant specimens. In a preprocessing phase, each of these components is encoded or transformed into part of a row vector, which gives us the gene node features.
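
As one illustration of how a nucleotide sequence could be encoded numerically, the sketch below computes normalized k-mer frequencies. This is an assumption made for illustration only: the function name `kmer_counts` and the choice of k-mer counting are hypothetical, and the dataset's actual sequence encoding may differ (for example, a learned sequence embedding).

```python
from itertools import product


def kmer_counts(sequence, k=3):
    """Return normalized k-mer frequencies for a nucleotide sequence.

    Hypothetical encoder for illustration; not necessarily the encoding
    used for the Arabidopsis dataset.
    """
    alphabet = "ACGT"
    vocab = ["".join(p) for p in product(alphabet, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(vocab)}
    counts = [0] * len(vocab)
    seq = sequence.upper()
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        if kmer in index:  # skip windows containing ambiguous bases such as N
            counts[index[kmer]] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(vocab)


# A short made-up sequence; with k=2 the vector has 4**2 = 16 entries.
features = kmer_counts("ATGGCGATG", k=2)
print(len(features))
```

A fixed-length vector like this can sit alongside the text and transcriptomics components in a gene's feature row, regardless of how long the original sequence is.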

Arabidopsis Dataset Statistics

The hypergraph data structure allows for a more compact representation compared to the simple graph.

  • 38,286 genes with text descriptions, gene nucleotide sequences, and transcriptomics features.
  • 7,079 biological functions in Arabidopsis that genes contribute to; these naturally translate into hyperedges.
  • 149,864 total connections between all genes and all hyperedges.
  • 861,888 total edges between all genes in the simple graph when each node is connected to its top-10 most similar nodes. We run into memory and storage issues when adding more simple edges.
  • We measured gene expression for 24 Arabidopsis plant specimens.
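
To make the compactness comparison concrete, the sketch below builds both structures on toy data: a sparse incidence list with one entry per gene-function membership, and a simple graph that connects each node to its k most similar nodes. The gene identifiers, function names, feature values, and the use of cosine similarity are all assumptions for illustration; the similarity measure behind the dataset's top-10 graph is not stated here.

```python
import math

# Toy gene-to-function annotations (hypothetical identifiers, not from the dataset).
function_members = {
    "photosynthesis": ["g1", "g2", "g3"],
    "stress_response": ["g2", "g4"],
    "lipid_metabolism": ["g3", "g4", "g5"],
}

genes = sorted({g for members in function_members.values() for g in members})
gene_index = {g: i for i, g in enumerate(genes)}

# Sparse incidence list: one (node, hyperedge) pair per gene-function membership.
incidence = [
    (gene_index[g], e)
    for e, members in enumerate(function_members.values())
    for g in members
]


def cosine(u, v):
    """Cosine similarity between two feature vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_edges(features, k):
    """Connect every node to its k most similar nodes (directed edges)."""
    edges = []
    for i, fi in enumerate(features):
        scored = [(cosine(fi, fj), j) for j, fj in enumerate(features) if j != i]
        scored.sort(reverse=True)
        edges.extend((i, j) for _, j in scored[:k])
    return edges


# Made-up two-dimensional features for the five toy genes.
toy_features = [[1.0, 0.0], [0.9, 0.1], [0.5, 0.5], [0.1, 0.9], [0.0, 1.0]]
simple_edges = knn_edges(toy_features, k=2)

print(len(incidence), "hypergraph connections")  # 8
print(len(simple_edges), "simple-graph edges")   # 10
```

With the counts reported above, the simple graph stores 861,888 / 149,864 ≈ 5.75 times as many connections as the hypergraph. The brute-force neighbor search in this sketch also compares every pair of nodes, which becomes expensive at 38,286 genes.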

Transcriptomics Data

For the transcriptomic data, the 24 plant specimens fall into four groups of six plants each: one control group and three mutant groups, described here. We focus on three individual genes: FAD7, EX1, and EX2.

  • Col: The control group; no mutations.
  • FAD7: Mutant in the FAD7 gene only.
  • EX1EX2: Mutant in the EX1 and EX2 genes.
  • FAD7EX1EX2: Mutant in all three focus genes.

We compare these four groups in six comparison scenarios, one for each pair of groups. For each scenario, we measure how gene expression changes (upregulated, downregulated, or no change) and train graph neural networks to predict that change.

  • EX1EX2 vs Col
  • EX1EX2 vs FAD7
  • FAD7EX1EX2 vs Col
  • FAD7EX1EX2 vs EX1EX2
  • FAD7EX1EX2 vs FAD7
  • FAD7 vs Col
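
The six scenarios above can be sketched as a labeling step that assigns each gene one of three classes per comparison. This is a minimal sketch under stated assumptions: the log2 fold change of group means, the threshold of 1.0, the pseudocount, and the expression values are all illustrative and are not taken from the dataset. A real analysis would typically use a statistical test across replicates rather than a fixed cutoff.

```python
import math
from statistics import mean

# The six comparison scenarios, as (treatment, baseline) pairs.
SCENARIOS = [
    ("EX1EX2", "Col"),
    ("EX1EX2", "FAD7"),
    ("FAD7EX1EX2", "Col"),
    ("FAD7EX1EX2", "EX1EX2"),
    ("FAD7EX1EX2", "FAD7"),
    ("FAD7", "Col"),
]


def regulation_label(treatment, baseline, threshold=1.0, pseudocount=1.0):
    """Label one gene as 'up', 'down', or 'none' from replicate expression values.

    Assumption: log2 fold change of group means with a fixed threshold.
    The threshold and pseudocount are illustrative choices.
    """
    ratio = (mean(treatment) + pseudocount) / (mean(baseline) + pseudocount)
    log2_fc = math.log2(ratio)
    if log2_fc >= threshold:
        return "up"
    if log2_fc <= -threshold:
        return "down"
    return "none"


# Made-up expression values for a single gene: six plants per group.
expression = {
    "Col": [10, 12, 11, 9, 10, 11],
    "FAD7": [11, 10, 12, 10, 9, 11],
    "EX1EX2": [40, 44, 39, 42, 41, 43],
    "FAD7EX1EX2": [3, 2, 4, 3, 2, 3],
}

labels = {
    f"{a} vs {b}": regulation_label(expression[a], expression[b])
    for a, b in SCENARIOS
}
for scenario, label in labels.items():
    print(scenario, "->", label)
```

Each scenario yields its own set of node labels, so the same gene can be upregulated in one comparison and unchanged in another.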

Experimental results for node classification, comparing a framework with a graph convolution backbone against one with a hypergraph convolution backbone. Given different transcriptomics data and additional gene features, all models classify whether a gene has been up-regulated, down-regulated, or neither. We report classification accuracy, along with F1, precision, and recall averaged over all three categories.
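
For readers unfamiliar with class-averaged metrics, the sketch below computes accuracy plus unweighted (macro) averages of precision, recall, and F1 over the three classes. It assumes that "averaged over all three categories" means an unweighted mean; the function name `macro_metrics` and the example labels are hypothetical.

```python
def macro_metrics(y_true, y_pred, classes=("up", "down", "none")):
    """Accuracy and macro-averaged precision, recall, and F1 over the given classes."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Made-up labels for six genes.
y_true = ["up", "up", "down", "none", "none", "none"]
y_pred = ["up", "none", "down", "none", "none", "up"]
metrics = macro_metrics(y_true, y_pred)
print(metrics)
```

Macro averaging weights each class equally, which matters when one class (for example, genes with no change) is much larger than the others.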


The data preprocessing procedure for transforming the raw data into the node features. Each data component is either transformed into a sequence embedding (e.g., text or nucleotides) or normalized (e.g., transcriptomics) before all components are concatenated into the final features. This feature matrix is our input to all hypergraph models.
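
The normalize-then-concatenate step can be sketched as follows. This is a minimal sketch, assuming z-score normalization per plant specimen and precomputed embeddings; the normalization scheme, the function names, and all values are assumptions for illustration.

```python
from statistics import mean, pstdev


def zscore_columns(rows):
    """Standardize each column (one column per plant specimen) across genes."""
    columns = list(zip(*rows))
    stats = [(mean(col), pstdev(col)) for col in columns]
    return [
        [(value - mu) / sigma if sigma else 0.0
         for value, (mu, sigma) in zip(row, stats)]
        for row in rows
    ]


def build_node_features(text_embeddings, sequence_embeddings, expression):
    """Concatenate per-gene components into one feature row per gene."""
    normalized = zscore_columns(expression)
    return [
        t + s + e
        for t, s, e in zip(text_embeddings, sequence_embeddings, normalized)
    ]


# Made-up inputs for three genes: 2-d text embedding, 1-d sequence embedding,
# and expression measured in two specimens.
text_emb = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
seq_emb = [[1.0], [0.0], [0.5]]
expr = [[10.0, 100.0], [20.0, 100.0], [30.0, 100.0]]

X = build_node_features(text_emb, seq_emb, expr)
print(len(X), "genes x", len(X[0]), "features")  # 3 genes x 5 features
```

A constant column (the second specimen here) has zero variance, so the sketch maps it to 0.0 rather than dividing by zero.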


The training procedure. Some nodes are indexed for training and others are held out to evaluate the model, while all nodes aggregate information from their edge neighborhoods.
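
The split between aggregation and supervision can be sketched as below: every node takes part in message passing, but only the training indices would contribute to the loss. The aggregation shown is a simplified, weight-free mean over hyperedges, used here as a stand-in and not as the hypergraph convolution in our models; the function name, features, hyperedges, and index split are all hypothetical.

```python
def hypergraph_mean_aggregate(features, hyperedges):
    """One round of node -> hyperedge -> node mean aggregation (no learned weights).

    Assumes every hyperedge has at least one member node.
    """
    dim = len(features[0])
    # Step 1: each hyperedge averages the features of its member nodes.
    edge_feats = [
        [sum(features[n][d] for n in members) / len(members) for d in range(dim)]
        for members in hyperedges
    ]
    # Step 2: each node averages the features of the hyperedges it belongs to.
    out = []
    for node in range(len(features)):
        incident = [e for e, members in enumerate(hyperedges) if node in members]
        if not incident:
            out.append(list(features[node]))  # isolated node keeps its own features
            continue
        out.append(
            [sum(edge_feats[e][d] for e in incident) / len(incident)
             for d in range(dim)]
        )
    return out


# Made-up data: four gene nodes, two hyperedges that overlap at node 2.
features = [[1.0], [3.0], [5.0], [7.0]]
hyperedges = [[0, 1, 2], [2, 3]]
labels = ["up", "down", "none", "up"]
train_idx, eval_idx = [0, 1], [2, 3]

hidden = hypergraph_mean_aggregate(features, hyperedges)  # uses all four nodes
train_pairs = [(hidden[i], labels[i]) for i in train_idx]  # a loss would use only these
eval_pairs = [(hidden[i], labels[i]) for i in eval_idx]    # held out for evaluation
print(hidden)  # [[3.0], [3.0], [4.5], [6.0]]
```

Node 2 belongs to both hyperedges, so its output mixes information from both groups, while the held-out nodes still shape the representations of the training nodes through shared hyperedges.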

BibTeX

TBD