The genome of an organism offers a wealth of information vital to understanding how it reacts to outside stimuli. The model plant Arabidopsis thaliana has been critical to developing biology researchers' understanding of other plants, including crop plants. Meanwhile, in recent years, hypergraphs and deep learning have made substantial progress toward understanding datasets with higher-order relationships. Gene data and relationships within the genome fit well into the hypergraph structure, where hyperedges naturally represent the biological functions that genes are known to contribute to. This is more intuitive than the pairwise edges of simple graphs, where it is less clear how genes should be connected. To the best of our knowledge, no prior work connects hypergraphs and deep learning using a complete set of genes as the nodes. To connect the biology and deep learning communities, we bring together different sources of gene and function information into one data package. We describe what we call the Arabidopsis dataset, as well as the transcriptomic data we have assembled from Arabidopsis plants. We then provide baseline experimental results to show how hypergraph models learn correlations among gene features to predict the up- or down-regulation of gene expression. We further justify the use of hypergraphs over graphs for this dataset. Finally, to address the challenges of our dataset, we discuss the experimental results and offer advice for future directions.
The hypergraph data structure allows for a more compact representation compared to the simple graph.
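The compactness claim can be illustrated with a small sketch. The function names and gene IDs below are hypothetical placeholders, not entries from the dataset, and clique expansion is only one common way to turn a hypergraph into a simple graph. Under those assumptions, each function is stored once as a set of genes, whereas the pairwise graph needs an edge for every pair of genes that share a function.

```python
from itertools import combinations

# Hypothetical toy hypergraph: each hyperedge is a biological function,
# and its members are the genes known to contribute to that function.
# Names are placeholders, not taken from the Arabidopsis dataset.
hyperedges = {
    "function_A": {"g1", "g2", "g3", "g4"},
    "function_B": {"g3", "g4", "g5"},
    "function_C": {"g1", "g5", "g6", "g7", "g8"},
}


def incidence_size(hyperedges):
    """Number of (gene, function) memberships stored by the hypergraph."""
    return sum(len(members) for members in hyperedges.values())


def clique_expansion(hyperedges):
    """Simple-graph edges obtained by linking every pair of genes in a hyperedge."""
    edges = set()
    for members in hyperedges.values():
        for u, v in combinations(sorted(members), 2):
            edges.add((u, v))
    return edges


hypergraph_entries = incidence_size(hyperedges)
pairwise_edges = clique_expansion(hyperedges)

print(f"hyperedges: {len(hyperedges)}, memberships: {hypergraph_entries}")
print(f"pairwise edges after clique expansion: {len(pairwise_edges)}")
```

A hyperedge with k genes contributes k memberships but up to k(k-1)/2 pairwise edges, so the gap widens as functions involve more genes. The pairwise form also no longer records which function produced a given edge.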
For the transcriptomic data, we have 24 plants divided into four mutant groups of six plants each. We focus on three individual genes: FAD7, EX1, and EX2.
We compare these four groups in six comparison scenarios. For each scenario, we measure how gene expression changes (upregulated, downregulated, or no change) and train graph neural networks on these scenarios.
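One way to picture the setup is sketched below. Six is the number of unordered pairs that can be formed from four groups, so the sketch assumes the scenarios are the pairwise group comparisons; the text does not state this explicitly. The group names, the use of a log2 fold change, and the threshold value are likewise illustrative assumptions, not details of the dataset.

```python
from itertools import combinations

# Placeholder group names; the actual mutant groups are not named here.
groups = ["group_1", "group_2", "group_3", "group_4"]

# Assumption: each scenario compares one pair of groups.
scenarios = list(combinations(groups, 2))


def label_regulation(log2_fold_change, threshold=1.0):
    """Map an expression change to one of three classes.

    The threshold is an assumed cutoff chosen for illustration only.
    """
    if log2_fold_change >= threshold:
        return "up"
    if log2_fold_change <= -threshold:
        return "down"
    return "no_change"


for first, second in scenarios:
    print(f"{first} vs {second}")

# Hypothetical fold-change values for three genes in one scenario.
example_changes = {"gene_a": 2.3, "gene_b": -1.7, "gene_c": 0.2}
example_labels = {g: label_regulation(v) for g, v in example_changes.items()}
print(example_labels)
```

Under this reading, each gene receives one three-class label per scenario, which gives a node-level classification target for the models.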
TBD