Title: Using Machine Learning to Predict Novel Gene Regulatory Interactions During Candida albicans Biofilm Development
Speaker: Akshay D. Paropkari
Abstract: Biofilms, surface-adhered communities of microbial cells, are an important virulence trait of microorganisms and are estimated to cause the majority of infections. Candida albicans is a common fungal pathogen of humans that is capable of forming biofilms. C. albicans biofilm formation has four stages: (a) adherence to a surface, (b) initiation of the formation of distinct morphological cell types, (c) formation of the extracellular matrix, and (d) dispersal of cells to seed new sites. Transcription factors (TFs) are crucial in controlling gene expression during many developmental pathways, but the comprehensive roles of the many TFs involved in C. albicans biofilm development have yet to be characterized. Previously, our lab identified six core TFs required for the formation of mature biofilms in C. albicans: Bcr1, Brg1, Efg1, Ndt80, Rob1, and Tec1.
In this study, we use a validated set of TF binding sites (TFBS) to predict novel TF-gene interactions during biofilm formation in C. albicans. First, target sequences representing potential binding sites were created from previously identified TFBS consensus sequences. The number of sequences for each TF depended on both the number of validated sites and the fidelity of the motifs, and ranged from a few hundred (for Tec1) to over a million (for Rob1). Second, a feature matrix was built to capture the shape and sequence qualities of each candidate TFBS. Next, a positive set of potential TFBS was identified using a support vector classifier. Novel TF-gene interactions were observed for each of the six TFs. Finally, active and inactive TF-gene interactions were identified by correlating the novel TF-gene interactions with time-series gene expression data. By coupling DNA sequence and shape information, we predicted novel gene regulatory interactions occurring during C. albicans biofilm formation. This framework provides a granular view of C. albicans biofilm formation over its full time course.
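To illustrate why the number of candidate sequences can range from hundreds to over a million depending on motif fidelity, the sketch below enumerates every concrete sequence matched by a degenerate consensus written in IUPAC nucleotide codes. This is an assumption about how such candidates could be generated, not the procedure used in the study; the function name `expand_consensus` and the motif `ACGWNR` are hypothetical and do not correspond to any of the six TFs.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_consensus(consensus):
    """Return every concrete DNA sequence matched by an IUPAC consensus.

    Hypothetical helper for illustration only.
    """
    choices = [IUPAC[base] for base in consensus.upper()]
    return ["".join(combo) for combo in product(*choices)]


# Hypothetical motif: three fixed positions, then W (2 options),
# N (4 options), and R (2 options), giving 2 * 4 * 2 sequences.
candidates = expand_consensus("ACGWNR")
print(len(candidates))
```

Each ambiguous position multiplies the candidate count, so a low-fidelity motif with many degenerate positions expands to far more sequences than a well-defined one.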
In this project, we combine genome-wide DNA binding and RNA-seq data to create a prediction framework. Because no negative training data were available, we simulated negative data while preserving relevant biological context. To identify all potential binding sites, we cast a wide net, using BLASTn to obtain TF binding regions. Further analyses and experimental validation of these new interactions will lead to greater insight into the transcriptional control of C. albicans biofilm formation and pathogenesis.
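One common way to simulate negative examples when only positive binding sites are known is to shuffle each positive sequence, which keeps its length and base composition but scrambles the motif. The sketch below shows that idea only; the abstract does not say which simulation strategy was used or which biological context was preserved, so the approach, the function name `shuffled_negatives`, and the example sequences are all assumptions.

```python
import random


def shuffled_negatives(positives, seed=0):
    """Build simulated negatives by shuffling the bases of each positive.

    Hypothetical helper: each negative keeps the length and base
    composition of its source sequence. Shuffles that happen to
    reproduce a known positive are discarded.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    known = set(positives)
    negatives = []
    for seq in positives:
        bases = list(seq)
        rng.shuffle(bases)
        shuffled = "".join(bases)
        if shuffled not in known:
            negatives.append(shuffled)
    return negatives


# Made-up example sequences, not real binding sites.
positives = ["ACGTACGT", "TTGACCAA"]
negatives = shuffled_negatives(positives)
```

A dinucleotide-preserving shuffle or sampling from non-bound genomic regions would be alternative choices with different trade-offs.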