Final Projects
Fall 2019
This year, the class worked as a group to predict potential regulators of non-muscle myosin II (NMII) in Drosophila. This is "version 2" of the Fall 2017 course project. The class wrote the IEEE-formatted report using LaTeX.
Bio331 Fall 2019 Group Project Report
Emily Crook (biology major)
Madeline Doak (neuroscience major)
Miffy Guo (biology major)
Jonah Kohn (neuroscience/CS interdisciplinary major)
Tunc Kose (biology/CS interdisciplinary major)
Hannah Mead (computer science major)
Gabe Preising (biology major)
Tobias Rubel Janssen (philosophy major)
Sol Taylor-Brill (biology major)
Karl Young (biology major)
Julia Yuan (biology major)
The class implemented twelve methods to identify candidate regulators. The report describes the following graph algorithms:
- Common neighbors of known NMII regulators (called "positives").
- Proteins with a similar clustering coefficient to positives.
- Random walk using a lazy breadth-first search.
- Low-degree-first search.
- Proteins on the shortest paths among pairs of positives.
- Proteins on short paths from the ligand (Fog) to the regulatory light chain of NMII (Sqh) that include many positives. [view the graph]
- Proteins that increasingly appear in multiple short paths from Fog to Sqh.
- Proteins on the spanning tree of the metric closure of positives.
- Random walk on the minimum Steiner tree approximation. [view the graph]
- Random walk on the minimum Steiner tree approximation after removing known negatives (proteins likely not associated with NMII regulation). [view the graph]
- Gaussian smoothing.
- Node2Vec similarity: proteins that are (cosine) similar to (a) Sqh and (b) multiple positives.
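As an illustration of the simplest method in the list above, here is a minimal sketch of the common-neighbors idea. It is not the class's code: the function name `common_neighbor_candidates`, the `min_shared` threshold, and the toy node labels are all assumptions made for this example. It scores each non-positive protein by how many positives it is directly connected to.

```python
from collections import Counter


def common_neighbor_candidates(adj, positives, min_shared=2):
    """Return {node: count} for non-positive nodes adjacent to at least
    `min_shared` positives. `adj` maps each node to a set of neighbors.
    (Hypothetical helper; the threshold is an assumption.)"""
    positives = set(positives)
    shared = Counter()
    for p in positives:
        for nbr in adj.get(p, ()):
            if nbr not in positives:
                shared[nbr] += 1  # nbr touches one more positive
    return {node: n for node, n in shared.items() if n >= min_shared}


# Toy interactome with placeholder labels (not real proteins).
toy_adj = {
    "pos1": {"c1", "c2"},
    "pos2": {"c1", "c3"},
    "pos3": {"c1", "c2"},
    "c1": {"pos1", "pos2", "pos3"},
    "c2": {"pos1", "pos3"},
    "c3": {"pos2"},
}
toy_candidates = common_neighbor_candidates(toy_adj, ["pos1", "pos2", "pos3"])
```

In this toy graph, `c1` neighbors three positives and `c2` neighbors two, so both pass the default threshold, while `c3` does not.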
These methods intentionally vary in sophistication and usefulness; the class then considered combining the outputs of these methods in different ways to identify a "consensus" set of predictions using shortest paths and random walk algorithms. For example, this graph shows all candidates suggested by the methods, where nodes are colored by frequency of random walk visits (red is low-frequency, blue is high-frequency). The final recommended set of candidates for follow-up experiments in Derek Applewhite's BIO372 course was:
- Gro, Drk, N (suggested by 5 methods)
- Flw (suggested by 4 methods)
- Ci, Ubi-p63E, Spn, Hsp-83 (suggested by 3 methods)
Read the full 9-page final report.
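The "suggested by N methods" counts above amount to a vote across method outputs. The sketch below shows one way such a tally could be computed; it is an assumption about the bookkeeping, not the class's actual consensus procedure (which also used shortest paths and random walks). The function name `consensus`, the method names, and the protein labels `P1` to `P4` are hypothetical placeholders.

```python
from collections import Counter


def consensus(predictions, min_votes):
    """Count how many methods suggest each candidate and keep those with
    at least `min_votes`. `predictions` maps method name -> candidates.
    Results are sorted by votes (descending), then by name."""
    votes = Counter()
    for candidates in predictions.values():
        votes.update(set(candidates))  # each method votes once per candidate
    kept = [(name, n) for name, n in votes.items() if n >= min_votes]
    return sorted(kept, key=lambda item: (-item[1], item[0]))


# Hypothetical method outputs with placeholder protein labels.
toy_predictions = {
    "common_neighbors": ["P1", "P2", "P3"],
    "shortest_paths": ["P1", "P3"],
    "steiner_tree": ["P1", "P4", "P4"],
}
toy_consensus = consensus(toy_predictions, min_votes=2)
```

Converting each method's output to a set before voting keeps a method that lists the same protein twice from counting double.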
Fall 2017
This year, the class worked as a group to computationally predict potential regulators of non-muscle myosin II (NMII) in Drosophila. They worked together to ultimately recommend a list of candidate proteins, some of which were experimentally tested in Derek Applewhite's BIO372 in spring 2018.
Survey of Potential Regulators in the Fog/Non-Muscle Myosin-II Pathway
Miriam Bern (biology major)
Wyatt Gormley (biology major)
Elaine Kushkowski (biology major)
Kathy Thompson (computer science major)
Logan Tibbetts (neuroscience major)
Abstract: Determining proteins that are involved in signaling pathways is essential for understanding complex cellular and developmental processes. However, wet lab approaches for identifying unknown members of a pathway can be costly. Using a computational approach to identify potential regulators from known members of a pathway and previously compiled protein-protein interactome networks saves time and money, and provides potential proteins that can be validated in a laboratory setting. We use Steiner tree approximation and breadth-first search (BFS) algorithms to identify potential regulators of apical constriction that participate in the Fog pathway. Overlap between our applications of Steiner tree approximation and BFS indicates the genes coding for Ubiquitin 63E (fb0003943), Spinophilin (fb0010905), Netrin B (fb0015774), and Casein Kinase IIalpha (fb0264492) as our primary candidates of interest. The extended lists produced by these algorithms also include 27 (Steiner tree approximation) and 26 (BFS) additional candidate genes for a total list of 57 genes of interest.
Read the full 8-page final report.
See the code in the course GitHub repository.
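For readers unfamiliar with the Steiner tree approximation mentioned in the abstract, here is a minimal sketch of the standard metric-closure heuristic on an unweighted graph. It is not the code from the course repository. It assumes unit edge weights, uses BFS for shortest paths, and omits the usual final cleanup (taking a spanning tree of the expanded subgraph and pruning non-terminal leaves), so the result can contain redundant edges on larger graphs. All function names and node labels are hypothetical.

```python
from collections import deque
from itertools import combinations


def bfs_predecessors(adj, source):
    """Unweighted BFS from `source`; returns {node: predecessor}."""
    pred = {source: None}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in pred:
                pred[v] = u
                queue.append(v)
    return pred


def path_to(pred, target):
    """Walk predecessor links back from `target` to the BFS source."""
    path = []
    node = target
    while node is not None:
        path.append(node)
        node = pred[node]
    return path[::-1]


def steiner_tree_approx(adj, terminals):
    """Metric closure over terminals, Kruskal MST on the closure, then
    expand each chosen closure edge into its shortest path.
    Returns a set of undirected edges as frozensets."""
    terminals = list(terminals)
    preds = {t: bfs_predecessors(adj, t) for t in terminals}
    closure = []
    for a, b in combinations(terminals, 2):
        if b in preds[a]:  # skip unreachable pairs
            path = path_to(preds[a], b)
            closure.append((len(path) - 1, a, b, path))
    closure.sort(key=lambda item: item[0])  # shortest connections first

    parent = {t: t for t in terminals}  # union-find over terminals

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree_edges = set()
    for _, a, b, path in closure:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
            tree_edges.update(frozenset(e) for e in zip(path, path[1:]))
    return tree_edges


def build_adjacency(edges):
    """Build {node: set(neighbors)} from an undirected edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


# Toy graph: terminals A, B, C; lowercase nodes are potential candidates.
toy_edges = [("A", "x"), ("x", "B"), ("B", "y"), ("y", "C"),
             ("A", "z1"), ("z1", "z2"), ("z2", "C")]
toy_tree = steiner_tree_approx(build_adjacency(toy_edges), ["A", "B", "C"])
toy_steiner_nodes = set().union(*toy_tree) - {"A", "B", "C"}
```

The non-terminal nodes that end up in the tree (here `x` and `y`) play the role of candidate proteins: they are the intermediates needed to connect the known pathway members cheaply.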
Fall 2016
This year, final projects explored some computational aspect of biological networks. Each project analyzed a real-world biological network (or built a biological network from experimental data) and included a significant Python programming component.
Finding Non-LEE Targets of PerC in E. coli - Vikram Chan-Herur
Diarrheal disease is a major cause of mortality worldwide; pathogenic Escherichia coli, such as the enteropathogenic strain (EPEC), are a cause of diarrhea. In an effort to predict more non-LEE targets of EPEC's PerC regulatory protein from existing RNA-seq differential gene expression data, these data were merged, as weights, into a recently built E. coli K-12 protein-protein interaction network. A novel node-choosing step was incorporated into an existing greedy clique partitioning algorithm and used to find cliques in the network, with varying consideration given to the gene expression data and the node degree. Gene ontology enrichment of the largest resulting cliques showed differences in enriched terms with varying incorporation of expression data. Some of these changes, such as cell cycle terms, were consistent with experimental observations, while others, such as a lack of fimbrial genes, were opposite experimental and RNA-seq-based expectations. Integrating these transcriptomic and proteomic datasets through graphs remains an important question.
Mission Control: An Open Source Usability Package for GraphSpace in Python - Nick Franzese
GraphSpace is a highly customizable platform for graph visualization with a suite of helpful features. These features offer great potential for a variety of academic uses, but actualization of this potential is dampened by usability issues. Formatting data for visualization with GraphSpace can be daunting for those unfamiliar with coding, and can be a chore even for experienced programmers as each graph requires custom built code to visualize. For this reason I created Mission Control, an open source usability package for GraphSpace in Python. The package aims to significantly lower the usability barrier of GraphSpace while maintaining customizability. In the present paper I detail the user API of the Mission Control package and showcase a few graphs that I was able to visualize quickly and effortlessly through the package. [report.pdf] [GitHub Repo]
Node Stability of Matrilineal Groups in a Killer Whale Social Network - Amy Rose Lazarte
The focus of this project was to explore how each killer whale matriline contributed to the stability of an entire whale social network. Centrality measures are commonly used to rank the importance of individual nodes on a graph, and this project attempted to combine several different centrality measures in order to definitively compare each matrilineal group's contribution to the stability of the entire network. While the project originally hoped to draw conclusions about common demographics (size, presence of calves, presence of males, etc.) found in the most essential matrilines, the nodes of interest did not appear to show any demographic similarities. However, matrilines that interacted outside of their pod added the most stability to the graph.
Parallel Programming with Prim's Algorithm - Erik Lopez
Parallel programming can be a very useful way to work through big data sets and get results much more quickly than with a serial implementation of an algorithm. Not only can it be more efficient, but it can also push the architecture of your system to the maximum. I will explore what Python's multithreading and multiprocessing modules can do when applied to an embarrassingly parallel problem and then scale up to Prim's algorithm, a minimum spanning tree graph algorithm. [report.pdf] [GitHub Repo]
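For reference, here is a minimal serial sketch of Prim's algorithm using a binary heap. It is a baseline illustration only, not the project's parallel implementation; the function name `prim_mst` and the toy graph are assumptions made for this example.

```python
import heapq


def prim_mst(adj, start):
    """Serial Prim's algorithm. `adj` maps node -> {neighbor: weight}.
    Returns the tree as a list of (u, v, weight) edges for the
    connected component containing `start`."""
    visited = {start}
    heap = [(w, start, v) for v, w in adj[start].items()]
    heapq.heapify(heap)
    tree = []
    while heap and len(visited) < len(adj):
        w, u, v = heapq.heappop(heap)  # cheapest edge leaving the tree
        if v in visited:
            continue  # stale entry: v was reached by a cheaper edge
        visited.add(v)
        tree.append((u, v, w))
        for nxt, w2 in adj[v].items():
            if nxt not in visited:
                heapq.heappush(heap, (w2, v, nxt))
    return tree


# Toy weighted graph with placeholder labels.
toy_graph = {
    "a": {"b": 1, "c": 3},
    "b": {"a": 1, "c": 2},
    "c": {"a": 3, "b": 2, "d": 1},
    "d": {"c": 1},
}
toy_mst = prim_mst(toy_graph, "a")
toy_mst_weight = sum(w for _, _, w in toy_mst)
```

Each step depends on the tree built so far, which is why Prim's algorithm is harder to parallelize than an embarrassingly parallel problem.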
Modularity and the Louvain Algorithm - Yasmina Marden
I applied the Louvain algorithm to two datasets and found that node ordering had a notable effect on the output clusterings. I then ran simulations with random node order to find the most frequently output clustering. This clustering, however, was not only not the optimum clustering, according to modularity, but a poor clustering with a significantly lower modularity than the optimum clustering. This poor clustering better reflected the optimum clustering if it was parsed for its sub-clusters. In order to generate more accurate results from the algorithm without parsing, I then ran more simulations with random node order and recorded the node pairs that appeared most frequently in the same cluster. Creating a clustering from these most frequent node pairings led to increased accuracy, although results deteriorated after approximately 100 simulations.
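Since the abstract compares clusterings by modularity, a short sketch of that score may help. This is the standard modularity of an unweighted, undirected graph, Q = sum over communities of (L_c / m) - (d_c / 2m)^2, where m is the number of edges, L_c the number of edges inside community c, and d_c the total degree of its nodes. The function name and toy graph are assumptions, and the sketch assumes no self-loops or duplicate edges.

```python
def modularity(edges, communities):
    """Modularity of a partition of an unweighted, undirected graph.
    `edges` is a list of (u, v) pairs; `communities` is a list of
    node sets covering every node that appears in `edges`."""
    m = len(edges)
    community_of = {node: i for i, c in enumerate(communities) for node in c}
    internal = [0] * len(communities)  # L_c: edges inside each community
    degree = [0] * len(communities)    # d_c: total degree per community
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(internal[i] / m - (degree[i] / (2 * m)) ** 2
               for i in range(len(communities)))


# Toy graph: two triangles joined by a single bridge edge.
toy_edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
q_two_triangles = modularity(toy_edges, [{1, 2, 3}, {4, 5, 6}])
q_single_cluster = modularity(toy_edges, [{1, 2, 3, 4, 5, 6}])
```

Splitting the toy graph into its two triangles scores higher than placing every node in one cluster, which always has modularity 0.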
Grouping Badger Social Networks - Karl Menzel
Understanding badger social networks can be important for understanding the distribution of tuberculosis in badger populations. To understand these networks, I used interaction data from 51 badgers in multiple setts. I calculated both node and edge flow-betweenness for the network using the Ford-Fulkerson method. I also attempted to cluster the badgers into the defined social groups using the Girvan-Newman method. The clustering did not fully represent the given social groups using either the raw data or the edge flow-betweenness. [report.pdf] [GitHub Repo]
Clique Finding in Tetraselmis Subcordiformis - Eli Spiliotopoulos
In this project, a predicted protein-protein interactome was seeded with transcriptome read data. The seeded network was then run through a maximal clique algorithm to find maximal cliques that included genes known to be more highly expressed. The goal of this project was to identify groups of more highly expressed genes through these more closely related groups of genes within the larger set. [report.pdf] [GitHub Repo]