CS 5854: Computational Systems Biology

T. M. Murali

Spring 2025, 2:30pm-3:45pm, Mondays and Wednesdays

Data & Decision Sciences 240

About the Course

What is Computational Systems Biology?
What is the focus of this course?
Who should take this course?
Pre-requisites
Introductory Videos
Course structure

What is Computational Systems Biology?

Cells, tissues, organs and organisms are systems of components whose interactions have been defined, refined, and optimised over hundreds of millions of years of evolution. Computational systems biology is a field that aims at a system-level understanding of biological systems by analysing biological data using computational techniques. Systems biology aims to answer the following key questions by integrating experimental and computational approaches:

What are the basic structures and properties of the biological networks in a living cell?
How does a biological system behave over time under various conditions?
How does a biological system maintain its robustness and stability?
How can we modify or construct biological systems to achieve desired properties?

What is the focus of this course?

A focus of the course this year will be the study of foundational machine learning models of systems biology, through which we will learn about many different computational problems of interest in molecular and cellular biology.

Who should take this course?

You should take this course if you are curious to find out how the latest research is shaping our understanding of how the living cell behaves as a system. The course will cover the latest research in computational systems biology. We will spend a significant part of the course on examining how the analysis of single-cell datasets and other high-throughput data is crucial to progress in this area. The course is geared towards graduate students whose main research interest is bioinformatics or who use bioinformatic tools and techniques in their research.

There are many exciting and profound issues that researchers in this area are actively investigating. During this course, we will come across many interesting open research problems. Taking this course might be an excellent way to create research topics and projects for your Master's or Ph.D. thesis in the area of bioinformatics/computational biology. In this course, you will be able to communicate and work with students and researchers with varied backgrounds. In addition, Virginia Tech is humming with research activities in this area.

Pre-requisites

Computer Science graduate students: Data and Algorithm Analysis (CS 4104) or a similar course is a pre-requisite. It will help if you have also taken Algorithms in Bioinformatics (CS 5124) and a course on combinatorics and graph theory such as Applied Combinatorics (MATH 3134). An introductory molecular biology course such as Biological Paradigms for Bioinformatics will provide extremely useful biological background.

Life science graduate students: I expect that you have taken courses in biochemistry, cell biology, and molecular biology. A course like Computation for Life Sciences (CS 5045) provides very useful computational background.

Introductory Videos

For students with computational backgrounds, I have listed some videos below that provide introductions to molecular and cell biology.

The Cell (7:21 min): an overview of cell structure from Nucleus Medical Media

The DNA learning center has created several interesting videos.
- Cell signals (I played this video in class)
- The Central Dogma

The textbook "Biology" by Raven, Johnson, Losos, and Singer has several relevant videos:
- Cell communication
- Gene regulation

iBiology, an NSF- and NIGMS-funded initiative to convey the excitement of modern biology and the process by which scientific discoveries are made, has created several videos.
- A flipped course on Cell Biology
- Introduction to Transcription (19:06 min)
- Controlling the Cell Cycle (three parts)
Experimental techniques

Course structure

The course will primarily be driven by lectures and by seminars where one or more students present a related group of papers from the literature. I will try to arrange papers that cover both biological and computational aspects. Ideally, I would like a group to contain students with backgrounds in computer science, mathematics, and/or statistics and students with backgrounds in biology and chemistry.

Your grade will depend on your presentation (20%), on class participation (30%), and a final project (50%). The final project is a group software project. I will define software projects that are inspired by the papers you present in class. The project will involve creating some new software or using existing software innovatively combined with some intensive biological analysis of the results. You are welcome to suggest a project to me.

Table 1: **Schedule (subject to change throughout the semester).** Links in the "Topic and Papers" column point to specific papers assigned for each class. Links in the "Presenter" column point to the slides for the lecture.
Date	Topic and Papers	Presenter(s)
Jan 22, 2025	Introduction to Computational Systems Biology	T. M. Murali
Jan 27, 2025	Introduction to Computational Systems Biology, continued. Discussion of papers	T. M. Murali
Jan 29, 2025	Pathways on demand: automated reconstruction of human signaling networks	T. M. Murali
Feb 3, 2025	Pathways on demand, continued	T. M. Murali
Feb 5, 2025	Pathways on demand, continued	T. M. Murali
Feb 10, 2025	Pathways on demand, continued	T. M. Murali
Feb 12, 2025	Class cancelled by university	T. M. Murali
Feb 17, 2025	Pathways on demand, continued	T. M. Murali
Feb 19, 2025	VirProBERT: A Language Model for Virus Host Prediction	T. M. Murali
Feb 24, 2025	VirProBERT, continued	T. M. Murali
Feb 26, 2025	VirProBERT, continued, Class projects	T. M. Murali
Mar 3, 2025	Transfer learning enables predictions in network biology	Group discussion
Mar 5, 2025	Transfer learning enables predictions in network biology, continued	Group discussion
Mar 10, 2025	No class (Spring break)
Mar 12, 2025	No class (Spring break)
Mar 17, 2025	Introduction to single-cell RNA-seq data	T. M. Murali
Mar 19, 2025	scBERT as a large-scale pretrained deep language model for cell type annotation of single-cell RNA-seq data	Group discussion
Mar 24, 2025	scGPT: toward building a foundation model for single-cell multi-omics using generative AI	Group discussion
Mar 26, 2025	Large-scale foundation model on single-cell transcriptomics	Group discussion
Mar 31, 2025	Discussion of class projects
Apr 2, 2025	Class cancelled
Apr 7, 2025	scBERT as a large-scale pretrained deep language model for cell type annotation of single-cell RNA-seq data	Sina Heidari
Apr 9, 2025	scBERT as a large-scale pretrained deep language model for cell type annotation of single-cell RNA-seq data	Sina Heidari
Apr 14, 2025	scGPT: toward building a foundation model for single-cell multi-omics using generative AI	Kateland Sipe and Sanchit Kabra
Apr 16, 2025	scGPT: toward building a foundation model for single-cell multi-omics using generative AI	Kateland Sipe and Sanchit Kabra
Apr 21, 2025	Large-scale foundation model on single-cell transcriptomics	Kateland Sipe and Sanchit Kabra
Apr 23, 2025	Large-scale foundation model on single-cell transcriptomics	Kateland Sipe and Sanchit Kabra
Apr 28, 2025
Apr 30, 2025
May 5, 2025	Final project presentations
May 7, 2025	Final project presentations

Reading Notes

PathLinker (January 29, 2025 to February 12, 2025)

Assigned for January 29, 2025 to February 12, 2025: Pathways on demand: automated reconstruction of human signaling networks, Anna Ritz, Christopher L Poirel, Allison N Tegge, Nicholas Sharp, Kelsey Simmons, Allison Powell, Shiv D Kale, and T M Murali, npj Systems Biology and Applications, volume 2, Article number: 16002 (2016)
- Read the "Editorial Summary" on the right-hand side.
- Read the "Abstract".
  - As you read it, think about whether there is an algorithm that you already know that might do what PathLinker does, at least to the extent that you can glean its functionality from a few sentences.
  - You may not know what NetPath and KEGG are. Don't worry. I will explain.
  - You can ignore the sentences about CFTR.
  - There are some claims about PathLinker's success. As you read the paper and listen to my lectures, ask yourself if you are convinced that these claims are accurate.
- Now, it is time for the "Introduction". There may be many terms here and in the rest of the paper that are new and whose meaning is unclear. Not a problem. We will discuss all the important ones in class. As always, when in doubt, post a query on Piazza!
  - What does PathLinker do? Does Figure 1 help you figure out the answer to this question?
  - Think of "receptors" as starting points of a journey (sources) and transcription factors as destinations (targets). Do you now have a better sense of what PathLinker does? What algorithms might you use for this task?
  - There is something about "human" and "yeast". You can ask a question during class to clarify this difference.
  - The Introduction then states two properties that a good reconstruction algorithm must have. As a reader, it makes sense to have the expectation that PathLinker will have both these properties but other, competing algorithms will lack one or both. You will have to check whether this expectation works out as you read the paper.
  - Further down, the paper is fairly explicit about what PathLinker computes. What are the inputs to PathLinker? What is the output? How is the output related to the input? Does this problem make sense to you? Do you know of an algorithm that solves the problem?
  - Immediately, the authors claim PathLinker satisfies both properties! Can this be true?!
  - The next paragraph extols PathLinker's virtues. There is something about "recall" (there is another important concept called "precision"). Read about them on Wikipedia.
- Let us read "Results" next. I know, it is strange to read the results before knowing what algorithms the paper compares! But it turns out that this strange order (which is common in bioinformatics and biology journals) is actually quite effective and instructive in the case of this paper.
  - First, you notice that the paper compares PathLinker to many algorithms. We will discuss as many as possible in class. You can certainly follow the link to the Supplementary Materials to read more about the algorithms. But be warned that these descriptions are terse.
  - There is the notion of positives and negatives. Do you understand them? Or at least have an approximate notion of what they may mean?
  - Now comes the main set of results (those important words "precision" and "recall" again). We will discuss each of these results in detail.
  - First, Figure 2(a). What are the authors plotting? What will an ideal precision-recall curve look like? Do the claims made by the authors about the superiority of PathLinker hold true?
  - Now, let us look at Figure 2(b). How did the evaluation change between Figures 2(a) and 2(b)? The authors say that they ignored some edges. Do you understand what properties these edges have? What is the takeaway from comparing Figure 2(a) to 2(b)?
  - Try to explain to someone else in the class what Figure 2(c) is all about. Note that the authors are comparing only PathLinker and RWR by this point. Is ignoring the other algorithms warranted?
  - The next few plots (Figures 2e, 2f, 2g) are all about PathLinker. What are the different types of analysis being done? Are you convinced they are useful? Is there some evaluation you would like to make that is missing in this paper?
  - Now comes a complex figure on the Wnt pathway (Figure 3). The left-hand side of Figure 3a is actually part of another figure. Which one is it? What can you say about the algorithms named here?
  - What do the authors conclude from comparing the network layouts on the right of Figure 3(a) to the layout in Figure 3(b)?
  - Ignore the section "Exploring the role of CFTR in Wnt signaling" for now.
- The next section to read is "Materials and Methods".
  - First up is the "PathLinker" section. There is an explicit description of the inputs to the pathway reconstruction problem. Make sure this list matches what you have gleaned from reading this paper so far.
  - What range can the weight of an edge take? Is it sensible? Does this range have a bearing on any step of the algorithm?
  - How do you decipher "the $k$ highest scoring loopless paths that begin at any receptor in $S$ and terminate at any TR in $T$"? Key phrases are "highest scoring path", "loopless", and "any".
  - Did the description of PathLinker put the cart before the horse? Where is the score of a path defined? It comes in the next sentence.
  - Then the text describes how PathLinker modifies the original graph in two ways. First is the addition of an artificial source and sink. This change is relevant to the word "any" above.
  - Second is the transformation of the weight of each edge. The text explains the reason for this transformation. Of course, to understand the explanation, you have to read both the paper on Yen's algorithm and Supplementary Section S6. Do your best to understand both. I will give you a high-level, informal sketch of Yen's algorithm in class.
  - The final paragraph of "PathLinker" includes the sentence: "By construction, the interactions in the k shortest paths are a subset of those in the (k+1) shortest paths." What is the significance of this sentence?
  - Does the value of $k = 20,000$ seem very high?
- Read the "Datasets" section.
  - Scan citations 3, 4, 5, and 40 to determine what types of interactions the human protein network contains.
  - You can read citation 17 on how to compute edge weights, if you are interested. I will skip over this computation.
  - What can you learn about how the authors determine the receptors and TRs in each pathway?
- Let us move on to "Wnt pathway reconstructions".
  - What is the connection between the points marked with pinkish-red dots on the left of Figure 3a and the network layouts on the right of Figure 3a?
  - Which networks on the right of Figure 3a match your conception of a pathway?
  - Now look at Figure 3b. What do the numbers on each node mean? What do the node colours (green, pink, orange, gray) mean? What strategy did the authors use to assign these colours? Where will you find information on this strategy? I will ask detailed questions in class.
  - Why did the authors focus on RYK-CFTR-DAB2?
  - Now we will consider the section "Exploring the role of CFTR in Wnt signaling".
    - There are two types of experiments that measure the output of the Wnt pathway. What are these?
    - Each bar except the first in Figure 4a corresponds to a specific Wnt protein. Why does the $y$-axis start at 1? What can you conclude from this $y$-range? What is the main purpose of this experiment?
    - Several subsequent figures use siRNA-mediated silencing. siRNAs are a fascinating, Nobel-prize winning concept. I encourage you to read about them so that you know their main function in these experiments, at the very least.
    - What is the purpose of the experiments in Figure 4b? The "Anti" in the names of the different plots refers to antibodies used to measure the level of the respective protein. What do you glean from the statement "In the No Wnt control cells, cellular levels of β-catenin … increased as cellular protein levels of Dab2 decreased"?
    - The real tests of the predictions are in Figure 4c,d,e! Figures 4c/d are the counterpart of Figure 4a but with siRNAs for Ryk, CFTR, and Dab2. What should the white bars in Figure 4c/d correspond to in Figure 4a?
    - Read the text in the paper and see if you can make sense of the results in Figure 4c,d,e. There are two parts.
    - First, what happens in the absence of Dab2 or CFTR?
    - Second, what happens (differently from Dab2 and CFTR) in the absence of Ryk?
    - Now understand the model presented in Figure 5. Does it correspond to the results in Figure 4?
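
To make the graph modifications discussed under "Materials and Methods" concrete, here is a minimal sketch in Python. It is not the PathLinker implementation. It assumes that each edge weight lies in (0, 1], that the score of a path is the product of its edge weights, and that the edge transformation is cost = -log(weight), so that a minimum-cost path is a highest-scoring path. Instead of Yen's algorithm, it uses a simple best-first enumeration of loopless paths, which is only practical on toy graphs. The function name, the artificial node names, and the example network are all hypothetical.

```python
import heapq
import math

def k_best_paths(edges, sources, targets, k):
    """Return up to k loopless source-to-target paths, best score first.

    edges: dict mapping (u, v) -> weight, assumed to lie in (0, 1].
    Assumption: path score = product of edge weights.
    """
    SRC, SNK = "_source_", "_sink_"  # hypothetical names for the artificial nodes

    # Transform each weight to a cost so that sums of costs mirror products of weights.
    graph = {}
    for (u, v), w in edges.items():
        graph.setdefault(u, []).append((v, -math.log(w)))

    # Zero-cost edges from the artificial source to every receptor and from
    # every target to the artificial sink: this is what handles the word "any".
    graph[SRC] = [(s, 0.0) for s in sorted(sources)]
    for t in sorted(targets):
        graph.setdefault(t, []).append((SNK, 0.0))

    # Best-first search over partial loopless paths (a stand-in for Yen's algorithm).
    # Costs are non-negative, so complete paths pop in non-decreasing cost order.
    heap = [(0.0, [SRC])]
    results = []
    while heap and len(results) < k:
        cost, path = heapq.heappop(heap)
        node = path[-1]
        if node == SNK:
            # Strip the artificial endpoints and convert the cost back to a score.
            results.append((math.exp(-cost), path[1:-1]))
            continue
        for nxt, c in graph.get(node, []):
            if nxt not in path:  # keep the path loopless
                heapq.heappush(heap, (cost + c, path + [nxt]))
    return results

# Hypothetical toy network: two receptors (R1, R2) and one transcription factor (TF1).
toy_edges = {
    ("R1", "A"): 0.9,
    ("A", "TF1"): 0.8,
    ("R1", "TF1"): 0.5,
    ("R2", "A"): 0.6,
}
top2 = k_best_paths(toy_edges, {"R1", "R2"}, {"TF1"}, k=2)
top3 = k_best_paths(toy_edges, {"R1", "R2"}, {"TF1"}, k=3)
```

In this toy example, the first two paths returned for k = 3 are exactly the paths returned for k = 2, which is the nesting property quoted from the final paragraph of the "PathLinker" section.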

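The notes above return repeatedly to "precision" and "recall". As a small illustration, the sketch below computes one (precision, recall) point for every prefix of a ranked list of predictions. It assumes a simplified evaluation in which every predicted item that is not a known positive counts as a negative; the evaluation in the paper may define negatives differently. The function name and the toy edges are hypothetical.

```python
def precision_recall_points(ranked_items, positives):
    """Return one (precision, recall) pair for each prefix of the ranking.

    Assumption: any ranked item that is not in `positives` is a negative.
    """
    positives = set(positives)
    if not positives:
        raise ValueError("need at least one positive to define recall")
    points = []
    true_positives = 0
    for rank, item in enumerate(ranked_items, start=1):
        if item in positives:
            true_positives += 1
        precision = true_positives / rank            # fraction of predictions that are correct
        recall = true_positives / len(positives)     # fraction of positives recovered so far
        points.append((precision, recall))
    return points

# Hypothetical ranked edge predictions and a hypothetical set of known pathway edges.
ranked_edges = [("A", "B"), ("B", "C"), ("A", "D"), ("C", "E")]
known_edges = {("A", "B"), ("C", "E"), ("E", "F")}
points = precision_recall_points(ranked_edges, known_edges)
```

Plotting the recall values on the x-axis against the precision values on the y-axis gives a precision-recall curve for this ranking. Recall never reaches 1 in this toy example because one known edge is never predicted.
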
Author: "T. M. Murali"

Created: 2025-03-23 Sun 10:01