CS 3824: Introduction to Computational Biology and Bioinformatics

T. M. Murali

Fall 2022, 3:30pm-4:45pm, Tuesdays and Thursdays, NCB 110A

Office hours: 3pm-5pm, Mondays and Wednesdays, Torg 2160B

GTA: Monjuri Afrin Rumi

About the Course

What is the focus of this course?
Pre-requisites
Course structure
Grading
Introductory Videos

What is the focus of this course?

This course provides an introduction to computational biology and bioinformatics (CBB) through hands-on learning experiences. The emphasis is on problem solving in CBB. A breadth of topics may be covered including structural bioinformatics, modeling and simulation of biological networks, computational sequence analysis, algorithms for reconstructing phylogenies, computational systems biology and data mining algorithms. In Fall 2022, the focus will be on the field of network biology, which is broadly defined as network, machine learning, and data mining algorithms used in computational biology.

The goal of this course is to teach you computational methods that scientists use to understand the inner workings of the cell from a network-centric viewpoint. What types of processes take place within the cell? How do we measure and capture them experimentally? How do we represent these processes in the computer as networks? What information do we lose in this transformation? How can we then analyze these networks using techniques from computer science to answer fundamental biological questions? We will primarily use the tools of graph (network) theory but will also occasionally delve into machine learning and data mining. There is no textbook for the course. The main source materials are the lecture slides, the lectures themselves, and supplementary reading materials.

Pre-requisites

The course is open to students with Computer Science (CS), Computer Engineering (CPE), Data-Centric Computing (DCC), and Secure Computing (SC) majors and CS minors. You must have a grade of C or better in Data Structures and Algorithms (CS 3114). A knowledge of some advanced data structures and basic graph algorithms will help.

Course structure

The course will consist of lectures by the instructor and presentations by student groups. There will be 4–5 programming assignments and a group software project. The student presentations will be based upon the projects. There will be no exams.

Lectures

After an introduction to molecular biology and after rationalising the use of networks to represent cellular processes, lectures will cover a range of topics in the field such as gene function prediction, clustering, and reconstructing pathways. There is a detailed course schedule that I will fill in and update regularly over the course of the semester.

Assignments

A typical assignment will involve writing code to solve one to three related problems inspired by recent class lectures. These assignments may organically come about from class discussions. You will have about two weeks to complete each assignment. Your solution will include a link to your code on GitHub and a report on the results of your analysis, including figures, a discussion of the difficulties you faced and how you solved them, and observations on your results.

Software Projects

By mid-semester, I will describe several ideas for software projects. Students can form groups (two to three students) and work on projects together. Details on the projects themselves and the selection process will be forthcoming. Each group will make two presentations: the first one will describe the project and the proposed strategy while the second one will describe the final results.

Grading

Assignments and the group project will contribute to your final score as follows.

Activity	Weight
Assignments	60%
Group project	40%

The following scale displays how your score will map to final letter grades for the course:

Grade	Score
A	>93%
A-	>90%
B+	>85%
B	>80%
B-	>76%
C+	>72%
C	>68%
C-	>65%
D+	>62%
D	>59%
D-	>56%
F	>0%

Attendance

I will not be recording your attendance. However, I expect and very strongly encourage you to attend every class: the lectures will describe advanced graph algorithms that you may not have encountered. Attending lectures will prepare you to perform well on the assignments and the group project.

Late Policy

There is no specific late policy for assignments. Two weeks should be sufficient to complete each assignment. I will consider requests for late submissions on a case-by-case basis, while reserving the right to reduce points.

The Virginia Tech Honor Code

The Undergraduate Honor Code pledge that each member of the university community agrees to abide by states:

“As a Hokie, I will conduct myself with honor and integrity at all times. I will not lie, cheat, or steal, nor will I accept the actions of those who do.”

Students enrolled in this course are responsible for abiding by the Honor Code. A student who has doubts about how the Honor Code applies to any assignment is responsible for obtaining specific guidance from the course instructor before submitting the assignment for evaluation. Ignorance of the rules does not exclude any member of the University community from the requirements and expectations of the Honor Code.

Academic integrity expectations are the same for online classes as they are for in person classes. All assignments submitted shall be considered “graded work” and all aspects of your coursework are covered by the Honor Code. All projects and homework assignments are to be completed individually in this course. All written work must be written without help from other sources or people, except for the course instructor, the course TAs, and Student Success Center tutors.

Commission of any of the following acts shall constitute academic misconduct. This listing is not, however, exclusive of other acts that may reasonably be said to constitute academic misconduct. Clarification is provided for each definition with some examples of prohibited behaviors in the Undergraduate Honor Code Manual:

CHEATING
Cheating includes the intentional use of unauthorized materials, information, notes, study aids or other devices or materials in any academic exercise, or attempts thereof.
PLAGIARISM
Plagiarism includes the copying of the language, structure, programming, computer code, ideas, and/or thoughts of another and passing off the same as one's own original work, or attempts thereof.
FALSIFICATION
Falsification includes the statement of any untruth, either verbally or in writing, with respect to any element of one's academic work, or attempts thereof.
FABRICATION
Fabrication includes making up data and results, and recording or reporting them, or submitting fabricated documents, or attempts thereof.
MULTIPLE SUBMISSION
Multiple submission involves the submission for credit of substantial portions of any work (including oral reports) previously submitted for credit at any academic institution, or attempts thereof.
COMPLICITY
Complicity includes intentionally helping another to engage in an act of academic misconduct, or attempts thereof.

Note that all electronic work submitted for this course is archived and subjected to automatic plagiarism detection and cheating analysis!

If you have questions or are unclear about what constitutes academic misconduct on an assignment, please speak with your instructor. We take the Honor Code very seriously in this course. The normal sanction we recommend for a violation of the Honor Code is an F* sanction as your final course grade. The F represents failure in the course. The “*” identifies a student who has failed to uphold the values of academic integrity at Virginia Tech. A student who receives a sanction of F* as their final course grade shall have it documented on their transcript with the notation “FAILURE DUE TO ACADEMIC HONOR CODE VIOLATION.” You are required to complete an education program administered by the Honor System in order to have the “*” and notation “FAILURE DUE TO ACADEMIC HONOR CODE VIOLATION” removed from your transcript. The “F” however would be permanently on your transcript.

Virginia Tech Community Wellness Commitment

Virginia Tech is committed to protecting the health and safety of all members of its community. By participating in this class, all students agree to abide by the Virginia Tech Wellness principles (https://ready.vt.edu/public-health-guidelines.html#wellness). Be respectful of the well-being of others, as well as individual choices about masking. Students who prefer to wear masks in class are always welcome to do so.

Student Well-Being Support

Supporting the mental health and well-being of students in our class is of high priority to us and Virginia Tech. If you are feeling overwhelmed academically, having trouble functioning, or are worried about a friend, please reach out to any of the following offices:

Cook Counseling:
- 540-231-6557 to schedule an appointment and/or 24/7 crisis support
- ucc.vt.edu for more information
Dean of Students Office:
- 540-231-3787 for general advice
- 540-231-6411 for after-hours crisis
- dos.vt.edu for more information
Hokie Wellness:
- hokiewellness.vt.edu for more information about health and wellness workshops and consultations
Services for Students with Disabilities (SSD):
- 540-231-3788 or ssd.vt.edu for more information about accommodations and other disability-related supports
- Any student who has been confirmed by the University as having special needs for learning must notify me by email in the first week of the course--work through SSD so they can send electronic notification of any accommodations.

Student Success Center:
- The Student Success Center helps students develop the skills needed to accomplish their academic goals and become self-directed learners. Their free services include individual and group tutoring, peer academic coaching, a Seminar Series on Academic Success, and more. Students can book appointments through Navigate. For instructions and more information, please visit www.studentsuccess.vt.edu.

For a full listing of campus resources check out well-being.vt.edu.

Please also feel free to speak with any of the instructors. We will make an effort to work with you; we care about you.

Academic Accommodations

Virginia Tech welcomes students with disabilities into the University’s educational programs. The University promotes efforts to provide equal access and a culture of inclusion without altering the essential elements of coursework. If you anticipate or experience academic barriers that may be due to disability, including but not limited to, chronic medical conditions, Deaf or hard of hearing, learning disability, mental health, or vision impairment, please contact the Services for Students with Disabilities (SSD) (540-231-3788, ssd@vt.edu, or visit ssd.vt.edu). If you have an SSD accommodation letter, please meet with your instructor privately during office hours as early in the semester as possible to discuss implementing your accommodations. You must give reasonable notice to implement your accommodations, which is generally 5 business days and 10 business days for final exams.

Technical support

For technical support regarding any problems with Canvas, Zoom, or e-mail, please contact 4Help. For questions related to programming assignments, lectures, or projects, please ask on Piazza, where the instructor or the TA can provide answers.

Canvas privacy policy: http://www.canvaslms.com/policies/intl-privacy.

Schedule

Table 1: **Schedule (subject to change throughout the semester).** Links in the "Reading" column point to specific chapters to be discussed in each class. Links in "Presenter" column point to the slides for the lecture.
Topic	Date	Reading
Introduction to Computational Biology and Bioinformatics	Aug 23, 2022
Introduction, continued; Introduction to Graphs	Aug 25, 2022
Introduction to Graphs, continued	Aug 30, 2022
Pathways on demand: automated reconstruction of human signaling networks	Sep 1, 2022
Pathways on demand, continued	Sep 6, 2022
Pathways on demand, continued	Sep 8, 2022
\(k\) shortest loopless paths and Pathways on demand, continued	Sep 13, 2022
Network modules	Sep 15, 2022
Network modules, continued	Sep 20, 2022
Class cancelled by university	Sep 22, 2022
Louvain and Leiden algorithms	Sep 27, 2022
Louvain and Leiden algorithms, continued	Sep 29, 2022
Identifying human interactors of SARS-CoV-2 proteins	Oct 4, 2022
Identifying human interactors of SARS-CoV-2 proteins, continued	Oct 6, 2022
Identifying human interactors of SARS-CoV-2 proteins, continued	Oct 11, 2022
Gene function prediction	Oct 13, 2022
Gene function prediction, continued	Oct 18, 2022
Discussion of Assignment 3 and Group projects	Oct 20, 2022
Team meetings	Oct 25, 2022
Team meetings	Oct 27, 2022
Team meetings	Nov 1, 2022
Team meetings	Nov 3, 2022
Team meetings	Nov 8, 2022
Team meetings	Nov 10, 2022
Class cancelled	Nov 15, 2022
Team meetings	Nov 17, 2022
Thanksgiving break, No class	Nov 22, 2022
Thanksgiving break, No class	Nov 24, 2022
Team meetings	Nov 29, 2022
Class cancelled	Dec 1, 2022
Team project presentations	Dec 6, 2022

Introductory Videos

These videos discuss general molecular and cell biology.

The Cell (7' 21"): a general overview of cell structure from Nucleus Medical Media
The Structure of DNA (5' 59")
What is DNA and How Does it Work? (5' 24")
What is a Gene? (4' 56")

The DNA learning center has created several interesting videos.
- Cell signals (I played this video in class)
- The Central Dogma (several videos)

Papers

PathLinker (August 30, 2022)
Yen's algorithm (September 13, 2022)
Modularity and Louvain's algorithm (September 15 and 20, 2022)
Leiden's algorithm (September 27 and 29, 2022)
Gene function prediction (October 13 and 18, 2022)

Assigned for August 30, 2022. Pathways on demand: automated reconstruction of human signaling networks, Anna Ritz, Christopher L Poirel, Allison N Tegge, Nicholas Sharp, Kelsey Simmons, Allison Powell, Shiv D Kale, and T M Murali, npj Systems Biology and Applications, volume 2, Article number: 16002 (2016)
- Read the "Editorial Summary" on the right hand side.
- Read the "Abstract".
  - As you read it, think about whether there is an algorithm that you already know that might do what PathLinker does, at least to the extent that you can glean its functionality from a few sentences.
  - You may not know what NetPath and KEGG are. Don't worry. I will explain.
  - You can ignore the sentences about CFTR.
  - There are some claims about PathLinker's success. As you read the paper and listen to my lecture, ask yourself if you are convinced that these claims are accurate.
- Now, it is time for the "Introduction". There may be many terms here and in the rest of the paper that are new and whose meaning is unclear. Not a problem. We will discuss all the important ones in class.
  - What does PathLinker do? Does Figure 1 help you figure out the answer to this question?
  - Think of "receptors" as starting points of a journey (sources) and transcription factors as destinations (targets). Do you now have a better sense of what PathLinker does? What algorithms might you use for this task?
  - There is something about "human" and "yeast". You can ask a question during class to clarify this difference.
  - The Introduction then states two properties that a good reconstruction algorithm must have. As a reader, it makes sense to have the expectation that PathLinker will have both these properties but other, competing algorithms will lack one or both. You will have to check whether this expectation works out.
  - Further down, the paper is fairly explicit about what PathLinker computes. What are the inputs to PathLinker? What is the output? How is the output related to the input? Does this problem make sense to you? Do you know of an algorithm that solves the problem?
  - Immediately, the authors claim PathLinker satisfies both properties! Can this be true?!
  - The next paragraph extols PathLinker's virtues. There is something about "recall" (there is another important concept called "precision"). Read about them on Wikipedia.
- Let us read "Results" next. I know, it is strange to read the results before knowing what algorithms the paper compares! But it turns out that this strange order (which is common in bioinformatics and biology journals) is actually quite effective and instructive in the case of this paper.
  - First, you notice that the paper compares PathLinker to many algorithms. We will discuss as many as possible in class. You can certainly follow the link to the Supplementary Materials to read more about the algorithms. But be warned that these descriptions are terse.
  - There is the notion of positives and negatives. Do you understand them? Or at least have an approximate notion of what they may mean?
  - Now comes the main set of results (those important words "precision" and "recall" again). We will discuss each of these results in detail.
  - First, Figure 2(a). What are the authors plotting? What will an ideal precision-recall curve look like? Do the claims made by the authors about the superiority of PathLinker hold true?
  - Now, let us look at Figure 2(b). How did the evaluation change between Figures 2(a) and 2(b)? The authors say that they ignored some edges. Do you understand what properties these edges have? What is the takeaway from comparing Figure 2(a) to 2(b)?
  - Try to explain to someone else in the class what Figure 2(c) is all about. Note that the authors are comparing only PathLinker and RWR by this point. Is ignoring the other algorithms warranted?
  - The next few plots (Figures 2e, 2f, 2g) are all about PathLinker. What are the different types of analysis being done? Are you convinced they are useful? Is there some evaluation you would like to make that is missing from this paper?
  - Now comes a complex figure on the Wnt pathway (Figure 3). The left-hand side of Figure 3a is actually part of another figure. Which one is it? What can you say about the algorithms named here?
  - What do the authors conclude from comparing the network layouts in the right of Figure 3(a) to the layout in Figure 3(b)?
  - Ignore the section "Exploring the role of CFTR in Wnt signaling"
- The next section to read is "Methods". I will have reading notes for you later.
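To make the questions about precision and recall concrete, here is a minimal sketch of how a precision-recall curve could be computed from a ranked list of predicted edges. The function name, the toy edges, and the set of positives are all hypothetical. The sketch also assumes that every predicted edge outside the positive set counts as a false positive, which is simpler than the evaluation in the paper.

```python
def precision_recall_curve(ranked_edges, positives):
    """Return one (precision, recall) point per prefix of the ranking."""
    points = []
    true_positives = 0
    for rank, edge in enumerate(ranked_edges, start=1):
        if edge in positives:
            true_positives += 1
        precision = true_positives / rank           # fraction of predictions that are correct
        recall = true_positives / len(positives)    # fraction of positives recovered so far
        points.append((precision, recall))
    return points


# Toy data (hypothetical): edges ranked from most to least confident.
ranked = [("A", "B"), ("B", "C"), ("X", "Y"), ("C", "D")]
positives = {("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")}

curve = precision_recall_curve(ranked, positives)
for precision, recall in curve:
    print(f"precision={precision:.2f} recall={recall:.2f}")
```

An ideal ranking lists every positive before any other edge, so precision stays at 1 until recall reaches 1.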
Assigned for September 13, 2022. Yen's algorithm for computing the \(k\) shortest, loopless (simple) paths from a source to a target in a directed graph.
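The following is a minimal sketch of the idea behind Yen's algorithm, written for a small graph stored as a dictionary of dictionaries with non-negative edge weights. It is not the PathLinker implementation. The function names, the graph representation, and the toy graph are assumptions made for illustration.

```python
import heapq


def dijkstra(graph, source, target, removed_edges=frozenset(), removed_nodes=frozenset()):
    """Return (cost, path) for a shortest path, or None if target is unreachable."""
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == target:
            path = [u]
            while path[-1] != source:
                path.append(prev[path[-1]])
            return d, path[::-1]
        for v, w in graph.get(u, {}).items():
            if v in removed_nodes or (u, v) in removed_edges:
                continue
            if v not in dist or d + w < dist[v]:
                dist[v] = d + w
                prev[v] = u
                heapq.heappush(heap, (d + w, v))
    return None


def yen_k_shortest_paths(graph, source, target, k):
    """Return up to k loopless paths as (cost, path), in order of increasing cost."""
    first = dijkstra(graph, source, target)
    if first is None:
        return []
    accepted = [first]
    candidates = []  # min-heap of (cost, path)
    while len(accepted) < k:
        _, last_path = accepted[-1]
        for i in range(len(last_path) - 1):
            spur_node = last_path[i]
            root = last_path[:i + 1]
            # Block the next edge of every accepted path that shares this root,
            # so the spur path must deviate from all of them.
            removed_edges = {
                (p[i], p[i + 1])
                for _, p in accepted
                if len(p) > i + 1 and p[:i + 1] == root
            }
            # Block the root nodes (except the spur node) to keep the path loopless.
            removed_nodes = set(root[:-1])
            spur = dijkstra(graph, spur_node, target, removed_edges, removed_nodes)
            if spur is None:
                continue
            spur_cost, spur_path = spur
            root_cost = sum(graph[a][b] for a, b in zip(root, root[1:]))
            candidate = (root_cost + spur_cost, root[:-1] + spur_path)
            if candidate not in candidates and candidate not in accepted:
                heapq.heappush(candidates, candidate)
        if not candidates:
            break
        accepted.append(heapq.heappop(candidates))
    return accepted


# Toy directed graph (hypothetical): graph[u][v] is the weight of edge u -> v.
graph = {
    "C": {"D": 3, "E": 2},
    "D": {"F": 4},
    "E": {"D": 1, "F": 2, "G": 3},
    "F": {"G": 2, "H": 1},
    "G": {"H": 2},
    "H": {},
}
paths = yen_k_shortest_paths(graph, "C", "H", 3)
for cost, path in paths:
    print(cost, " -> ".join(path))
```

Several paths in this toy graph tie at cost 8, so which one is reported third depends on how ties are broken in the candidate heap.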
Assigned for September 15 and 20, 2022
- Read the Wikipedia page on Modularity.
  - After reading the "Definition" section, see if you can write down a formula for the modularity.
  - Read "Expected Number of Edges Between Nodes" very carefully. It is somewhat complex but critical to this definition and the "magic" that makes it work well.
  - Make sure you see how the definition changes from equation (3) to (4), i.e., from two to more than two communities.
  - You can skip the section on "Matrix formulation"
  - You can skip the sections on "Resolution limit" and "Multiresolution methods" for now.
- Now read the Wikipedia page on the so-called Louvain algorithm
  - The connection to modularity should be apparent. In fact, the "Algorithm" section spends the first part defining modularity.
  - Do you understand the difference between \(Q\) and \(Q_c\)?
  - Now read the formula for \(\Delta Q\) very carefully and try to understand at least the terms in it.
  - In class, we will derive a much simpler formula for \(\Delta Q\).
- The next reading target is the paper From Louvain to Leiden: guaranteeing well-connected communities
  - Jump right to the supplementary text for this paper.
  - Read Algorithm A.1 on page 3. It contains very nice pseudocode for the Louvain algorithm.
  - There are three functions. Louvain should be easy to understand.
  - In MoveNodes, what does the loop in lines 15-18 do? Compare to the Wikipedia page on the algorithm.
  - What could the \({\cal H}\) symbol represent?
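As a companion to the modularity reading, here is a minimal sketch that computes modularity for an undirected, unweighted graph using the per-community form Q = sum over communities c of [e_c / m - (d_c / 2m)^2], where m is the number of edges, e_c is the number of edges inside c, and d_c is the total degree of the nodes in c. The function name and toy graph are hypothetical, and the form we derive in class may look different.

```python
from collections import defaultdict


def modularity(edges, community):
    """Modularity of a partition; community maps each node to a community label."""
    m = len(edges)
    internal = defaultdict(int)    # e_c: edges with both endpoints in community c
    degree_sum = defaultdict(int)  # d_c: sum of degrees of nodes in community c
    for u, v in edges:
        degree_sum[community[u]] += 1
        degree_sum[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)


# Toy graph (hypothetical): two triangles joined by the single edge (3, 4).
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
two_communities = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b", 6: "b"}
one_community = {node: "all" for node in range(1, 7)}

print(modularity(edges, two_communities))
print(modularity(edges, one_community))
```

Comparing the two printed values shows why placing all nodes in a single community is never rewarded by this definition.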
Assigned for September 27 and 29, 2022
- Read the paper on the Leiden algorithm.
  - Right at the beginning, there is equation (1) for modularity. Does it differ from the equation we developed? What is the role of the parameter \(\gamma\)? What is the effect of using a very small value of \(\gamma\)? What is the effect of using a very large value of \(\gamma\)?
  - This paper uses a different definition, the Constant Potts Model (CPM). Read the paragraph that follows equation (2) to get an understanding of what this formulation is trying to achieve.
  - What is the value of \({\cal H}\) if we put all nodes in one community and set \(\gamma = 1\)?
- Read the section on the Louvain algorithm to refamiliarise yourself with it.
  - What is a "badly connected community"? The paper elucidates the situation in Figure 2 in some detail to explain how the Louvain algorithm may find a disconnected community. But what does the paper say about badly connected communities?
  - Is it clear that this problem with the Louvain algorithm is different from the resolution limit of modularity?
  - What are the guarantees that the Louvain algorithm provides?
- Now read about the Leiden algorithm.
  - What are the key ideas used by the Leiden algorithm (some are old, others are new)?
  - Figure 3 is a high-level illustration of the steps of the algorithm. Study it carefully.
  - What does the word "refinement" mean? Clue: "Communities in \({\cal P}\) may be split into multiple subcommunities in \({\cal P}_\mathrm{refined}\)."
  - Do your best to understand how the refined partition is created. The best thing to do is to look at the pseudocode for the algorithm on page 4 of the supplementary text for this paper. We will go over this pseudocode in class.
  - In what way is the "local moving" phase different between the Louvain and the Leiden algorithm?
  - Read about the guarantees provided by the Leiden algorithm. Where will you find the definitions of these concepts?
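To experiment with the effect of \(\gamma\), here is a minimal sketch of the Constant Potts Model quality for an undirected, unweighted graph, assuming the form \({\cal H} = \sum_c [e_c - \gamma \binom{n_c}{2}]\), where \(e_c\) is the number of edges inside community \(c\) and \(n_c\) is its number of nodes. The function name and toy graph are hypothetical. Check the formula against equation (2) in the paper before relying on it.

```python
from collections import Counter
from math import comb


def cpm_quality(edges, community, gamma):
    """CPM quality of a partition; community maps each node to a community label."""
    sizes = Counter(community.values())  # n_c for each community
    internal = Counter(                  # e_c for each community
        community[u] for u, v in edges if community[u] == community[v]
    )
    return sum(internal[c] - gamma * comb(n_c, 2) for c, n_c in sizes.items())


# Toy graph (hypothetical): two triangles joined by the single edge (3, 4).
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
two_communities = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b", 6: "b"}
one_community = {node: "all" for node in range(1, 7)}

for gamma in (0.1, 0.5, 0.9):
    print(gamma,
          cpm_quality(edges, two_communities, gamma),
          cpm_quality(edges, one_community, gamma))
```

Running the loop with very small and very large values of \(\gamma\) is one way to check your answers to the questions above.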
Assigned for October 13 and 18, 2022
- This reading assignment is on gene function prediction. There are two distinct concepts we have to learn. One is the notion of the Gene Ontology (GO). The other is the framework for gene function prediction.
- Let us start with the GO. You will read a few chapters from "The Gene Ontology Handbook". The text is not on computer science. It describes a basic concept in biology but it should not be difficult to read. First, read "Chapter 2: The Gene Ontology and the Meaning of Biological Function."
  - Section 1 is somewhat philosophical in nature, distinguishing as it does between the "activity that is a function" and the "entity that performs the function." You can skim this section.
  - Section 2 is important. You know the terms "gene" and "gene product" ("protein" for all intents and purposes). You should also know "complex". "Activity" is a new concept. Try to understand the difference between a function ("professor") and an annotation ("T. M. Murali is a professor").
  - The three aspects (sometimes called namespaces) are critical to know about. Read the rest of Section 2 as carefully as you can.
  - Section 3 is also somewhat philosophical, which you can skim.
- Next, read "Chapter 3: Primer on the Gene Ontology", which will reinforce and elaborate upon the concepts from Chapter 2.
  - The Introduction provides a more direct, practical motivation for the creation of the GO than the previous chapter.
  - Figure 1 is a nice illustration. What type of graph is shown here? Note the multiple types of relationships between terms. Does the "is_a" relationship map to any concept you know in computer science?
  - The top of page 27 gives you an idea of the scope of the GO and how much it can change.
  - Sections 4 and 5 (especially 5) are important since you will parse files containing annotations in the next assignment. Read Figure 2 and the corresponding text in the chapter carefully.
  - Section 5.3 on Evidence Codes is especially important. Ask me questions in class or on Piazza regarding the different types of evidence codes. Figure 3 is very helpful but has already been superseded by more recent developments.
- You can skip Chapter 4.
- Chapter 5 is of interest if you are keen to explore on your own the types of methods that use sequence similarity to assign gene function. Take a look at Figure 1, though, since it is central to what we will discuss in class.
- Chapter 6 is also a "read-on-your-own" exercise.
- Now, turn your attention to an 18-year-old paper: Whole-genome annotation by using evidence integration in functional-linkage networks
  - The introduction is passé by now but might still be instructive for someone new to the problem.
  - Read "Methods" very carefully.
    - How many functional linkage networks (FLNs) does the approach construct?
    - What is the difference between different FLNs? Understand the notion of a node state carefully and relate it to the structure of the GO.
    - The energy function is similar to functions we have seen before. Can you make a connection? Is the function linear? Quadratic? Before you read further, how will you minimize it? Can we use the ideas we have discussed before, e.g., in the case of the Regularized Laplacian? Why or why not?
    - The paper proposes a local update procedure. Does it make sense? Does it remind you of a similar approach we have seen in class?
    - On what basis can the authors make the bold claim "For networks with unit cost edge weights, the maximum number of updates is bounded by \(2n^2\)"? How do the authors even know that this procedure will terminate?
  - Read "Results" carefully as well. We will discuss them in some detail. You can ignore the text dealing with support in the literature for the predictions made by the algorithm.
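To help you think about the local update procedure and why it terminates, here is a minimal sketch of a Hopfield-style update. It assumes an energy of the form E = -(sum over edges (u, v) of w_uv * s_u * s_v), node states of +1 (annotated with the function), -1 (annotated with a different function), and 0 (unknown), and that annotated nodes stay fixed. These are assumptions made for illustration, so compare them against the paper's "Methods" section. All names and the toy network are hypothetical.

```python
def energy(weights, state):
    """E = -(sum over edges (u, v) of w_uv * s_u * s_v)."""
    return -sum(w * state[u] * state[v] for (u, v), w in weights.items())


def local_update(weights, state, fixed):
    """Set each free node to the sign of its weighted neighbor sum until nothing changes."""
    neighbors = {}
    for (u, v), w in weights.items():
        neighbors.setdefault(u, []).append((v, w))
        neighbors.setdefault(v, []).append((u, w))
    state = dict(state)  # work on a copy
    changed = True
    while changed:
        changed = False
        for node in sorted(state):
            if node in fixed:
                continue
            field = sum(w * state[nbr] for nbr, w in neighbors.get(node, []))
            # Keep the current state when the neighbors are perfectly balanced.
            new = 1 if field > 0 else -1 if field < 0 else state[node]
            if new != state[node]:
                state[node] = new  # every such change strictly lowers the energy
                changed = True
    return state


# Toy functional linkage network (hypothetical): undirected weighted edges.
weights = {("a", "c"): 1, ("b", "c"): 1, ("c", "d"): 2, ("d", "e"): 1}
initial = {"a": 1, "b": 1, "c": 0, "d": 0, "e": -1}
final = local_update(weights, initial, fixed={"a", "b", "e"})

print(final)
print(energy(weights, initial), energy(weights, final))
```

Each change to a node's state strictly lowers the energy, and there are finitely many states, which is one way to see that the loop must stop.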

Assignments

Assignment 1, released on September 8, 2022, due by 11:59pm on September 22, 2022
Assignment 2, released on September 29, 2022, due by 11:59pm on October 17, 2022
Assignment 3, released on October 18, 2022, due by 11:59pm on October 25, 2022