M.Tech Research Thesis {Colloquium}: CDS : “Scalable read alignment algorithm for cyclic pangenome graphs”


31 Jan 24    
2:00 PM - 3:00 PM

Event Type

M.Tech Research Thesis Colloquium

Speaker : Ms. Jyotshna Rajput

S.R. Number : 06-18-01-10-22-21-1-19943

Title :”Scalable read alignment algorithm for cyclic pangenome graphs”
Research Supervisor: Dr. Chirag Jain
Date & Time : January 31, 2024 (Wednesday) at 02:00 PM
Venue : # 102 CDS Seminar Hall


The growing availability of genome sequences of several species, including humans, has created the opportunity to utilize multiple reference genomes for bioinformatics analyses and improve the accuracy of genome resequencing workflows. Graph-based data structures are suitable for compactly representing multiple closely related reference genomes. Pangenome graphs use a directed graph format, where vertices are labeled with strings, and the individual reference genomes are represented as paths in the graph. Aligning sequences (reads) to pangenome graphs is fundamental for pangenome-based genome resequencing. The sequence-to-graph alignment problem seeks a walk in the graph that spells a sequence with minimum edit distance from the input sequence. However, exact algorithms for solving this problem are unlikely to scale due to the known hardness of this problem. Co-linear chaining is a well-studied and commonly used heuristic alternative for quickly aligning reads to a graph. However, the known chaining algorithms are restricted to directed acyclic graphs (DAGs) and are not trivially generalizable to cyclic graphs. This limitation must be addressed because pangenome graphs often contain cycles due to inversions, duplications, or copy number mutations within the reference genomes.

This thesis presents the first practical formulation and algorithm for co-linear chaining on cyclic pangenome graphs. Our algorithm builds upon the known chaining algorithms for DAGs. We propose a novel iterative algorithm to handle cycles and provide rigorous proof of correctness and runtime complexity. Our algorithm also exploits the domain-specific small-width property of pangenome graphs. The proposed optimizations enable our algorithm to scale to large human pangenome graphs. We implemented our algorithm in C++ and referred to it as PanAligner (https://github.com/at-cg/PanAligner). PanAligner is an end-to-end long-read aligner for pangenome graphs. We evaluated its speed and accuracy by aligning simulated long reads to a cyclic human pangenome graph comprising 95 haplotypes. We achieved superior read mapping accuracy using PanAligner compared to existing methods.