DEPARTMENT OF COMPUTATIONAL AND DATA SCIENCES
M.Tech Research Thesis Colloquium
Speaker : Mr. Parvesh Barak
S.R. Number : 06-18-01-10-22-23-1-23181
Title : “An error correction algorithm for long-read sequencing”
Research Supervisor :Dr. Chirag Jain
Date & Time : July 02, 2025 (Wednesday) at 02:00 PM
Venue : The Thesis Colloquium will be held on HYBRID Mode
# 102 CDS Seminar Hall /MICROSOFT TEAMS.
Please click this link to join the Thesis Colloquium:
ABSTRACT
The latest, third-generation, long-read sequencing technologies have transformed genomics by generating long and sufficiently accurate sequences that bridge most repeats of an organism’s genome. A single sequencing run can routinely generate terabytes of data, but these instruments also introduce errors during sequencing. The average sequencing error rate ranges from 0.1%-4%, depending on the choice of instrument and protocol. These errors must be corrected while preserving the correctly sequenced bases to reconstruct the genome sequence. In diploid organisms like humans, each individual inherits two copies of the genome, one from each parent. These two copies, called haplotypes, are nearly identical but differ at about 0.1% of genomic positions. Knowing these genetic differences between the two haplotypes is important for biological applications. So, this introduces a challenge of correcting noise (sequencing errors) while preserving the true signal (0.1% genetic variation between the haplotypes). During the last few years, several haplotype-aware error correction methods have been developed. However, existing methods are based on either ad-hoc heuristics or deep learning approaches, which require substantial computational resources and lack formal guarantees.
This thesis presents the first rigorous formulation for this problem. Our approach builds on the minimum error correction framework commonly used in read phasing algorithms. We prove that the proposed formulation for error correction of reads in the de novo setting, i.e., without using a reference genome, is NP-hard. To make our exact algorithm scale to large datasets, we design practical heuristics. Experiments using publicly-available sequencing datasets from human and plant genomes show that our method achieves accuracy comparable to state-of-the-art tools. A Rust implementation of our algorithm is freely available at https://github.com/at-cg/HALE.
ALL ARE WELCOME