As a final project in my undergraduate studies, I researched a new method to assemble DNA. The data was a collection of ‘reads’ - strings of 100 characters, each symbolising 1 of 4 nucleic acids.
The reads are randomly sampled from the DNA, making it impossible to know which ones came from the same area in the genome.
The common approach is to reduce this problem to the Hamiltonian path problem - each read is a node in a directed graph, while the edges represented the overlapping areas between them. A Hamiltonian path represents the assembled DNA string. Since this problem is np-complete, different heuristics are applied to find the path.
Together with Idan Goldman, I tested a method called Bin Assembly, where the chemical process of sampling the reads also ‘colors’, and allows splitting them into several areas of DNA. We treated the different bins as separate strings, and assembled inner bin contigs. Later we created a directed graph of contigs, and used the Bron Kerbosch Algorithm to find cliques in the graph, concluding that cliques suggest the bins came from a similar locus in the target genome.
The data were taken from the first 500,000 base pairs of the E-coli organism’s genome.
Coverage was 100.
Total number of base pairs in the output: 499,592
Max length of contig assembled: 378,179
N50 length: 378,179
Numbers of assembled contigs: 2
The problem of assembling DNA can be reduced to the Hamiltonian path problem in graph theory.
The project required working with Researchers in Geneva.
The results showed excellent performance with assembling the genome of the E-coli organism.