Abstract (EN):
RNA-Seq is a Next-Generation Sequencing (NGS) protocol for sequencing the messenger RNA in a cell; it generates millions of short sequence fragments, called reads, in a single run. These reads can be used to measure levels of gene expression and to identify novel splice variants of genes. One of the critical steps in an RNA-Seq experiment is mapping NGS reads to the reference genome. This task is challenging because an RNA-Seq read can span more than one exon in the genome. In the last decade, tools for RNA-Seq alignment have emerged, but most of them run in two phases. First, the pipeline maps only the reads that have a direct match in the reference, and the remaining reads are set aside as initially unmapped. Then, it uses heuristic-based approaches, clustering, or even annotations to decide where to align the latter. This work presents SpliceTAPyR, an efficient computational solution to the problem of transcriptome alignment. It identifies signals of splice junctions and relies on compressed full-text indexing methods and succinct data structures to efficiently align RNA-Seq reads in a single phase. In this way, it achieves accuracy equal to or better than that of other tools while using considerably less memory and time than the most competitive tools.
Language:
English
Type (Professor's evaluation):
Scientific
No. of pages:
14