Frameshifting Algorithm For NGS Data
12.2 years ago
jergosh ▴ 70

I'm dealing with sequences from 39 different S. cerevisiae strains obtained from low-coverage NGS (from this paper [1]). When I try to align homologous genes, I get many small gaps in places where there are indels, primarily in low-complexity regions. Typically, there's an insertion in one or two strains within a stretch of As (or, conversely, a deletion in some strains). Since it's unlikely that frameshift mutations would be so widespread, I'm assuming these are due to read errors.

All the multiple alignment programs I tried assume that all bases are 'true', i.e. they just insert a gap whenever the situation I described occurs. This causes some of the sequences to go out of frame. I would like to calculate codon bias measures for these sequences, so it's vital that they are all in the same frame.

Is there multiple alignment software or a frameshifting algorithm that can remove such spurious indels?


next-gen msa • 2.3k views
12.2 years ago

I don't know of any software that does this, but it shouldn't be too hard to implement. You could try writing it with seqinR in R, which you could then also use to compute codon usage bias.

A quick idea, based on the fact that you are only interested in the coding parts: working on the aligned sequences, add or remove a nucleotide wherever one sequence is clearly different from the others at the amino acid level.
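To make the idea concrete, here is a minimal brute-force sketch in Python (not seqinR). It assumes you have a trusted, in-frame reference protein for the gene and that the query carries at most one spurious single-base indel. The names `translate`, `identity` and `repair_frame` are hypothetical helpers written for this illustration, not part of any library. Each candidate edit (delete one base, or insert an `N`) is translated and compared position by position to the reference protein, and the best-scoring candidate is kept.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna):
    """Translate in frame 0; codons with N or other unknown bases become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def identity(protein, ref_protein):
    """Number of positions where the two proteins carry the same residue."""
    return sum(a == b for a, b in zip(protein, ref_protein))


def repair_frame(dna, ref_protein):
    """Try every single-base deletion and every single 'N' insertion.

    Returns the candidate whose translation agrees best with ref_protein,
    or the input unchanged if no edit improves the agreement.
    """
    best, best_score = dna, identity(translate(dna), ref_protein)
    for i in range(len(dna) + 1):
        candidates = [dna[:i] + "N" + dna[i:]]           # undo a deletion
        if i < len(dna):
            candidates.append(dna[:i] + dna[i + 1:])     # undo an insertion
        for cand in candidates:
            score = identity(translate(cand), ref_protein)
            if score > best_score:
                best, best_score = cand, score
    return best


# Toy example: one extra A inside a run of As.
reference = "ATGAAAAAAGCTGATTGGTAA"
query = "ATGAAAAAAAGCTGATTGGTAA"
print(repair_frame(query, translate(reference)))
```

This is quadratic in sequence length and handles only one indel per gene, so for real data you would probably want to restrict the search to homopolymer runs and repeat it until the score stops improving.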


I couldn't find a ready-made tool to do this. I'm going to roll my own script to remove the gaps, but that's likely not the 'right' way to do it. An assembly tool that takes conservation and read quality scores into account would be a more correct way to do this...
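For what it's worth, a gap-removal script along those lines could look roughly like the Python sketch below. It is only an illustration under stated assumptions: the input is a list of equal-length aligned sequences with `-` as the gap character, and an indel supported by at most a chosen fraction of strains is treated as a read error. The function name `repair_alignment` and the `max_minor_fraction` threshold are made up for this example. It ignores read quality entirely, so it is the quick fix rather than the 'right' way.

```python
def repair_alignment(aligned, max_minor_fraction=0.25):
    """Remove indels that are supported by only a small minority of sequences.

    aligned: list of equal-length strings, '-' marks a gap.
    A column where only a minority has a base is treated as a spurious
    insertion and dropped. A column where only a minority has a gap is
    treated as a spurious deletion and the gap is filled with 'N'.
    All other columns are kept unchanged.
    """
    n = len(aligned)
    rows = [[] for _ in aligned]
    for column in zip(*aligned):
        n_gap = column.count("-")
        n_base = n - n_gap
        if n_base == 0:
            continue                      # column is all gaps
        if n_gap and n_base / n <= max_minor_fraction:
            continue                      # minority insertion: drop column
        fill = n_gap and n_gap / n <= max_minor_fraction
        for row, ch in zip(rows, column):
            row.append("N" if fill and ch == "-" else ch)
    return ["".join(row) for row in rows]


# Toy example: one strain has an extra A in a run of As.
alignment = ["ATGAAA-GCT"] * 4 + ["ATGAAAAGCT"]
print(repair_alignment(alignment))
```

Filling with `N` rather than guessing the base keeps every sequence in frame while making the uncertain codon easy to exclude from the codon bias calculation.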

