Question: Rna-Seq Split Gap Size Cutoff?
Alex Reynolds (30k), Seattle, WA USA, wrote 6.9 years ago:

Is there a size cutoff consideration for gaps in paired-end RNA-seq data in BAM format?

For instance, if a read's CIGAR string in the BAM file is 50M422N26M, is the read supposed to be kept intact across the 50M and 26M regions because the 422N gap is too large? (Whereas, contrariwise, if the CIGAR string has a N segment smaller than some value, then the read is indeed split across exon boundaries into two pieces?)

What other data in the BAM read would indicate that a split operation would not be appropriate, when an N gap is listed in the CIGAR string?

rnaseq bam • 2.2k views
modified 12 weeks ago by Biostar • written 6.9 years ago by Alex Reynolds (30k)

Don't you mean, "...Whereas, contrariwise, if the CIGAR string has a N segment smaller than some value, then the read contains a deletion"? In general, if the width of an N operation is quite small (the threshold likely depends on species and sequencing technology), then the gap is more likely to be a deletion than an intron.
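
To make the N operation concrete, here is a minimal sketch in plain Python of pulling the N gaps out of a CIGAR string such as 50M422N26M. The function names `parse_cigar` and `n_gaps` are hypothetical, and the start position is assumed to be 0-based (the SAM text format stores POS as 1-based, so subtract 1 first). It only reports gap coordinates and widths; it does not decide whether a gap is an intron or a deletion.

```python
import re

# One CIGAR operation: a length followed by an op code from the SAM op set.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Return a list of (length, op) tuples for a CIGAR string."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Re-assemble the string to catch characters the regex skipped over.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return ops


def n_gaps(pos, cigar):
    """Return (start, end) reference intervals, 0-based half-open, for each N op.

    `pos` is assumed to be the 0-based leftmost reference coordinate
    of the alignment (hypothetical convention for this sketch).
    """
    gaps = []
    ref = pos
    for length, op in parse_cigar(cigar):
        if op == "N":
            gaps.append((ref, ref + length))
        if op in "MDN=X":  # operations that consume reference bases
            ref += length
    return gaps


if __name__ == "__main__":
    for start, end in n_gaps(100, "50M422N26M"):
        print(start, end, end - start)
```

The width of each interval (`end - start`) is the value you would compare against whatever threshold you settle on for your species and technology.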

written 6.9 years ago by Devon Ryan (96k)

Would that value/threshold be contained within or described by the BAM dataset?

modified 6.9 years ago • written 6.9 years ago by Alex Reynolds (30k)

No, at least it's not within the read itself, and I would expect it to be difficult, if not impossible, to derive reliably from the dataset as a whole. I guess someone could argue that, given a set of known intron sizes, a gap smaller than the smallest 1% is liable to be a deletion. I don't think there's any informatic way to really rule out a deletion in favor of actual splicing (cf. the difficulties of microexons). You could look for canonical splice sites, but you'll always be constrained by what's known. I expect the various aligners handle this differently; it'd be interesting to go through their code to see exactly how this is handled.

Of course, if you have genomic sequence from the same samples, then life becomes easy.
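
The two heuristics mentioned above (a cutoff at the smallest 1% of known intron sizes, and a check for canonical splice sites) could be sketched roughly as follows in plain Python. All function names here are hypothetical, the 1% fraction is just the figure floated above, and the motif check assumes GT...AG on the forward strand or its reverse complement CT...AC. As already noted, this can only flag a gap as suspicious; it cannot rule out real splicing.

```python
def small_gap_cutoff(known_intron_lengths, fraction=0.01):
    """Return the length marking the shortest `fraction` of known introns."""
    lengths = sorted(known_intron_lengths)
    if not lengths:
        raise ValueError("need at least one known intron length")
    index = max(0, int(len(lengths) * fraction) - 1)
    return lengths[index]


def has_canonical_motif(ref_seq, start, end):
    """True if the gap [start, end) starts/ends with GT/AG or CT/AC.

    `ref_seq` is the reference sequence as a string; coordinates are
    0-based, half-open (an assumption of this sketch).
    """
    gap = ref_seq[start:end].upper()
    if len(gap) < 4:
        return False
    return (gap[:2], gap[-2:]) in {("GT", "AG"), ("CT", "AC")}


def looks_like_deletion(gap_length, cutoff, motif_present):
    """Flag a gap that is both unusually short and lacks a canonical motif."""
    return gap_length <= cutoff and not motif_present


if __name__ == "__main__":
    # Made-up intron lengths purely for illustration.
    cutoff = small_gap_cutoff(range(1, 1001))
    ref = "AAAA" + "GTCCCCAG" + "TTTT"
    print(cutoff, has_canonical_motif(ref, 4, 12))
```

Requiring both conditions keeps the flag conservative: a short gap that still carries a canonical motif is left alone, which matters for cases like microexon-adjacent introns.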

written 6.9 years ago by Devon Ryan (96k)

Thanks, this seems to confirm my expectations.

written 6.9 years ago by Alex Reynolds (30k)