RNA-Seq Split Gap Size Cutoff?
8.9 years ago

Is there a size cutoff consideration for gaps in paired-end RNA-seq data in BAM format?

For instance, if a read's CIGAR string is 50M422N26M, is the read supposed to be kept intact across the 50M and 26M segments because the 422N gap is too large? (And conversely, if the CIGAR string has an N segment smaller than some value, is the read then split across exon boundaries into two pieces?)
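To make the example concrete, here is a minimal sketch of how a CIGAR string such as 50M422N26M breaks into aligned reference blocks separated by N gaps. It assumes a 0-based start position and half-open intervals; the function name `split_on_n` and the start coordinate are hypothetical and not taken from any particular library.

```python
import re

# One CIGAR operation: a length followed by a single operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def split_on_n(cigar, start):
    """Split an alignment into reference blocks at each N operation.

    `start` is the (assumed 0-based) reference position of the first
    aligned base. Returns (blocks, gaps), where blocks are half-open
    (start, end) reference intervals and gaps are the N lengths.
    """
    blocks, gaps = [], []
    pos = start
    block_start = start
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            # Close the current block and skip over the gap.
            blocks.append((block_start, pos))
            gaps.append(length)
            pos += length
            block_start = pos
        elif op in "MD=X":
            # These operations consume reference bases.
            pos += length
        # I, S, H and P do not consume reference bases.
    blocks.append((block_start, pos))
    return blocks, gaps

blocks, gaps = split_on_n("50M422N26M", 1000)
print(blocks, gaps)
```

Note that the parsing itself applies no size cutoff: every N operation produces a split, whatever its length. Any threshold would have to be imposed afterwards on the `gaps` list.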

What other data in the BAM read would indicate that a split operation would not be appropriate, when an N gap is listed in the CIGAR string?

rnaseq bam • 2.6k views

Don't you mean, "...Whereas, contrariwise, if the CIGAR string has a N segment smaller than some value, then the read contains a deletion"? In general, if the width of an N operation is quite small (the value likely depends on species and sequencing technology), then you're more likely to have a deletion.


Would that value/threshold be contained within or described by the BAM dataset?


No, at least not within the read itself, and I would expect it to be difficult, if not impossible, to derive reliably from the dataset as a whole. I guess someone could argue that, given a set of known intron sizes, a gap smaller than the smallest 1% is liable to be a deletion. I don't think there's any informatic way to really rule out a deletion in favor of actual splicing (cf. the difficulties with microexons). You could look for canonical splice sites, but you'll always be constrained by what's known. I expect the various aligners handle this differently; it'd be interesting to go through their code to see exactly how.
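As a rough illustration of the "smallest 1% of known intron sizes" argument, the sketch below flags N gaps that fall under that percentile. It assumes you already have a list of known intron lengths (for example, from an annotation file); the function names, the labels, and the nearest-rank percentile method are choices made for this example, not how any aligner actually decides.

```python
import math

def nearest_rank_percentile(values, pct):
    """Return the pct-th percentile of values using the nearest-rank method."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def flag_small_gaps(gap_lengths, known_intron_sizes, pct=1.0):
    """Label each N gap by comparing it to a percentile of known intron sizes.

    A gap shorter than the cutoff is only *suspected* to be a deletion;
    as noted above, this cannot rule out genuine splicing.
    """
    cutoff = nearest_rank_percentile(known_intron_sizes, pct)
    return [
        (gap, "possible deletion" if gap < cutoff else "plausible intron")
        for gap in gap_lengths
    ]
```

The cutoff here is entirely a property of the annotation you supply, which is consistent with the point that the threshold is not stored in the BAM data and will vary by species.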

Of course, if you have genomic sequence from the same samples, then life becomes easy.


Thanks, this seems to confirm my expectations.