10.5 years ago
Alex Reynolds
Is there a size cutoff consideration for gaps in paired-end RNA-seq data in BAM format?
For instance, if the BAM file's CIGAR string is 50M422N26M, then is the read for regions across 50M and 26M supposed to be kept intact, because the value of 422N is too large? (Whereas, contrariwise, if the CIGAR string has a N segment smaller than some value, then the read is indeed split across exon boundaries into two pieces?)
What other data in the BAM read would indicate that a split operation would not be appropriate, when an N gap is listed in the CIGAR string?
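To make the "split" concrete, here is a minimal sketch in plain Python of breaking an alignment into reference blocks at each N operation. The function name split_on_n and the start position 1000 are made up for illustration, coordinates are assumed to be 0-based and half-open, and only the standard SAM CIGAR operation letters are handled. It is not how any particular tool does it.

```python
import re

# One CIGAR element: a length followed by a SAM operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def split_on_n(pos, cigar):
    """Return (blocks, gaps) for an alignment starting at reference pos.

    blocks: list of (start, end) reference intervals, 0-based half-open.
    gaps:   list of N-operation lengths, in order of appearance.
    """
    blocks, gaps = [], []
    start = ref = pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            # Close the block before the skipped region, then jump over it.
            if ref > start:
                blocks.append((start, ref))
            gaps.append(length)
            ref += length
            start = ref
        elif op in "MD=X":
            # These operations consume reference bases within a block.
            ref += length
        # I, S, H and P do not consume reference, so ref is unchanged.
    if ref > start:
        blocks.append((start, ref))
    return blocks, gaps

blocks, gaps = split_on_n(1000, "50M422N26M")
print(blocks)  # [(1000, 1050), (1472, 1498)]
print(gaps)    # [422]
```

Note that the split here is unconditional: every N produces a new block regardless of its length, which is exactly why I am asking whether a size cutoff should apply.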
Don't you mean, "...Whereas, contrariwise, if the CIGAR string has a N segment smaller than some value, then the read contains a deletion"? In general, if the width of an N operation is quite small (the value likely depends on species and sequencing technology), then you're more likely to have a deletion.
Would that value/threshold be contained within or described by the BAM dataset?
No, at least not within the read, and I would expect it to be difficult, if not impossible, to derive reliably from the dataset as a whole. I guess someone could argue that, given a set of known intron sizes, a gap smaller than the smallest 1% is liable to be a deletion. I don't think there's any informatic way to really rule out a deletion in favor of actual splicing (cf. the difficulties of microexons). You could look for canonical splice sites, but you'll always be constrained by what's known. I expect the various aligners handle this differently; it'd be interesting to go through their code to see exactly how this is handled.
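As a rough sketch of that heuristic (not what any aligner actually does), one could take the cutoff from a list of annotated intron sizes and then flag short gaps, optionally checking the reference sequence under the gap for the canonical GT...AG motif (CT...AC when the gene is on the reverse strand). The function names, labels, and example sizes below are all made up for illustration.

```python
import math

def size_cutoff(known_intron_sizes, fraction=0.01):
    """Nearest-rank percentile of the known intron sizes (default: 1%)."""
    sizes = sorted(known_intron_sizes)
    rank = max(1, math.ceil(fraction * len(sizes)))
    return sizes[rank - 1]

def has_canonical_motif(gap_seq):
    """True if the gap's reference sequence starts/ends with GT..AG,
    or CT..AC (the same motif seen from the reverse strand)."""
    s = gap_seq.upper()
    return (s[:2], s[-2:]) in (("GT", "AG"), ("CT", "AC"))

def flag_gap(gap_len, cutoff, gap_seq=None):
    """Label an N gap; the labels are hints, not proof either way."""
    if gap_len >= cutoff:
        return "intron-sized"
    if gap_seq is not None and has_canonical_motif(gap_seq):
        return "short, canonical motif"
    return "short, possible deletion"

# Hypothetical annotated intron sizes, for illustration only.
cutoff = size_cutoff([500, 90, 70, 100, 80])
print(cutoff)                                # 70
print(flag_gap(422, cutoff))                 # intron-sized
print(flag_gap(12, cutoff))                  # short, possible deletion
print(flag_gap(12, cutoff, "GTAAGTCCTTAG"))  # short, canonical motif
```

Even with the motif check, a short gap with GT...AG ends could still be a deletion that happens to look like a splice site, so this only ranks candidates.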
Of course, if you have genomic sequence from the same samples, then life becomes easy.
Thanks, this seems to confirm my expectations.