Question: How does a skipped region from a CIGAR string (N) look in the alignment?
4.2 years ago, Niek De Klein wrote:

I want to know how a skipped region in the reference (N in the CIGAR string) looks in the alignment. To explain what I mean, I use the example from the SAM format documentation (http://genome.sph.umich.edu/wiki/SAM), which does not include an N example:

Ref + read
RefPos:     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
Reference:  C  C  A  T  A  C  T  G  A  A  C  T  G  A  C  T  A  A  C
Read: ACTAGAATGGCT

Alignment
RefPos:     1  2  3  4  5  6  7     8  9 10 11 12 13 14 15 16 17 18 19
Reference:  C  C  A  T  A  C  T     G  A  A  C  T  G  A  C  T  A  A  C
Read:                   A  C  T  A  G  A  A     T  G  G  C  T

Cigar:
POS: 5
CIGAR: 3M1I3M1D5M

Now, at position 11 the reference contains a base (C) that the read lacks, which the CIGAR string records as a deletion (1D). However, I would think that you can't distinguish between a skipped region and such a deletion. Therefore the CIGAR string could also have been 3M1I3M1N5M.

So how does the alignment of a skipped region differ from that of a deletion from the reference? Is it only a skipped region if the C at position 11 is an N?
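To make the comparison concrete, here is a minimal Python sketch that lays out the example read against the reference for both CIGAR strings. The function name `render_alignment` and the display symbols (`-` for a D column, `.` for an N column) are choices made for this illustration only, and the sketch handles just the M, =, X, I, D and N operations.

```python
import re

# One CIGAR element: a length followed by an operation letter.
CIGAR_RE = re.compile(r"(\d+)([MIDN=X])")

def render_alignment(ref, read, pos, cigar):
    """Return (ref_row, read_row) as equal-length strings.

    pos is the 1-based leftmost reference position of the read.
    Display symbols are illustrative: '-' marks a deleted base (D),
    '.' marks a skipped base (N), ' ' pads an insertion in the ref row.
    """
    r = pos - 1              # 0-based index into the reference
    q = 0                    # 0-based index into the read
    ref_row = list(ref[:r])  # reference bases left of the read
    read_row = [" "] * r
    for length, op in CIGAR_RE.findall(cigar):
        for _ in range(int(length)):
            if op in "M=X":      # consumes reference and read
                ref_row.append(ref[r]); read_row.append(read[q])
                r += 1; q += 1
            elif op == "I":      # consumes read only
                ref_row.append(" "); read_row.append(read[q])
                q += 1
            else:                # D or N: consumes reference only
                ref_row.append(ref[r])
                read_row.append("-" if op == "D" else ".")
                r += 1
    ref_row.extend(ref[r:])      # reference bases right of the read
    read_row.extend(" " * (len(ref) - r))
    return "".join(ref_row), "".join(read_row)

ref = "CCATACTGAACTGACTAAC"
read = "ACTAGAATGGCT"
for cigar in ("3M1I3M1D5M", "3M1I3M1N5M"):
    ref_row, read_row = render_alignment(ref, read, 5, cigar)
    print(cigar)
    print(ref_row)
    print(read_row)
```

Both CIGAR strings place every read base in the same column; the only difference is the label attached to the reference base at position 11.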

4.2 years ago, Devon Ryan wrote:

There's no a priori way to always distinguish between deletions (D CIGAR operations) and splicing (N CIGAR operations). In practice, most RNAseq aligners (e.g., tophat2 and STAR) have parameters with semi-arbitrary thresholds for the minimum intron size or maximum deletion size. In the case of STAR, any gap less than alignIntronMin (21 bases last I looked) is considered a deletion. I can't recall exactly what the tophat2 option for this is off-hand, but it's in there somewhere.
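The threshold rule described above can be sketched roughly as follows. This is only an illustration of the idea, not how STAR is actually implemented; the function name `classify_gap` is hypothetical, and the default of 21 is simply the alignIntronMin value mentioned above.

```python
def classify_gap(gap_len, align_intron_min=21):
    """Label a reference gap in a read alignment as 'D' or 'N'.

    Simplified sketch: gaps shorter than align_intron_min are called
    deletions (D); anything at or above the threshold is called a
    skipped region, i.e. a putative intron (N). A real aligner also
    weighs scores, splice motifs and annotations, ignored here.
    """
    if gap_len < 1:
        raise ValueError("gap length must be positive")
    return "D" if gap_len < align_intron_min else "N"

for gap in (1, 20, 21, 9989):
    print(gap, classify_gap(gap))
```

Raising `align_intron_min` in this sketch moves more gaps into the deletion category, which is the kind of tweak one might consider for samples expected to carry large deletions.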

It's probably worth pointing out that the defaults for these parameters may need changing in some cases. For example, I suspect that anyone interested in splicing changes in cancer cells, where there are a bunch of deletions, might need to tweak them (though presumably one would do WGS or WES alongside).

It was not clear to me that N CIGAR operations are supposed to represent introns; this makes sense now. Would it make sense to mask known introns in the reference sequence to make the D/N assignment less arbitrary? I just quickly looked for a paper on indel sizes, and this one: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1557762/ mentions indels of up to 9989 bp in size. So, if I understand you correctly, with the STAR default value any indel above 21 bases would wrongly be considered an intron?

Comment written 4.2 years ago by Niek De Klein

That's a reasonable approach. Note that things are likely to work differently depending on whether one supplies a GTF file. I would presume that, if an annotation is available, STAR looks at it first to determine possible splicing, though you'd still be correct that any deletion (or indel, as you pointed out) >= 21 bases would be classified as a splicing event by default. One possible way around that would be to somehow specify that only annotated exon boundaries are allowed (STAR probably has an option for that already). Realistically speaking, though, people who are really interested in finding indels should probably just sequence the DNA rather than the RNA. Then any apparent event like this will be a deletion, since splicing events never occur in DNA.
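As a rough after-the-fact check along these lines, one could list the reference intervals covered by N operations in each alignment and flag those that do not match a known intron. The sketch below assumes 1-based inclusive coordinates and a hypothetical set `annotated_introns` of (start, end) pairs; the function names are made up for this example and are not part of STAR or any other tool.

```python
import re

# One CIGAR element: a length followed by an operation letter.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

def skipped_regions(pos, cigar):
    """Return (start, end) reference intervals for every N operation.

    pos is the 1-based leftmost position; intervals are 1-based inclusive.
    """
    ref_pos = pos
    regions = []
    for length, op in CIGAR_RE.findall(cigar):
        n = int(length)
        if op == "N":
            regions.append((ref_pos, ref_pos + n - 1))
        if op in "MDN=X":    # operations that consume the reference
            ref_pos += n
    return regions

def unannotated_gaps(pos, cigar, annotated_introns):
    """Return N intervals absent from the (hypothetical) annotation set."""
    return [g for g in skipped_regions(pos, cigar)
            if g not in annotated_introns]

# Hypothetical annotation containing a single intron.
annotated_introns = {(110, 609)}
print(unannotated_gaps(100, "10M500N10M", annotated_introns))
print(unannotated_gaps(100, "10M30N10M", annotated_introns))
```

Gaps reported by `unannotated_gaps` are candidates for being deletions misclassified as introns, though they could equally be novel splice junctions.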

Reply written 4.2 years ago by Devon Ryan