Question: SAM file: same reads ID, slightly different sequence
1
gravatar for davide.chiarugi
2.4 years ago by
davide.chiarugi20 wrote:

After aligning with STAR data coming from RNAseq to a bacterial genome, I notice something (for me) unexpected in the SAM output file. As it can be seen from the lines below, some reads having the same ID (first column) have alightly different sequences (10th column) . Why so ?

HWI-ST1299:109:C3ELBACXX:7:1104:21137:23785 16  gi|15674250|ref|NC_002737.1|    1578076 0   45M *   0   0   CCCCAACTACATCAGGCGTTCTAGGGCTTAACTGCTGTGTTCGGC   IEGIGHEGFGGBJJJJJJJIIJJJIHGHEHIHCJJJJIHFHHHFD   NH:i:6  HI:i:1  AS:i:44 nM:i:0
HWI-ST1299:109:C3ELBACXX:7:1104:21137:23785 256 gi|15674250|ref|NC_002737.1|    269351  0   45M *   0   0   GCCGAACACAGCAGTTAAGCCCTAGAACGCCTGATGTAGTTGGGG   DFHHHFHIJJJJCHIHEHGHIJJJIIJJJJJJJBGGFGEHGIGEI   NH:i:6  HI:i:2  AS:i:44 nM:i:0
HWI-ST1299:109:C3ELBACXX:7:1104:21137:23785 272 gi|15674250|ref|NC_002737.1|    1331147 0   45M *   0   0   CCCCAACTACATCAGGCGTTCTAGGGCTTAACTGCTGTGTTCGGC   IEGIGHEGFGGBJJJJJJJIIJJJIHGHEHIHCJJJJIHFHHHFD   NH:i:6  HI:i:3  AS:i:44 nM:i:0
HWI-ST1299:109:C3ELBACXX:7:1104:21137:23785 256 gi|15674250|ref|NC_002737.1|    22063   0   45M *   0   0   GCCGAACACAGCAGTTAAGCCCTAGAACGCCTGATGTAGTTGGGG   DFHHHFHIJJJJCHIHEHGHIJJJIIJJJJJJJBGGFGEHGIGEI   NH:i:6  HI:i:4  AS:i:44 nM:i:0
HWI-ST1299:109:C3ELBACXX:7:1104:21137:23785 256 gi|15674250|ref|NC_002737.1|    28064   0   45M *   0   0   GCCGAACACAGCAGTTAAGCCCTAGAACGCCTGATGTAGTTGGGG   DFHHHFHIJJJJCHIHEHGHIJJJIIJJJJJJJBGGFGEHGIGEI   NH:i:6  HI:i:5  AS:i:44 nM:i:0
HWI-ST1299:109:C3ELBACXX:7:1104:21137:23785 256 gi|15674250|ref|NC_002737.1|    84283   0   45M *   0   0   GCCGAACACAGCAGTTAAGCCCTAGAACGCCTGATGTAGTTGGGG   DFHHHFHIJJJJCHIHEHGHIJJJIIJJJJJJJBGGFGEHGIGEI   NH:i:6  HI:i:6  AS:i:44 nM:i:0
rna-seq sam format alignment • 980 views
ADD COMMENTlink modified 2.4 years ago • written 2.4 years ago by davide.chiarugi20

Thanks for your answer Pierre. My problem is that, if I am not wrong, the 10th column should report the sequence of the read and, thus, I'm expecting that given the same ID in the first column I should have the same read in the 10th column. This case is confusing me, because, apparently, all the reads, even though multi-mapped (as indicated by the SAM flags) look like to map equally well (MAPQ =0, AS=44 for all of them, nM=0 for all of them). In such a case, according to the STAR developers, the primary alignment flag is assigned randomly. Thus, I'm wondering how could it be that reads with (slightly) different sequences:

1) have the same ID 2) map equally well

ADD REPLYlink written 2.4 years ago by davide.chiarugi20

I should have the same read in the 10th column.

wrong: the sequence has been reverse-complemented because the flag says it's on the reverse strand.

map equally well

yes, and a random read is chosen as the 'primary' alignment.

ADD REPLYlink written 2.4 years ago by Pierre Lindenbaum119k

Thanks you indeed for your prompt response. Probably I am misunderstanding something: what does the 10th column exactly report ? The sequence of the "sequenced read" or the sequence to which the read has been aligned ?

ADD REPLYlink written 2.4 years ago by davide.chiarugi20
3

It's the bases in the sequenced read, presented as they would appear on the reference's positive strand. Thus if two reads have been mapped to the same position, they will look the same (modulo sequencing errors etc) in the SAM file, regardless of which strands actually went into the sequencing machine.

[Equivalently to Pierre's reply, but differently phrased.]

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by John Marshall1.5k

The sequence of the "sequenced read" or the sequence to which the read has been aligned ?

its 'the sequence of the read, rev-compl is the read was mapped on the reverse strand, truncated if the read is hard-clipped (see CIGAR string).

ADD REPLYlink written 2.4 years ago by Pierre Lindenbaum119k

Thank you indeed for the clarifications !

ADD REPLYlink written 2.4 years ago by davide.chiarugi20
2
gravatar for Pierre Lindenbaum
2.4 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum119k wrote:

see http://broadinstitute.github.io/picard/explain-flags.html

reads with flag = 256 are secondary alignments, 272= secondary alignments+ reverse strand.

ADD COMMENTlink written 2.4 years ago by Pierre Lindenbaum119k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 672 users visited in the last hour