Samtools problems with length and position
2
1
Entering edit mode
9.6 years ago
dahliashvets ▴ 10

Hello, I am a beginner and trying to understand samtools.

I have no problem getting samtools to do what I want, I have trouble interpreting the results and the different fields in the format. For instance when I say:

samtools view file.bam chr20:628200-628400

Why do I get a value with position values equal to : 628004-628107?

This is obviously not in my specified range so why did it come up?

Second question, the "length" column in samtools is not equal to the actual length of my sequence. It can range anywhere from -202 and 173, but my reads are all 100bp.

Any ideas?

I am ever grateful because this has been driving me totally insane.

Thank you!

samtools bam • 3.7k views
ADD COMMENT
0
Entering edit mode

Please show us a few reads printed from samtools view file.bam chr20:628200-628400. I want to see the chromosomes and the flags.

The length is the distance between the two reads . Negative is reverse is mapped before the Forward.

ADD REPLY
0
Entering edit mode

This is the command that I give it below and following that is the output:

samtools view HG00096.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.sorted.bam 20:628200-628400
SRR062641.20611504    147    20    628107    60    100M    =    628004    -202    GTGACACTGTCCTAGTCCCAGTTCCAGGCCAAGAACTCAAGAAGTCTTTCAGCTTCTACTCTCACTCTCTTTGGAATCCTGAAGCCATCATGTGAATAAG    >C>7DDEIGENIILII;RIOHIJJKSRISNMSHQJMIRQSQQSQMRQRMRSRRQLQFHKQQQQHRQQQQQQLRQPPKRJKPPRQQQOPPOPNPNNMLLM0    X0:i:1    X1:i:0    MD:Z:100    RG:Z:SRR062641    AM:i:37    NM:i:0    SM:i:37    MQ:i:60    XT:A:U    BQ:Z:@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
SRR062641.18258344    147    20    628111    60    100M    =    628033    -177    CACTGTCCTAGTCCCAGTTCCAGGCCAAGAACTCAAGAAGTCTTTCAGCTTCTACTCTCACTCTCTTTGGAATCCTGAAGCCATCATGTGAATAAGCCTG    DEDGDDFFFAGJRKKKPJJJJKRJNJPNLQERMJMRQDSQGRRRERSRRRQRPGRQRGQGQKQQRQQRRJPJQRQQPPRQQQPPQOQOPONNNNPINML0    X0:i:1    X1:i:0    MD:Z:100    RG:Z:SRR062641    AM:i:37    NM:i:0    SM:i:37    MQ:i:60    XT:A:U    BQ:Z:@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
SRR062634.12538995    83    20    628174    60    4S96M    =    628086    -183    ACTCTCTTTGGAATCCTGAAGCCATCATGTGAATAAGCCTGAGCTAGCCTGCTGGAGAGGCCACATGGAGGAGAACCAAGCAATTCCAGCCAACAGCCTG    #####@CDFB9@A3B<CAEJ<JJ?JIFHEIFIH9IJ?IGGDDDEBNFIHIHGRHQRQRRRGQEAPLRQRQQR=PPEQPRQQPPQQRNRQQQOOPQPNLK0    X0:i:1    X1:i:0    XC:i:96    MD:Z:96    RG:Z:SRR062634    AM:i:37    NM:i:0    SM:i:37    MQ:i:60    XT:A:U    BQ:Z:@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
SRR062635.21895688    99    20    628282    60    100M    =    628366    183    CAAGTGAGTAAGGCTATCCTAGACCATCCAGGCCCAGCTGAGGTAACAGCTGACCTTAGCTCCATGAGGGATCCCAGGTGTGAGCAGCAGAACTGCCCTG    /FFE@JI>IHQDEQROPPQRPQQPRRPQQRGLQ>RMLLIB;DD8?DGFDDKMGKMIHHMRDLN=CBA@@BEDH@I?C=4E>E:C<HEDFB=D@DCB@??A    X0:i:1    X1:i:0    MD:Z:100    RG:Z:SRR062635    AM:i:37    NM:i:0    SM:i:37    MQ:i:60    XT:A:U    BQ:Z:@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
SRR062641.20651099    163    20    628323    60    86M14S    =    628421    173    GGTAACAGCTGACCTTAGCTCCATGAGGGATCCCAGGTGTGAGCAGCAGAACTGCCCTGTTGAGCCCAGCCCAAATTGCCAACCCAAAGGATTGCGGGAA    0?C><AA<CA==7<D192.959B2;:=*<98<C<>=8.:3:CDAF45BDD5<GB:GF7D<AC056F806:F;05.<(<.<C?:<@###############    X0:i:1    X1:i:0    XC:i:86    MD:Z:86    RG:Z:SRR062641    AM:i:37    NM:i:0    SM:i:37    MQ:i:60    XT:A:U    BQ:Z:@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
SRR062635.21895688    147    20    628366    60    100M    =    628282    -183    CAGCAGAACTGCCCTGTTGAGCCCAGCCCAAATTGCCAACCCATAGGATTGTGGGAAAATACAGGGTTTGTTTCTAGCTGCCAAGTTTGGGGATGGTTTG    ?2:DA>DAEG=F;FGA?KLG6I@KGH>JRPHAIFRMMHDIAIHGMRKGMRPB@Q:PGPPPGQIQR:QQRPKGPQIRQQQQKEOQIPPPPPPNNOOMMML0    X0:i:1    X1:i:0    MD:Z:100    RG:Z:SRR062635    AM:i:37    NM:i:0    SM:i:37    MQ:i:60    XT:A:U    BQ:Z:@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
SRR062635.14867556    99    20    628384    60    100M    =    628471    186    GAGCCCAGCCCAAATTGCCAACCCATAGGATTGTGGGAAAATACAGGGTTTGTTTCTAGCTGCCAAGTTTGGGGATGGTTTGTTACACAGCAAAAGCTAA    0JLMIPQQPQQQQQOPJQQRKPRQLPPRRFPKQAQGRFGLRPPPRRSRCLQRILGQSHGMKRQQJAEGEMKJJN>IKG6FEKHHHILIK?HJAHCDCD@F    X0:i:1    X1:i:0    MD:Z:100    RG:Z:SRR062635    AM:i:37    NM:i:0    SM:i:37    MQ:i:60    XT:A:U    BQ:Z:@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@D
SRR062635.21823539    99    20    628392    60    96M4S    =    628484    191    CCCAAATTGCCAACCCATAGGATTGTGGGAAAATACAGGGTTTGTTTCTAGCTGCCAAGTTTGGGGATGGTTTGTTACACAGCAAAAGCTAACTGGTACA    0K=NPPOOJQ<KKIJCFDEJF@EFGHRRFFQQGEPKMKMLIQLILQGGCEBMHMMII?DDMINNNJADNIBF@FE=CBF>HBHLJH9C>DDA=DD#####    X0:i:1    X1:i:0    XC:i:96    MD:Z:96    RG:Z:SRR062635    AM:i:37    NM:i:0    SM:i:37    MQ:i:60    XT:A:U    BQ:Z:@I@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
ADD REPLY
0
Entering edit mode

Also, could you elaborate on what you mean by the length is the distance between two reads? Because one read begins at 628004 and the next begins at 628033. The distance between those two reads is not 202 or 177 which is what the length value is for both of those reads...

ADD REPLY
3
Entering edit mode
9.6 years ago

Q: Why do I get a value with position values equal to : 628004-628107? This is obviously not in my specified range so why did it come up?

A: Samtools will output alignments that fully or partially overlap the given coordinates. In simple words if the alignment starts or ends within the given coordinates samtools will report it. In your case the first alignment starts at 628107 but ends at 628207 (628107 + 100 bp read length) and the end point falls in the given range 628200-628400.

Q: Second question, the "length" column in samtools is not equal to the actual length of my sequence. It can range anywhere from -202 and 173, but my reads are all 100bp.

A: The ninth column in the SAM file doesn't represent read length but TLEN (~insert size) for the paired end library which is number of bases from the leftmost mapped base to the rightmost mapped base. For leftmost segment or read in the paired end library the TLEN will have a plus sign and vice-versa. For more information check this out: http://samtools.github.io/hts-specs/SAMv1.pdf

ADD COMMENT
0
Entering edit mode

The alignment does not end at 628207, it starts at 628004 and ends at 628107, neither of which are in the specified range.

Any other thoughts as to why it might be appearing?

Thank you for your answers.

I still don't quite understand the TLEN value though. So if I have two reads that are part of the same pair the TLEN is the distance from the leftmost base of the first read to the rightmost base of the second read?

And if I have two reads that are a pair, how do I know which two are pairs? Will some sort of flag signify this?

ADD REPLY
2
Entering edit mode

First of all go through the pdf from the link that I have posted above. Then google for single end reads versus paired-end reads and try to read about them.

You said "The alignment does not end at 628207, it starts at 628004 and ends at 628107". This is wrong. It means the alignment starts at 628107 (4th column). The other number "628004" is the starting point of the alignment of its mate.

Your data is a paired end data. For single end data the value in the ninth column is always zero. Another proof is read "SRR062635.21895688" which appears twice. The first alignment is for the leftmost strand represented by

SRR062635.21895688    99    20    628282    60    100M    =    628366    183 

Here "628282" represents the starting point of this alignment and "628366" represents starting point of its mate or the rightmost alignment.

Similarly, the second alignment with the same read name represents alignment of the rightmost read of the read-pair

SRR062635.21895688    147    20    628366    60    100M    =    628282    -183 

Here the numbers have been switched and the sign of the TLEN changed.

ADD REPLY
0
Entering edit mode

Thank you a a million times over for helping me understand this! I have been reading about this for a few days and was unable to gather enough information from the manual to fully understand what was going on. So just to refresh, since this is pair-end sequencing and we have two reads, the pair and its mate, are these two reads on opposite strands? Or on the same strands? My professor says they are on the same but I have read online that they are on opposite strands.

THANK YOU AGAIN!

ADD REPLY
0
Entering edit mode

They'll normally map to opposite strands but will only give information about one strand, since single stranded DNA is what's actually sequenced. I realize that this will make no sense when you read it, so I recommend watching a video on paired-end sequencing, since this is something that's easier (for me at least) to follow visually.

ADD REPLY
0
Entering edit mode

Thanks! I actually watched this video earlier this week. Still don't totally understand the end of it though. Why won't they just make their data more blatant and tell you exactly what they did.

What do you mean by they only give information about one strand? Like in my BAM file, once I find a read and its mate, are they on the same strand?

ADD REPLY
0
Entering edit mode

Normally mates will map to opposite strands (you can use the flag field in a BAM or SAM file to check this). However, consider something like RNAseq. Genes can be on opposite strands and overlap, but we'd still like to be able to determine from which of the overlapping genes a given pair of reads originated. Since we start off with single stranded DNA when we do sequencing, the reads are only conveying information about a single strand...the one from which the fragment we're sequencing arose.

I should note that this is only important for things like RNAseq and bisulfite sequencing. If you're looking for SNPs then you're not going to care about the directionality of your sequencing libraries.

ADD REPLY
0
Entering edit mode

No, the alignment starts at 628107 (column 4) and extends through 628206 (this isn't stored in the SAM or BAM formats), its mate starts its alignment at 628004 (column 8).

ADD REPLY
1
Entering edit mode
9.6 years ago

From your SAM, the BAM has been created using a reference sequence without the chr prefix . So you should try:

samtools view file.bam  20:628200-628400

and not

samtools view file.bam  chr20:628200-628400
ADD COMMENT
0
Entering edit mode

Thank you for your post but this will not help. The reason I excluded the "chr" from my command is because if you will see in my file it is excluded. If I were to include the "chr" it will give me an error. This BAM file format is slightly off from the norm. This is why it is not included in my command.

ADD REPLY

Login before adding your answer.

Traffic: 3138 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6