Question: Unmapped Reads And Sequence Name
1
gravatar for Stefano Berri
6.8 years ago by
Stefano Berri4.0k
Cambridge, UK
Stefano Berri4.0k wrote:

Hi. I have two questions that might or might not be related.

I want to filter all and only unmapped reads from a pair end file.

I type

samtools view -f 4 myFile.bam

According to samtools, -f 4 should 'only output alignments' where the query itself is unmapped.

However the beginning of the produced output is like

HWI-ST300_0110:7:21:1528:160264#0       117     chr10   67604   0       *       =       67604   0       <SEQ>    <QUAL>
HWI-ST300_0110:7:7:10624:6076#0 117     chr10   78098   0       *       =       78098   0       <SEQ>    <QUAL>
HWI-ST300_0110:7:5:15368:185558#0       69      chr10   78778   0       *       =       78778   0       <SEQ>    <QUAL>

These are not unmapped. The mate is, not the query, right? Later in the output there are also the unmapped reads. With flag -f 12, I correctly get those where both query and mate pair are unmapped. Why is this happening?

The second question is about sequence name. The fastq files I was provided with, have the name like:

@HWI-ST300:130:B08M9ABXX:2:1101:1137:1993 2:Y:0:

with the space where it is shown.

That means I have both pairs that have the very same name in the resulting bam files (HWI-ST300:130:B08M9ABXX:2:1101:1170:1992). Can this be an issue? What are the specifications about sequence names in fastq format? According to this paper spaces don't seem to be an issue, but yet bwa trim after the space.

Thanks

ADD COMMENTlink written 6.8 years ago by Stefano Berri4.0k
2
gravatar for Istvan Albert
6.8 years ago by
Istvan Albert ♦♦ 77k
University Park, USA
Istvan Albert ♦♦ 77k wrote:

The easiest way to see what a flag means is to visit the Explain SAM flags page and verify what your flags (in this case 117 and 69) stand for. For example:

117: read paired, read unmapped, read on reverse strand, mate on reverse strand, first in pair

So the output is correct, what it probably means is that the read and its mate are both on the reverse strand therefore the mapping of this read (and its mate) are questionable, thus it got the unmapped designation.

Question 2: while in principle the space should not matter by the standard,in general these should be avoided since some tools make assumptions (plus the spaces can be visually misleading). I would replace the spaces with an underscore or minus

ADD COMMENTlink written 6.8 years ago by Istvan Albert ♦♦ 77k

Thanks! Very useful!. But so, they are "officially" unmapped, despite having a mapping field. Confusing. 69 means: read paired, read unmapped, first in pair. Even more confusing. Anyway, I guess I will filter the others with -F 4 to make sure I don't have the same sequence in two different files after splitting.

ADD REPLYlink written 6.8 years ago by Stefano Berri4.0k

I think there is no standard designation for what 'unmapped' should mean - I usually take it as in indication that the measurement is unreliable.

ADD REPLYlink written 6.8 years ago by Istvan Albert ♦♦ 77k
2
gravatar for Bach
6.8 years ago by
Bach540
Bach540 wrote:

You might want to split posts to Biostar if they contain different, not related questions. That helps to better find relevant Q & A after some time.

Anyway, regarding question number 2: in their latest pipeline, Illumina changed the naming scheme of their reads by leaving out the pair-identifier from the name component and pushing it to the 'comment' component.

This change is most unfortunate as it breaks all tools I know which use the naming of a read to determine whether it is in a pair or not. Furthermore, it makes searching for reads by name impossible as in paired-end/mate-pair sequencing you'll get two reads as answer. Oh, the fun. Illumina devs should be tarred and feathered for this.

If you want to avoid problems: rename your reads to follow "usual" naming styles with a postfix by pulling the pair identifier to the read name and concatenating it with a slash. I.e.: change

@HWI-ST300:130:B08M9ABXX:2:1101:1137:1993 1:Y:0: @HWI-ST300:130:B08M9ABXX:2:1101:1137:1993 2:Y:0:

to

@HWI-ST300:130:B08M9ABXX:2:1101:1137:1993/1 1:Y:0: @HWI-ST300:130:B08M9ABXX:2:1101:1137:1993/2 2:Y:0:

ADD COMMENTlink written 6.8 years ago by Bach540
3

oh my - these are the times when you can clearly sense that the people making these decisions don't have to to any analysis themselves - let's prepare ourselves to the posts - how do I convert my read names to a different type of read name! As if we did not have enough problems already.

ADD REPLYlink written 6.8 years ago by Istvan Albert ♦♦ 77k

Thanks for the heads-up on this, just stumbled on this problem today and would have wasted a lot of time if not for your post

ADD REPLYlink written 6.7 years ago by Casey Bergman17k
2
gravatar for Swbarnes2
6.8 years ago by
Swbarnes21.4k
Swbarnes21.4k wrote:

"For a unmapped paired-end or mate-pair read whose mate is mapped, the unmapped read should have RNAME and POS identical to its mate."

http://samtools.sourceforge.net/SAM1.pdf

That's why unmapped reads can have coordinates. If the 4 is flagged, it's unmapped, regardless of what positions and reference names, and CIGAR strings suggest.

Two reads with the same name are totally fine. You will be able to tell which came from which fastq by the flag (reads from the first fastq have 64 flagged, reads from the second fastq have 128 flagged). In fact, for some functions, the reads have to be the same. I had a bunch of reads with :a and :b to indicate which read they were, and Picard's MarkDuplicates refused to flag any of them as duplicates, because it wanted pairs to have the exact same name.

ADD COMMENTlink written 6.8 years ago by Swbarnes21.4k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1844 users visited in the last hour