Question: How To Get Unmapped Reads From A Bam File Obtained Using Gsnap And Convert It Into A Fastq File?
0
gravatar for komal.rathi
4.0 years ago by
komal.rathi3.0k
Children's Hospital of Philadelphia, Philadelphia, PA
komal.rathi3.0k wrote:

Hi everyone,

I have been looking for a way to get unmapped reads from the bam file obtained using GSNAP and convert it into a fastq file but I don't know what are the flags used by GSNAP to mark unmapped reads from mapped ones. Any suggestions?

Thanks

fastq bam • 3.1k views
ADD COMMENTlink modified 4.0 years ago • written 4.0 years ago by komal.rathi3.0k
1
gravatar for Devon Ryan
4.0 years ago by
Devon Ryan73k
Freiburg, Germany
Devon Ryan73k wrote:

Gsnap uses the same flags as every other aligner (i.e., 0x4 means unmapped), there's a specification. Something like the following will convert to fastq:

samtools view -f 0x4 foo.bam | awk 'BEGIN{OFS=""}{print ">",$1,"\n",$10,"\n+\n",$11;}' > foo.unmapped.fastq

Edit: Given your comment, if you want to keep pairs intact, then something like the following will work (it assumes that mates have the same name and are sequential, so don't expect it to work otherwise):

#!/usr/bin/env python
import csv
import sys

if(len(sys.argv) != 2) :
    print("Usage: samtools view -f 0x4 foo.bam | %s prefix\n" % sys.argv[0])
    sys.exit(1)

f = csv.reader(sys.stdin, dialect="excel-tab")
prefix = sys.argv[1]
paired1 = open("%s_1.fastq" % prefix, "w")
paired2 = open("%s_2.fastq" % prefix, "w")
unpaired = open("%s.unpaired.fastq" % prefix, "w")
last = None
for line in f :
    if(last == None) :
        last = [line[0], line[9], line[10]]
    else :
        if(last[0] == line[0]) :
            paired1.write("%s\n%s\n+\n%s\n" % (last[0], last[1], last[2]))
            paired2.write("%s\n%s\n+\n%s\n" % (line[0], line[9], line[10]))
            last = None
        else :
            unpaired.write("%s\n%s\n+\n%s\n" % (last[0], last[1], last[2]))
            last = [line[0], line[9], line[10]]
if(last != None) :
    unpaired.write("%s\n%s\n+\n%s\n" % (last[0], last[1], last[2]))

paired1.close()
paired2.close()
unpaired.close()

If you saved that as foo.py and made it executable, then something like samtools view -f 0x4 somefile.bam | ./foo.py blah will produce blah_1.fastq, blah_2.fastq and blah.unpaired.fastq.

ADD COMMENTlink modified 4.0 years ago • written 4.0 years ago by Devon Ryan73k

By the way I used: samtools view -f 0x0004 -h foo.bam | java -jar /opt/picard/SamToFastq.jar INPUT=/dev/stdin FASTQ=foo_unmapped_R1.fastq SECOND_END_FASTQ=foo_unmapped_R2.fastq But because I am specifying that my reads are paired-end and there are some reads with one mate mapped and the other one unmapped, it is giving me an exception. I will try your suggestion. Thanks.

ADD REPLYlink written 4.0 years ago by komal.rathi3.0k

Ah, yeah, my solution won't keep pairs as pairs, but will make everything single-end. If need pairs kept as pairs, then just throw together a little python/perl/whatever script (it can probably be done in awk, but it'd be easier not to).

ADD REPLYlink written 4.0 years ago by Devon Ryan73k

I don't want the unpaired reads, just the paired ones. Update: I checked your script for one sample and it gave me the same number of lines in the R1 & R2 fastq files as SamToFastq. Which means that SamToFastq ignores unpaired reads and only writes paired-reads when SECOND_END_FASTQ is specified. Thanks a lot for the help!

ADD REPLYlink modified 4.0 years ago • written 4.0 years ago by komal.rathi3.0k

Good to know, glad that worked for you.

ADD REPLYlink written 4.0 years ago by Devon Ryan73k
0
gravatar for komal.rathi
4.0 years ago by
komal.rathi3.0k
Children's Hospital of Philadelphia, Philadelphia, PA
komal.rathi3.0k wrote:

Umm, I retested a few samples and the python script also includes reads that look like:

HWI-ST1006:111:D2AADACXX:3:2211:9255:84265    165    *    0    0    *    chr10    48221505    0    TTTAACCCTTCTGCCAATCCAGAGGCAAGCACAATATTCCAGAGGAACTCTCAAACAGATGTTGTAGAAATAAGAAGAAGCAACTGTACAAACCATTGT    @@FFFFFHHHHHJJJJJJJJJJJJJJJJJIJJJIJJJJJJJJJJJJJJJIJJJJJJJJJJJJJJHIJJJJHHHHHHFFFFFEEEDEDFDEDDDDDDDDE    RG:Z:C00079
HWI-ST1006:111:D2AADACXX:3:2211:9255:84265    133    *    0    0    *    chr10    47197548    0    TTTAACCCTTCTGCCAATCCAGAGGCAAGCACAATATTCCAGAGGAACTCTCAAACAGATGTTGTAGAAATAAGAAGAAGCAACTGTACAAACCATTGT    @@FFFFFHHHHHJJJJJJJJJJJJJJJJJIJJJIJJJJJJJJJJJJJJJIJJJJJJJJJJJJJJHIJJJJHHHHHHFFFFFEEEDEDFDEDDDDDDDDE    RG:Z:C00079

As you can see the reads are mapped to chr10. Your script looks perfect, dont know what's the problem.

Essentially it should only include unpaired-umapped reads, for e.g.:

HWI-ST1006:111:D2AADACXX:3:1101:4029:2482    77    *    0    0    *    *    0    0    AATGGGGGGGGAAAAACCCGAAACCCCCCCCCAACACCCCCACCCTCTTTCCCCCCTTTTCTTTTTAAAAAAAAAAAAAAACAAAAAAAAGAAACACAC    ###################################################################################################    RG:Z:C00079
HWI-ST1006:111:D2AADACXX:3:1101:4029:2482    141    *    0    0    *    *    0    0    AAACGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGATTGGGGGGGGGGGGGGGGTGTGAAAAAAAAAAAAAAAAAAAAAAAAAAAAA    ###################################################################################################    RG:Z:C00079

This works:

samtools view -f 12 -h foo.bam | java -jar /opt/picard/SamToFastq.jar INPUT=/dev/stdin FASTQ=foo_R1.fastq foo_R2.fastq

The flag 12 comes from here. It will only filter-in a read that is unmapped and whose mate is also unmapped.

ADD COMMENTlink written 4.0 years ago by komal.rathi3.0k

None of the reads you posted are mapped. If the 0x4 bit is set in the flag then you can ignore the rest.

ADD REPLYlink written 4.0 years ago by Devon Ryan73k

Oh really? I don't know how can I test the validity of SamToFastq then? When I run your script it includes the first set of reads and when I use SamToFastq it doesn't have those reads. I used -f 12 instead because it filters-in unmapped reads where both mates are unmapped.

ADD REPLYlink modified 4.0 years ago • written 4.0 years ago by komal.rathi3.0k

You can always look at the source code. It should just be doing something similar to what the python script I posted does. BTW, there's a difference between 0x12 and 12 (I only mention this because you used 12 in your post and 0x12 in your comment). 0x12=18, so you'd get properly paired reads that are in the reverse orientation.

ADD REPLYlink modified 4.0 years ago • written 4.0 years ago by Devon Ryan73k

Thanks, I dint really know the difference. Although, I checked my commandline and it says 12 so I am relieved.

ADD REPLYlink written 4.0 years ago by komal.rathi3.0k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1336 users visited in the last hour