Gsnap uses the same flags as every other aligner (i.e., 0x4 means unmapped), there's a specification. Something like the following will convert to fastq:
samtools view -f 0x4 foo.bam | awk 'BEGIN{OFS=""}{print ">",$1,"\n",$10,"\n+\n",$11;}' > foo.unmapped.fastq
Edit: Given your comment, if you want to keep pairs intact, then something like the following will work (it assumes that mates have the same name and are sequential, so don't expect it to work otherwise):
#!/usr/bin/env python
import csv
import sys
if(len(sys.argv) != 2) :
print("Usage: samtools view -f 0x4 foo.bam | %s prefix\n" % sys.argv[0])
sys.exit(1)
f = csv.reader(sys.stdin, dialect="excel-tab")
prefix = sys.argv[1]
paired1 = open("%s_1.fastq" % prefix, "w")
paired2 = open("%s_2.fastq" % prefix, "w")
unpaired = open("%s.unpaired.fastq" % prefix, "w")
last = None
for line in f :
if(last == None) :
last = [line[0], line[9], line[10]]
else :
if(last[0] == line[0]) :
paired1.write("%s\n%s\n+\n%s\n" % (last[0], last[1], last[2]))
paired2.write("%s\n%s\n+\n%s\n" % (line[0], line[9], line[10]))
last = None
else :
unpaired.write("%s\n%s\n+\n%s\n" % (last[0], last[1], last[2]))
last = [line[0], line[9], line[10]]
if(last != None) :
unpaired.write("%s\n%s\n+\n%s\n" % (last[0], last[1], last[2]))
paired1.close()
paired2.close()
unpaired.close()
If you saved that as foo.py
and made it executable, then something like samtools view -f 0x4 somefile.bam | ./foo.py blah
will produce blah_1.fastq
, blah_2.fastq
and blah.unpaired.fastq
.
By the way I used: samtools view -f 0x0004 -h foo.bam | java -jar /opt/picard/SamToFastq.jar INPUT=/dev/stdin FASTQ=foo_unmapped_R1.fastq SECOND_END_FASTQ=foo_unmapped_R2.fastq But because I am specifying that my reads are paired-end and there are some reads with one mate mapped and the other one unmapped, it is giving me an exception. I will try your suggestion. Thanks.
Ah, yeah, my solution won't keep pairs as pairs, but will make everything single-end. If need pairs kept as pairs, then just throw together a little python/perl/whatever script (it can probably be done in awk, but it'd be easier not to).
I don't want the unpaired reads, just the paired ones. Update: I checked your script for one sample and it gave me the same number of lines in the R1 & R2 fastq files as SamToFastq. Which means that SamToFastq ignores unpaired reads and only writes paired-reads when SECOND_END_FASTQ is specified. Thanks a lot for the help!
Good to know, glad that worked for you.