Collect Read Pairs Where At Least One Read Is Mapped
5
2
Entering edit mode
10.8 years ago
fridhackery ▴ 170

I might word my initial question like another post, but I really have the opposite meaning, i think: Filtering multiple flags with SAMtools

I am trying to remove paired-end reads from a .SAM file where neither segment is mapped

But by "remove" I don't mean collect. I want all the read pairs where the forward read OR the reverse read OR both reads are mapped then I will use bam2fastq to get the reads and assemble.

I think there are pieces missing from my reference. I will use all these reads to try to assemble a better reference. So if one read maps to the reference, but its pair does not, that is a good read for me; the reverse read is possibly part of the sequence that is missing from my reference.

What SAM flags should I set?

samtools sam paired-end • 8.8k views
ADD COMMENT
2
Entering edit mode
10.8 years ago
fridhackery ▴ 170

matted almost got it. This is what worked:

# exactly one end mapped
samtools view -bh -F 4 -f 8 in.bam > out1.bam
samtools view -bh -F 8 -f 4 in.bam > out2.bam

# exactly two ends mapped
samtools view -bh -F 12 in.bam > out3.bam

then merge and sort -n

Thank you all!

ADD COMMENT
1
Entering edit mode
10.8 years ago

I wrote a java tool filtering SAM/BAm file using javascript (rhino). It's available at: https://github.com/lindenb/jvarkit#samjs-filtering-a-sambam-file-with-javascript )

To get "the reads unmapped or the mate unmapped" you can filter the BAM with:

java -jar dist/samjs.jar I=input.bam \
    VALIDATION_STRINGENCY=SILENT \
    SCRIPT_EXPRESSION="record.readUnmappedFlag || record.mateUnmappedFlag;" \
    OUT=result.bam

and then get back the FASTQs using picard/samToFastq.jar

To get "exactly one end or two ends mapped" you can filter the BAM with:

java -jar dist/samjs.jar I=input.bam \
    VALIDATION_STRINGENCY=SILENT \
    SCRIPT_EXPRESSION="!(record.readUnmappedFlag && record.mateUnmappedFlag);" \
    OUT=result.bam
ADD COMMENT
0
Entering edit mode
10.8 years ago

This command will remove the unmapped reads and will output all the mapped reads into a new.bam file.

samtools view -bh -F 4 original.bam -> new.bam

-F option will skip the alignment that match a given flag and -f option will include those alignments.

ADD COMMENT
0
Entering edit mode

This won't do what he wants, since he wants to convert the bam back to fastq (with both end of a read pair present) for use in assembly. This will make a bam with many reads with missing mates.

ADD REPLY
0
Entering edit mode

I don't think so. That is what I had tried first, and again upon your suggestion. The bam2fastq output says: [samopen] SAM header is present: 7 sequences. This looks like paired data from lane 198. Output will be in IMmitoJF4_1.fq and IMmitoJF4_2.fq 15962489 sequences in the BAM file 15962489 sequences exported WARNING: 8855375 reads could not be matched to a mate and were not exported

However, I expect every read in the new.bam file to be paired. Since, all pairs in, then select every pair where at least one read is mapped, means all pairs out.

I did get all pairs out when I followed Filtering multiple flags with SAMtools, but they were the unmapped pairs.

ADD REPLY
0
Entering edit mode
10.8 years ago
matted 7.8k

Since it sounds like you want both reads of a pair that meet your conditions, I think you'll have to break it into several steps:

# exactly one end mapped
samtools view -b -F 4 -f 8 in.bam > out1.bam

# exactly two ends mapped
samtools view -b -F 12 in.bam > out2.bam

You can convert out1.bam and out2.bam to paired-end fastq files for use in assembly. Each read present in these bams should also have its mate present in the bam.

See http://picard.sourceforge.net/explain-flags.html for an easy way to play with flag settings. You may need to also filter out something like -F 1792 (no secondary alignments, failed reads, or PCR duplicates), depending on what flags the aligner set in generating in.bam.

ADD COMMENT
0
Entering edit mode

I tried this approach but there are still reads that are missing mates.

bam2fastq output: This looks like paired data from lane 198. Output will be in ALLmitoKK_1.fq and ALLmitoKK_2.fq 15962489 sequences in the BAM file 15962489 sequences exported WARNING: 8855375 reads could not be matched to a mate and were not exported

ADD REPLY
1
Entering edit mode

Try the solution provided by Pierre. I think it should work.

ADD REPLY
0
Entering edit mode

Which out bam did you use here? It's suspicious that the error message / numbers you report are identical to the ones you gave for the -F 4 case above.

What aligner made the bam? Are the flags set correctly in the bam? Does bam2fastq expect a certain sort order that you're not giving it?

ADD REPLY
0
Entering edit mode
5.2 years ago
ying.kev • 0

sambamba has a powerful filtering syntax that allows it to be done in one step

https://github.com/biod/sambamba/wiki/%5Bsambamba-view%5D-Filter-expression-syntax

sambamba view -f bam -F "not (unmapped and mate_is_unmapped)" ${bam} > ${output.bam}
ADD COMMENT

Login before adding your answer.

Traffic: 3223 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6