Extract reads from a bam file to have two fastq files with the same number of reads
1
0
Entering edit mode
5.3 years ago

Hi! I'm very new to the field of bioinformatics, in advance, sorry for the mistakes.

I'm working with a bam file called zm.trim.sorted.cp.bam and I ran:

samtools flagstat -@ 3 zm.trim.sorted.cp.bam

I got this:

5604169 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
5599573 + 0 mapped (99.92% : N/A)  
5604169 + 0 paired in sequencing
2804473 + 0 read1
2799696 + 0 read2
5256174 + 0 properly paired (93.79% : N/A)
5594977 + 0 with itself and mate mapped
4596 + 0 singletons (0.08% : N/A)
239951 + 0 with mate mapped to a different chr → ?
10623 + 0 with mate mapped to a different chr (mapQ>=5)

I want to extract all the reads from this bam file so I used:

samtools fastq -1 all.cp.reads_1.fq -2 all.cp.reads_2.fq zm.trim.sorted.cp.bam

I obtained two fastq files: all.cp.reads_1.fq and all.cp.reads_2.fq, and like the flagstat report said, there are 2804473 reads in all.cp.reads_1.fq and 2799696 in all.cp.reads_2.fq.

There are 4777 more reads in all.cp.reads_1.fq than in all.cp.reads_2.fq. I wanted to know what are those reads and how to eliminate them. I want to have the same number of reads in my two fastq files (each read with its mate).

Thank you in advance for your help :)

alignment genome sequencing assembly • 1.1k views
ADD COMMENT
1
Entering edit mode
5.3 years ago
GenoMax 147k

You can use repair.sh from BBMap suite to bring the two files in sync. You will find a guide here.

Minimally you can do (remove .gz if your files are not compressed) :

$ repair.sh in1=broken1.fq.gz in2=broken2.fq.gz out1=fixed1.fq.gz out2=fixed2.fq.gz outs=singletons.fq repair
ADD COMMENT
0
Entering edit mode

It worked great!!! Thank you so much genomax!!!

But what are these reads? According to my flagstat, all my reads should be paired. Why are these reads without its mates?

ADD REPLY
1
Entering edit mode
4596 + 0 singletons (0.08% : N/A)

Some are explained by those.

ADD REPLY

Login before adding your answer.

Traffic: 826 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6