One of the situation when bam to fastq conversion might be required:
Say, I have a problem where I want to take the reads that are aligned to chromosome 2 of one species. And, I want to see how much (fraction) of these same reads align to chr 2 of another species or population and with how much mismatches.
I used several method but wasn't able to get the number of reads in
fastq files as indicated in
flagstat statistics output. This bitwise method seems to be more appropriate, if people want to extract the exact amount of reads designated by
samtools flagstat; but if others have different opinion please suggest.
Step 01: Select reads from Chromosome 2 $ samtools view -h -b realigned_2ms04h.bam 2 > chr2.2ms04h.bam # sort the bam alignment $ samtools sort -n chr2.2ms04h.bam chr2.sorted.2ms04h.bam # fix mate pairs/info $ samtools fixmate chr2.sorted.2ms04h.bam chr2.fixmate.2ms04h.bam # Let's check the simple statistics of the bam file $ samtools flagstat chr2.fixmate.2ms04h.bam > chr2.fixmate.stats # Stats output from "chr2.fixmate.stats" file: 7549722 + 0 in total (QC-passed reads + QC-failed reads) 1421823 + 0 duplicates 7549722 + 0 mapped (100.00%:-nan%) 7549722 + 0 paired in sequencing 3771662 + 0 read1 3778060 + 0 read2 6936646 + 0 properly paired (91.88%:-nan%) 6937102 + 0 with itself and mate mapped 612620 + 0 singletons (8.11%:-nan%) 0 + 0 with mate mapped to a different chr 0 + 0 with mate mapped to a different chr (mapQ>=5) Step 02: Convert the mate-fixed bam to fastq using one of the below methods # A) using bam2fastq $ bamToFastq -i chr2.fixmate.2ms04h.bam -fq chr2.2ms04h.R1 -fq2 chr2.2ms04h.R2 # B) using Picard $ java -jar picard2.9.jar SamToFastq I=chr2.fixmate.2ms04h.bam FASTQ=R1.fq SECOND_END_FASTQ = R2.fq UNPAIRED_FASTQ = unPaired.fq
bam2fastq and picard didn't work for me.
Step 02 - C
If, you want to select only the properly paired reads use `bitwise flag -f 2`. if you want to remove duplicated reads use `bitwise flag -F 1024` # 2-C (optional extra step) : Select properly paired (bitwise flag: -f 2) and remove duplicates (flag: -F 1024) $ samtools view -b -f 2 -F 1024 chr2.fixmate.2ms04h.bam > chr2.Paired_DeDup.bam $ samtools flagstat chr2.Paired_DeDup.bam > chr2.Paired_DeDup.stats # Stats output from "chr2.Paired_DeDup.stats" file: 5629430 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 duplicates 5629430 + 0 mapped (100.00%:-nan%) 5629430 + 0 paired in sequencing 3061503 + 0 read1 2567927 + 0 read2 5629430 + 0 properly paired (100.00%:-nan%) 5629430 + 0 with itself and mate mapped 0 + 0 singletons (0.00%:-nan%) 0 + 0 with mate mapped to a different chr 0 + 0 with mate mapped to a different chr (mapQ>=5) # 2-C: Now, convert the bam files into Read1/2 matepairs, using bitwise flag $ samtools view -uf64 chr2.Paired_DeDup.bam |samtools bam2fq - |gzip > chr2.2ms04h_R1.fq.gz $ samtools view -uf128 chr2.Paired_DeDup.bam |samtools bam2fq - |gzip > chr2.2ms04h_R2.fq.gz # Unzip the files and count the number of reads in R1, R2 read pairs # Count $ cat chr2.2ms04h_R1.fq | echo $((`wc -l`/4)) 3061503 $ cat chr2.2ms04h_R2.fq | echo $((`wc -l`/4)) 2567927
- You can see that the number of reads reported by
samtools flagstatand reads (R1 and R2) obtained using
bitwise methodare the same.
- If you didn't remove the the duplicates, singletons reads, those can also be extracted using bitwise flag and will be equal to the numbers reported in
samtools flagstatfile chr2.fixmate.stats.