Differing number of reads for read 1 and 2 in fastq's from a subset bam file
0
0
Entering edit mode
6 weeks ago
michael • 0

I'm working with paired-end wgs data I downloaded from TCGA.

I'm trying to extract the reads that align to a specific region, extract only those reads to two fastq files, one for each pair. Unfortunately, I am getting a different number of reads in both fastq files because some of the reads in file.1.fq are not in file.2.fq.

Based on the answer to this question (Why number of #read1 and #read2 is different in samtools flagstat output?) I've tried subsetting my bam file as follows:

samtools view -bh -F 2048 TCGA_bam.bam chr1:200500500-202500000 chr2:200500500-202500000 > subset_bam.bam

when running samtools flagstat subset_bam.bam on the output file, the number of reads for read1 and read2 do not match. However, I've managed to make them match by 1) sorting by read name and 2) running samtools fixmate.

I've also tried different flags in samtools view including

  • -f 3 -F 3080
  • -f 2 -F 3080
  • -f 2 -F 3080
  • -F 3072

(there may be a few others). However, in each case, after converting subset_bam.bam with samtools fastq the number of reads in the fastq's derived from subset_bam.bam do not match. For example:

$ echo $(cat subset.1.fastq | wc -l)/4 | bc
20057
$ echo $(cat subset.2.fastq | wc -l)/4 | bc
19991

I've also extracted the reads from the original bam file into two fastq's. However, at this stage they have the same number of reads:

$ echo $(cat full_file.1.fq | wc -l)/4 | bc
223500494
$ echo $(cat full_file.2.fq | wc -l)/4 | bc
223500494

So this does not appear to be an issue with the input data, but rather is related to subsetting.

How can I extract reads that fall into a specific region, convert to fastq and ensure I have the same number of reads in each fastq?

bam samtools • 398 views
ADD COMMENT
1
Entering edit mode

you should use -F 2304 , not just -F 2048 , to remove secondary alignments. https://broadinstitute.github.io/picard/explain-flags.html

ADD REPLY
0
Entering edit mode

Hi Pierre, thanks for the suggestion. I've just tried it again with -F 2304 and unfortunately I'm getting the same issue: 30205 for my first fq file and 30071 for my second file. Any idea what is going on?

ADD REPLY

Login before adding your answer.

Traffic: 1048 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6