Question: Samtools fastq returns less unmapped than flagstat
gravatar for ciemanek
11 weeks ago by
The Netherlands/Amsterdam
ciemanek130 wrote:

I am trying to include filtering reads in my pipeline. I do this by aligning the reads to a reference sequence with bwa mem (afterwards I run modifying headers with sed and sorting reads by name with samtools sort -n as advised by samtools manual) and outputting the reads that didn't map with samtools fastq. When I use simple test data, there shouldn't me any reads matching reference and that's what I observe with samtools flagstat:

5000 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 mapped (0.00% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

However while I run samtools fastq I only get 2500 sequences. I checked if it's not the case that the headers are not unique and maybe samtools is collapsing them it doesn't seem so. Data was generated with InSilicoSeq.

The command I ran on this dataset are as follows:

bwa mem -C -t 4 ref.idx input.fq | sed -r "s/^([^@].*)\t([^\t]*)$/\\1\tZI:Z:\\2/" | samtools sort -n | tee >(samtools flagstat --threads 4 - > output.stats) >(samtools fastq -f 4 -T ZI --threads 4 - > output_filtered.fq) >> /dev/null

I wonder if for some reason samtools is missinterpreting flags or am I missing something.

ADD COMMENTlink modified 8 weeks ago by Biostar ♦♦ 20 • written 11 weeks ago by ciemanek130

+1 for use of tee and process substitution :)

This 2500 sequences, how did you calculate this? Was this printed from samtools fastq to stderr?

ADD REPLYlink written 11 weeks ago by ATpoint26k

Thanks :) I just grepped my filtered fastq for the header (@plasmid) and checked number of printed lines with wc -l, which returned 2500 read headers (which makes sense since output file is 10 000 lines, considering a single read being represented by 4 lines in fastq).

ADD REPLYlink written 11 weeks ago by ciemanek130

Odd, I did quickly run your command on a dummy file and I get the same number of seqs in fq as in the original bam.

ADD REPLYlink written 11 weeks ago by ATpoint26k

Did you also run sorting by name? Apparently, when I run the command without -n option for samtools sort, I get all the reads I should get.

ADD REPLYlink written 11 weeks ago by ciemanek130

I also tried this approach on paired-end data with appropriate flags (samtools fastq -f 13 -T ZI --threads 4 -1 output_filtered_R1.fq -2 output_filtered_R2.fq) and it returns exactly the number of reads it should.

ADD REPLYlink written 11 weeks ago by ciemanek130

What happens if you omit the sed command? Why are you modifying the headers like that anyway? Why not just name the references what you want them to be named? Or if you have to change the headers, extract them with samtools view -H and modify them then add them back to the rest of the .sam?

ADD REPLYlink modified 11 weeks ago • written 11 weeks ago by swbarnes27.0k

Removing sed from the pipeline doesn't change the outcome - only when I remove -n argument from samtools sort I get the number of reads I'm supposed to according to samtools flagstat

ADD REPLYlink written 11 weeks ago by ciemanek130

Did you find out what happened?

ADD REPLYlink written 8 weeks ago by ATpoint26k

Yes - the problem was the dummy data I was generating, the reads headers specifically - it seems that for R1/R2 reads this information was getting lost in BAM file while sorting by name and as a result, twice fewer reads were returned.

ADD REPLYlink written 8 weeks ago by ciemanek130
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1158 users visited in the last hour