Why do I lose sequences when converting my bam files into fastq?
2
1
Entering edit mode
2.9 years ago
DNAngel ▴ 250

Hi all,

I used bamToFastq to convert my bamfiles into R1 and R2 fastq files. I tried samtools fastq before but found it was erroneous (so many missing sequences). However, I still have an issue with bamTofastq from bedtools. When I add the read counts obtained from grep -c "@" R1.fastq and grep -c "@" R2.fastq, it is always slightly less than the count from the bamfile samtools view -c in.bamfile. Why might this be the case? I haven't found any documentation to suggest that this is a normal thing. The R1 and R2 fastq counts should equal the counts in the bamfile so what am I doing wrong with the conversion??

Thank you.

fastq bam bamToFastq • 1.7k views
ADD COMMENT
3
Entering edit mode

@ is a valid quality encoding character in FASTQ QUAL lines, so you may want to use the simpler line_count / 4 formula to count number of reads in the FASTQ.

ADD REPLY
0
Entering edit mode

Or simply add whatever is next to @ in your grep. e.g. @A00500 (generally sequencer serial). That will only count line 1 from each fastq record.

ADD REPLY
0
Entering edit mode

Yea I tried both ways, same thing. Just less reads in my fastq files than in my bamfiles.

ADD REPLY
2
Entering edit mode
2.9 years ago

it is always slightly less than the count from the bamfile samtools view -c in.bamfile. Why might this be the case?

you bam contains supplementary, secondary alignments.

ADD COMMENT
0
Entering edit mode
ADD REPLY
0
Entering edit mode

Is this something that I can safely ignore then? I used sambamba to get my bamfiles because samtools was just too unbelievably slow.

ADD REPLY
0
Entering edit mode

Yes you are right. There are secondary alignments present. How does bamToFastq know then to not convert these reads into fastq reads??

ADD REPLY
1
Entering edit mode
2.9 years ago

Are you absolutely sure that every single read in the bam exists in one and only one line? That you have zero supplemental or secondary reads?

ADD COMMENT
0
Entering edit mode

Agreed with this response. I have encountered this in my work; e.g. after STAR alignment, my bam file contains multimapping (e.g. a read that aligns to two locations could appear twice or more in the bam file).

Why don't you try samtools view in.bamfile | cut -f1 | sort -u | wc -l to get a count of the number of unique read names in your bam file?

ADD REPLY
0
Entering edit mode

Tried this but it cuts out like half of the sequences. I used sambamba to obtain my bamfiles since it is faster than samtools. For the mapped reads I used sambamba view -f bam -F 'not (unmapped or mate_is_unmapped)' in.bam and then the opposite for unmapped reads (just took the not out). samtools view -c in.bam for example gives 436840 reads, but your command gives 218264.

ADD REPLY
0
Entering edit mode

I checked, I do have secondary alignments. So I am now curious to know how bamToFastq works. When I check for my reads and exclude secondary alignments in my bamfile, the number makes sense. It matches the number of R1 and R2 fastq reads. How does bamToFastq ignore secondary alignments during conversion? I just want to be sure it is okay and I can continue with my fastq files confident that no secondary alignments were included.

ADD REPLY
0
Entering edit mode

It's likely ignoring all reads which are flagged as secondary and supplemental. This wouldn't work right if you messed around with your bam and removed primary alignments. If you really want to be sure, grep out and count the # of unique read names.

ADD REPLY

Login before adding your answer.

Traffic: 1832 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6