BAM to FASTQ picard or samtools
1
3
Entering edit mode
5.7 years ago
anoops ▴ 40

Hello,

I am trying to convert a batch of BAM files to FASTQs. I started out testing SAMTOOLS (collate/bam2fq) and PICARD (SAMTOFATQ). On the outset the numbers seemed OK but the statistics suggests that the SAMTOOLS out has twice the amount of duplicates as the Picard out.

Has anyone experienced this? I am not sure if it is a samtools problem or I am not comprehending the QC stats.

Any advice/recommendation/comments are welcome.

Thanks!

PS: In both cases I am outputting both first end of the pair and the second end of the pair as separate files.

UPDATED: The commands used were:

Samtools

samtools collate -o name-collate.bam sample.bam
samtools fastq -1 sample_1.fastq.gz -2 sample_2.fastq.gz -0 sample_0.fastq.gz name-collate.bam

Picard

java -Xmx2g -jar picard.jar SamToFastq I=sample.bam FASTQ=sample_1p.fastq.gz SECOND_END_FASTQ=sample_2p.fastq.gz UNPAIRED_FASTQ=sample_0p.fastq.gz

Fasqc check

fastqc -o fastqc_out/ sample_1p.fastq.gz

Picard QC

Picard

Samtools QC

Samtools

next-gen sequencing Assembly • 10.0k views
ADD COMMENT
1
Entering edit mode

It would be a big help if you could provide the command lines used

ADD REPLY
0
Entering edit mode

I didn't include them because they were default. They are included now. Thanks in advance!!!

ADD REPLY
1
Entering edit mode

FYI, you do not need collate. A simple sort by name with the -n option of samtools sort will "restore" the read order as it was obtained from the sequencer, so pretty much random. This you can directly pipe into samtools fastq:

samtools sort -n in.bam | samtools fastq -1 sample_1.fastq.gz -2 sample_2.fastq.gz -0 sample_0.fastq.gz -
ADD REPLY
0
Entering edit mode

Good tip, thanks ATpoint

ADD REPLY
2
Entering edit mode

Actually, collate is faster than samtools sort, and works fine for your purpose. From man samtools:

A faster alternative to a full query name sort, collate ensures that reads of the same name are grouped together in contiguous groups, but doesn't make any guarantees about the order of read names between groups.

ADD REPLY
0
Entering edit mode

Hello anoops!

It appears that your post has been cross-posted to another site: http://seqanswers.com/forums/showthread.php?t=83934

This is typically not recommended as it runs the risk of annoying people in both communities.

ADD REPLY
0
Entering edit mode

Sorry, did not realize. Will keep in mind.

ADD REPLY
5
Entering edit mode
5.7 years ago
h.mon 35k

By default, picard don't output non-primary alignments, and samtools does. These secondary alignments which samtools fastq outputs should have two effects: an increase in duplication rate, as you noticed, and a larger number of reads - can you confirm this?

Probably Picard behavior is what you want. If you read the samtools manual carefully, you will see how to avoid outputting non-primary alignments.

ADD COMMENT
2
Entering edit mode

I think that, nowadays, samtools fastq by default removes non-primary alignments. Default is

-F INT
Do not output alignments with any bits set in INT present in the FLAG field. INT can be specified in hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/) or in octal by beginning with `0' (i.e. /^0[0-7]+/) [0x900]. This defaults to 0x900 representing filtering of secondary and supplementary alignments.

http://www.htslib.org/doc/samtools-fasta.html

ADD REPLY
0
Entering edit mode

Thank you h.mon. I see that the collate routine has the option to output primary alignments only. It seems like Picard is preferable for this purpose.

Actually the read count is what triggered the problem, they both output the exact same number according to Fastqc. "Total Sequences : 49148031" in this particular case. So the higher duplication in samtools made me doubt the results.

ADD REPLY
1
Entering edit mode

Then I guess this is just an artifact, because after samtools collate the order of the reads has been changed and due to how FastQC measures duplication:

To cut down on the memory requirements for this module only sequences which first appear in the first 100,000 sequences in each file are analysed

You can sort the fastq files and repeat the FastQC analysis.

ADD REPLY
0
Entering edit mode

That makes more sense now, I will try the sorting. Thanks!

ADD REPLY

Login before adding your answer.

Traffic: 1868 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6