Question: BAM to FASTQ: Picard or samtools?
20 months ago, anoops wrote:

Hello,

I am trying to convert a batch of BAM files to FASTQ. I started out by testing samtools (collate/bam2fq) and Picard (SamToFastq). At first the numbers seemed OK, but the statistics suggest that the samtools output has twice as many duplicates as the Picard output.

Has anyone experienced this? I am not sure if it is a samtools problem or I am not comprehending the QC stats.

Any advice/recommendation/comments are welcome.

Thanks!

PS: In both cases I am writing the first end and the second end of each pair to separate files.

UPDATED: The commands used were:

Samtools

samtools collate -o name-collate.bam sample.bam
samtools fastq -1 sample_1.fastq.gz -2 sample_2.fastq.gz -0 sample_0.fastq.gz name-collate.bam

Picard

java -Xmx2g -jar picard.jar SamToFastq I=sample.bam FASTQ=sample_1p.fastq.gz SECOND_END_FASTQ=sample_2p.fastq.gz UNPAIRED_FASTQ=sample_0p.fastq.gz

FastQC check

fastqc -o fastqc_out/ sample_1p.fastq.gz

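Picard QC

[FastQC screenshot for the Picard output; image not preserved]

Samtools QC

[FastQC screenshot for the samtools output; image not preserved]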


It would be a big help if you could provide the command lines used

written 20 months ago by swbarnes2

I didn't include them because they were default. They are included now. Thanks in advance!!!

written 20 months ago by anoops

FYI, you do not need collate. A simple sort by name with the -n option of samtools sort will "restore" the read order as it was obtained from the sequencer, so pretty much random. This you can directly pipe into samtools fastq:

samtools sort -n in.bam | samtools fastq -1 sample_1.fastq.gz -2 sample_2.fastq.gz -0 sample_0.fastq.gz -
written 20 months ago by ATpoint

Good tip, thanks ATpoint

written 20 months ago by anoops

Actually, collate is faster than samtools sort, and works fine for your purpose. From man samtools:

A faster alternative to a full query name sort, collate ensures that reads of the same name are grouped together in contiguous groups, but doesn't make any guarantees about the order of read names between groups.

modified 20 months ago • written 20 months ago by h.mon

Hello anoops!

It appears that your post has been cross-posted to another site: http://seqanswers.com/forums/showthread.php?t=83934

This is typically not recommended as it runs the risk of annoying people in both communities.

written 20 months ago by Pierre Lindenbaum

Sorry, did not realize. Will keep in mind.

written 20 months ago by anoops
20 months ago, h.mon (Brazil) wrote:

By default, Picard does not output non-primary alignments, and samtools does. The secondary alignments that samtools fastq outputs should have two effects: an increased duplication rate, as you noticed, and a larger number of reads. Can you confirm this?

Picard's behavior is probably what you want. If you read the samtools manual carefully, you will see how to avoid outputting non-primary alignments.
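As a rough illustration of what "non-primary" means at the FLAG level, the sketch below filters SAM text with plain awk, dropping records that have the secondary (0x100) or supplementary (0x800) bit set. The function name `primary_only` is made up for this example. In practice the `-F` option of samtools (for instance `-F 0x900`) applies the same kind of exclusion, so treat this only as a way to see the logic.

```shell
# Keep only primary alignments from SAM text on stdin (hypothetical helper).
# FLAG is column 2; bit 0x100 (256) marks secondary, 0x800 (2048) supplementary.
primary_only() {
    awk -F '\t' '
        /^@/ { print; next }    # pass header lines through unchanged
        int($2 / 256) % 2 == 0 && int($2 / 2048) % 2 == 0
    '
}
```

It could be used as `samtools view -h in.bam | primary_only`, assuming samtools is on the PATH.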


Thank you h.mon. I see that the collate routine has the option to output primary alignments only. It seems like Picard is preferable for this purpose.

Actually, the read count is what made me suspicious: both tools output exactly the same number of reads according to FastQC ("Total Sequences: 49148031" in this particular case). So the higher duplication in the samtools output made me doubt the results.
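The read count can also be checked independently of FastQC. A minimal sketch, assuming gzip-compressed FASTQ with the usual four lines per record (no wrapped sequences); the function name `count_fastq_reads` is made up for this example:

```shell
# Count records in a gzip-compressed FASTQ file (hypothetical helper).
# Assumes exactly 4 lines per record.
count_fastq_reads() {
    gzip -cd "$1" | awk 'END { print NR / 4 }'
}
```

Running `count_fastq_reads sample_1.fastq.gz` and `count_fastq_reads sample_1p.fastq.gz` should print the same number if both tools emitted the same set of first-end reads.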

modified 20 months ago • written 20 months ago by anoops

Then I guess this is just an artifact: samtools collate changes the order of the reads, and the way FastQC measures duplication depends on that order:

To cut down on the memory requirements for this module only sequences which first appear in the first 100,000 sequences in each file are analysed
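To see why read order matters under such a rule, here is a toy estimator in awk. It is not FastQC's actual implementation, only a sketch of the idea in the quoted sentence: a sequence is tracked only if it first appears within the first LIMIT reads, and duplication is computed over tracked sequences alone. The function name `dup_estimate` and its output format are assumptions.

```shell
# Toy duplication estimate (NOT FastQC's real algorithm; illustration only).
# Usage: dup_estimate LIMIT < reads.fastq
# Prints the percentage of tracked reads that are duplicates.
dup_estimate() {
    awk -v limit="$1" '
        NR % 4 == 2 {                                   # sequence line of each record
            n++
            if ($0 in count) { count[$0]++ }            # already tracked: count it
            else if (n <= limit) { count[$0] = 1 }      # new, and still within the window
        }
        END {
            for (s in count) { total += count[s]; distinct++ }
            if (total > 0) printf "%.1f\n", 100 * (total - distinct) / total
        }
    '
}
```

With a small LIMIT, the same set of reads can give different percentages depending on whether the repeated sequences happen to fall inside the initial window, which is the effect described above.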

You can sort the fastq files and repeat the FastQC analysis.
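One way to sort a FASTQ file by read name with standard tools is sketched here. It assumes four lines per record and no tab characters inside the records; the function name `sort_fastq_by_name` is made up for this example.

```shell
# Sort a 4-line-per-record FASTQ stream by read name (hypothetical helper).
sort_fastq_by_name() {
    # Join each record onto one tab-separated line, sort on the header field,
    # then split the fields back into separate lines.
    paste - - - - | LC_ALL=C sort -t "$(printf '\t')" -k1,1 | tr '\t' '\n'
}
```

For example: `gzip -cd sample_1.fastq.gz | sort_fastq_by_name | gzip > sample_1.sorted.fastq.gz`. Sorting the samtools and Picard outputs the same way puts the reads in the same order before FastQC is rerun.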

written 20 months ago by h.mon

That makes more sense now, I will try the sorting. Thanks!

written 20 months ago by anoops