Question

BAM to FASTQ picard or samtools

3

Entering edit mode

5.7 years ago

anoops ▴ 40

Hello,

I am trying to convert a batch of BAM files to FASTQs. I started out testing SAMTOOLS (collate/bam2fq) and PICARD (SAMTOFATQ). On the outset the numbers seemed OK but the statistics suggests that the SAMTOOLS out has twice the amount of duplicates as the Picard out.

Has anyone experienced this? I am not sure if it is a samtools problem or I am not comprehending the QC stats.

Any advice/recommendation/comments are welcome.

Thanks!

PS: In both cases I am outputting both first end of the pair and the second end of the pair as separate files.

UPDATED: The commands used were:

Samtools

samtools collate -o name-collate.bam sample.bam
samtools fastq -1 sample_1.fastq.gz -2 sample_2.fastq.gz -0 sample_0.fastq.gz name-collate.bam

Picard

java -Xmx2g -jar picard.jar SamToFastq I=sample.bam FASTQ=sample_1p.fastq.gz SECOND_END_FASTQ=sample_2p.fastq.gz UNPAIRED_FASTQ=sample_0p.fastq.gz

Fasqc check

fastqc -o fastqc_out/ sample_1p.fastq.gz

Picard QC

Picard

Samtools QC

Samtools

next-gen sequencing Assembly • 10.0k views

ADD COMMENT • link updated 5.7 years ago by h.mon 35k • written 5.7 years ago by anoops ▴ 40

1

Entering edit mode

It would be a big help if you could provide the command lines used

ADD REPLY • link 5.7 years ago by swbarnes2 14k

0

Entering edit mode

I didn't include them because they were default. They are included now. Thanks in advance!!!

ADD REPLY • link 5.7 years ago by anoops ▴ 40

1

Entering edit mode

FYI, you do not need collate. A simple sort by name with the -n option of samtools sort will "restore" the read order as it was obtained from the sequencer, so pretty much random. This you can directly pipe into samtools fastq:

samtools sort -n in.bam | samtools fastq -1 sample_1.fastq.gz -2 sample_2.fastq.gz -0 sample_0.fastq.gz -

ADD REPLY • link 5.7 years ago by ATpoint 81k

0

Entering edit mode

Good tip, thanks ATpoint

ADD REPLY • link 5.7 years ago by anoops ▴ 40

2

Entering edit mode

Actually, collate is faster than samtools sort, and works fine for your purpose. From man samtools:

A faster alternative to a full query name sort, collate ensures that reads of the same name are grouped together in contiguous groups, but doesn't make any guarantees about the order of read names between groups.

ADD REPLY • link 5.7 years ago by h.mon 35k

0

Entering edit mode

Hello anoops!

It appears that your post has been cross-posted to another site: http://seqanswers.com/forums/showthread.php?t=83934

This is typically not recommended as it runs the risk of annoying people in both communities.

ADD REPLY • link 5.7 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Sorry, did not realize. Will keep in mind.

ADD REPLY • link 5.7 years ago by anoops ▴ 40

score 5 · Accepted Answer · 2018-08-09

5

Entering edit mode

5.7 years ago

h.mon 35k

By default, picard don't output non-primary alignments, and samtools does. These secondary alignments which samtools fastq outputs should have two effects: an increase in duplication rate, as you noticed, and a larger number of reads - can you confirm this?

Probably Picard behavior is what you want. If you read the samtools manual carefully, you will see how to avoid outputting non-primary alignments.

ADD COMMENT • link 5.7 years ago by h.mon 35k

2

Entering edit mode

I think that, nowadays, samtools fastq by default removes non-primary alignments. Default is

-F INT
Do not output alignments with any bits set in INT present in the FLAG field. INT can be specified in hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/) or in octal by beginning with `0' (i.e. /^0[0-7]+/) [0x900]. This defaults to 0x900 representing filtering of secondary and supplementary alignments.

http://www.htslib.org/doc/samtools-fasta.html

ADD REPLY • link 3.2 years ago by WouterDeCoster 47k

0

Entering edit mode

Thank you h.mon. I see that the collate routine has the option to output primary alignments only. It seems like Picard is preferable for this purpose.

Actually the read count is what triggered the problem, they both output the exact same number according to Fastqc. "Total Sequences : 49148031" in this particular case. So the higher duplication in samtools made me doubt the results.

ADD REPLY • link 5.7 years ago by anoops ▴ 40

1

Entering edit mode

Then I guess this is just an artifact, because after samtools collate the order of the reads has been changed and due to how FastQC measures duplication:

To cut down on the memory requirements for this module only sequences which first appear in the first 100,000 sequences in each file are analysed

You can sort the fastq files and repeat the FastQC analysis.

ADD REPLY • link 5.7 years ago by h.mon 35k

0

Entering edit mode

That makes more sense now, I will try the sorting. Thanks!

ADD REPLY • link 5.7 years ago by anoops ▴ 40