Question

optimise htseq count performance by choosing proper samtools sort options

0

4.3 years ago

2822462298 ▴ 120

Hi all,

I am currently using samtools to sort my BAM files by position (the default), and then htseq to obtain read counts. Initially, I got a large number of 'Mate records missing' warnings. I then realized that htseq assumes the files are sorted by name, so I added the '-r pos' option and re-ran htseq. I now get fewer 'Mate records missing' warnings, but they are still there. So my questions are: 1. Is there a way to eliminate the warnings entirely? 2. Which of the following pipelines is better?

samtools sort by name + htseq without -r pos
samtools sort by position + htseq with -r pos
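
For reference, the two pipelines can be spelled out as explicit commands. This is only a sketch that builds and prints the command lines without running anything. The file names (`in.bam`, `genes.gtf`) and the `sample` prefix are placeholders, and the helper function names are made up for illustration. The flags used (`samtools sort -n`, `samtools sort -o`, `htseq-count -f bam -r name|pos`) are standard options of the two tools; `-r name` is htseq-count's default and is written out here only for clarity.

```python
import shlex


def name_sorted_pipeline(bam, gtf, prefix="sample"):
    """Pipeline 1: sort by read name, then count with -r name (the default)."""
    sorted_bam = f"{prefix}.name_sorted.bam"
    return [
        ["samtools", "sort", "-n", "-o", sorted_bam, bam],
        ["htseq-count", "-f", "bam", "-r", "name", sorted_bam, gtf],
    ]


def position_sorted_pipeline(bam, gtf, prefix="sample"):
    """Pipeline 2: sort by coordinate (samtools default), then count with -r pos."""
    sorted_bam = f"{prefix}.pos_sorted.bam"
    return [
        ["samtools", "sort", "-o", sorted_bam, bam],
        ["htseq-count", "-f", "bam", "-r", "pos", sorted_bam, gtf],
    ]


# Dry run: print the commands instead of executing them.
for label, pipeline in (
    ("by name", name_sorted_pipeline("in.bam", "genes.gtf")),
    ("by position", position_sorted_pipeline("in.bam", "genes.gtf")),
):
    print(label)
    for argv in pipeline:
        print("  " + shlex.join(argv))
```

To actually run a pipeline, each argument list could be passed to `subprocess.run(argv, check=True)`. htseq-count writes the counts to standard output, so that step would also need its output redirected to a file.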

I referred to the developer's posts: https://github.com/simon-anders/htseq/issues/37 but I still couldn't figure out how I should improve the process properly.

htseq RNA-Seq RNA rna-seq samtools • 2.0k views


2

As @Devon suggested in an earlier question, you should use featureCounts instead. It is much faster, can automatically sort files as needed, and will create an analysis-ready count matrix from the set of BAM files you provide, which makes downstream import easy.
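
A rough sketch of what such a call might look like, again only building and printing the command. The BAM and GTF file names, the output name and the helper function are placeholders for illustration. `-T`, `-a`, `-o` and `-p` are standard featureCounts options. `--countReadPairs` is an assumption about the installed version: it is needed in Subread 2.0.2 and later to count fragments rather than reads, while older releases counted fragments with `-p` alone.

```python
import shlex


def featurecounts_command(bams, gtf, out="counts.txt", paired=True, threads=4):
    """Build one featureCounts call covering all BAM files at once."""
    argv = ["featureCounts", "-T", str(threads), "-a", gtf, "-o", out]
    if paired:
        # Assumes Subread >= 2.0.2; drop --countReadPairs on older releases.
        argv += ["-p", "--countReadPairs"]
    # Every BAM file given here becomes one count column in the output table.
    return argv + list(bams)


print(shlex.join(featurecounts_command(["s1.bam", "s2.bam"], "genes.gtf")))
```

Passing all BAM files to a single call is what yields the combined count table, with one column per input file.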

4.3 years ago by GenoMax 141k

0

Thanks! In that case, I do not need to sort the BAM file using samtools, right?

4.3 years ago by 2822462298 ▴ 120

0

The BAM file still needs to be sorted, and AFAIK there are slightly different requirements for paired-end (fragment) and single-end (read) quantification. Basically, featureCounts will try to fix the mate pairs if it detects inconsistencies, but it's much slower than actual read counting, so it's best to make sure your files are sorted correctly. Samtools has options to fix unpaired mate reads or remove unpaired reads altogether.
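
To make the idea of an unpaired mate concrete, here is a small self-contained sketch that flags read names for which only one end of a pair is present. It works on plain `(read name, SAM flag)` tuples rather than a real BAM file. It is not how featureCounts, htseq or samtools do this internally; the function name and the toy records are made up, and only the SAM flag bits (0x1 paired, 0x40 first in pair, 0x80 second in pair, 0x100 secondary, 0x800 supplementary) come from the SAM specification.

```python
from collections import defaultdict

# SAM flag bits, as defined in the SAM specification.
PAIRED, READ1, READ2 = 0x1, 0x40, 0x80
SECONDARY, SUPPLEMENTARY = 0x100, 0x800


def find_orphans(records):
    """Return sorted read names whose primary records cover only one end of the pair.

    records: iterable of (qname, flag) tuples.
    """
    ends_seen = defaultdict(set)
    for qname, flag in records:
        if not flag & PAIRED:
            continue  # single-end read, no mate expected
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue  # only primary alignments are considered here
        if flag & READ1:
            ends_seen[qname].add(1)
        if flag & READ2:
            ends_seen[qname].add(2)
    return sorted(q for q, ends in ends_seen.items() if ends != {1, 2})


toy_records = [
    ("readA", PAIRED | READ1),
    ("readA", PAIRED | READ2),
    ("readB", PAIRED | READ1),  # mate record is absent
    ("readC", 0),               # single-end read
]
print(find_orphans(toy_records))
```

With real data, the same tuples could be collected from a BAM file with a library such as pysam, which is not used here so that the example stays dependency-free.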

4.3 years ago by predeus ★ 1.9k

0

Thanks! Actually, I have tried name-sorted, position-sorted, and unsorted BAM files with featureCounts. The outputs were pretty much the same, with minor differences.

4.3 years ago by 2822462298 ▴ 120