Optimise htseq-count performance by choosing proper samtools sort options
17 months ago
2822462298 ▴ 60

Hi all,

I am currently using samtools to sort my BAM files by position (the default) and then running htseq-count to obtain read counts. Initially, I got a massive number of 'Mate records missing' warnings. I then realized that htseq-count assumes the files are sorted by name, so I added the '-r pos' option and re-ran htseq-count. I now get fewer 'Mate records missing' warnings, but they are still there. So my questions are: 1. Is there a way to eliminate the warnings completely? 2. Which of the following pipelines is better?

1. samtools sort by name + htseq-count without -r pos
2. samtools sort by position + htseq-count with -r pos

I referred to the developer's posts: https://github.com/simon-anders/htseq/issues/37 but I still couldn't figure out how I should improve the process properly.
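
For reference, the two pipelines above can be written out as explicit commands. The following is only a minimal sketch in Python that builds the argument lists without running anything; the file names `sample.bam` and `genes.gtf` and the helper `pipeline` are hypothetical placeholders, and the flags used are `samtools sort -n`/`-o` and `htseq-count -f`/`-r`.

```python
import shlex

def pipeline(order, bam="sample.bam", gtf="genes.gtf"):
    """Return [sort command, count command] for sort order 'name' or 'pos'.

    File names are hypothetical placeholders for illustration.
    """
    if order not in ("name", "pos"):
        raise ValueError("order must be 'name' or 'pos'")
    sorted_bam = f"{order}_sorted.bam"
    sort_cmd = ["samtools", "sort"]
    if order == "name":
        sort_cmd.append("-n")  # sort by read name instead of coordinate
    sort_cmd += ["-o", sorted_bam, bam]
    # '-r name' is htseq-count's default, so passing it explicitly is
    # equivalent to pipeline 1 ("without -r pos").
    count_cmd = ["htseq-count", "-f", "bam", "-r", order, sorted_bam, gtf]
    return [sort_cmd, count_cmd]

# Dry run: print the commands instead of executing them.
for order in ("name", "pos"):
    for cmd in pipeline(order):
        print(" ".join(shlex.quote(part) for part in cmd))
```

Whichever pipeline is chosen, the value passed to `-r` has to match how the BAM file was actually sorted.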

htseq RNA-Seq RNA rna-seq samtools • 584 views

As @Devon suggested in an earlier question, you should use featureCounts instead. It is much faster, can auto-sort files as needed, and will create an analysis-ready count matrix from the set of BAM files you provide, making downstream import easy.
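
A single featureCounts call over several BAM files might look roughly like the sketch below. It only assembles the command line; the helper name `featurecounts_cmd`, the file names, and the thread count are hypothetical. The flags shown (`-T`, `-a`, `-o`, `-p`) are standard featureCounts options, but some newer Subread releases reportedly also need `--countReadPairs` to count fragments rather than reads, so check `featureCounts` help output for your installed version.

```python
def featurecounts_cmd(bams, gtf="genes.gtf", out="counts.txt",
                      paired=True, threads=4):
    """Build one featureCounts command covering all BAM files.

    All file names and the default thread count are illustrative assumptions.
    """
    cmd = ["featureCounts", "-T", str(threads), "-a", gtf, "-o", out]
    if paired:
        cmd.append("-p")  # treat the input as paired-end
    return cmd + list(bams)

# Dry run: print the command instead of executing it.
print(" ".join(featurecounts_cmd(["ctrl_1.bam", "ctrl_2.bam", "treat_1.bam"])))
```

The output file then holds one count column per input BAM, which is the count matrix mentioned above.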


Thanks! In that case, I do not need to sort the BAM file using samtools, right?


The BAM file still needs to be sorted, and AFAIK there are slightly different requirements for paired-end (fragment) and single-end (read) quantification. Basically, featureCounts will try to fix the mate pairs if it detects inconsistencies, but that repair step is much slower than the actual read counting, so it's best to make sure your files are sorted correctly. Samtools has options to fix unpaired mate reads or remove unpaired reads altogether.
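
To illustrate why sort order matters for paired-end counting, here is a toy model in Python. It is not how featureCounts or htseq-count are implemented; it simply streams read names in file order, holds each read until its mate shows up, and reports the peak number of reads held plus any reads whose mate never appeared. The function name and the example read names are made up for illustration.

```python
def pending_mates(read_names):
    """Stream read names in file order.

    Returns (peak number of reads waiting for a mate,
             sorted list of reads whose mate never appeared).
    """
    pending = set()
    peak = 0
    for name in read_names:
        if name in pending:
            pending.remove(name)   # mate found, pair is complete
        else:
            pending.add(name)      # first mate seen, wait for the second
            peak = max(peak, len(pending))
    return peak, sorted(pending)

name_sorted = ["r1", "r1", "r2", "r2", "r3", "r3"]  # mates adjacent
pos_sorted = ["r1", "r2", "r3", "r1", "r3", "r2"]   # mates far apart
orphaned = ["r1", "r1", "r2", "r3", "r3"]           # mate of r2 is absent

print(pending_mates(name_sorted))  # (1, [])
print(pending_mates(pos_sorted))   # (3, [])
print(pending_mates(orphaned))     # (2, ['r2'])
```

In this model a name-sorted file never holds more than one read at a time, a position-sorted file can hold many, and a read whose mate is absent from the file stays unpaired regardless of sort order. That last case is one plausible reason why some mate warnings can persist even when the sort order is declared correctly.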


Thanks! Actually, I have tried name-sorted, position-sorted, and unsorted BAM files with featureCounts. The outputs were pretty much the same, with only minor differences.