Question: Optimise htseq-count performance by choosing proper samtools sort options
5 months ago, 2822462298 wrote:

Hi all,

I am currently using samtools to sort my BAM files by position (the default), and then htseq-count to obtain read counts. Initially, I got a massive number of 'Mate records missing' warnings. I then realized that htseq-count assumes the files are sorted by name, so I added the '-r pos' option and reran it. I now get fewer 'Mate records missing' warnings, but they are still there. So my questions are: 1. Is there a way to eliminate the warnings completely? 2. Which of the following pipelines is better?

  1. samtools sort by name + htseq-count without -r pos
  2. samtools sort by position + htseq-count with -r pos
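For reference, the two options above can be spelled out as command lines. This is only a minimal sketch: the Python below builds and prints the argument lists without running anything, and the file names (`sample.bam`, `genes.gtf`, `sorted_by_*.bam`) and the helper `pipeline_commands` are placeholders made up for illustration. It assumes `samtools sort -n` for name sorting and the htseq-count `-r name` / `-r pos` order option.

```python
import shlex


def pipeline_commands(bam, gtf, order):
    """Return (sort_cmd, count_cmd) for order 'name' or 'pos'.

    Hypothetical helper: it only assembles argument lists.
    """
    if order not in ("name", "pos"):
        raise ValueError("order must be 'name' or 'pos'")
    sorted_bam = f"sorted_by_{order}.bam"  # placeholder output name
    sort_cmd = ["samtools", "sort"]
    if order == "name":
        sort_cmd.append("-n")  # -n sorts by read name; default is by coordinate
    sort_cmd += ["-o", sorted_bam, bam]
    # -r tells htseq-count how the input is ordered (its default is 'name')
    count_cmd = ["htseq-count", "-f", "bam", "-r", order, sorted_bam, gtf]
    return sort_cmd, count_cmd


for order in ("name", "pos"):
    for cmd in pipeline_commands("sample.bam", "genes.gtf", order):
        print(shlex.join(cmd))
```

Writing `-r name` explicitly in option 1 is equivalent to leaving it out, since name order is the htseq-count default.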

I read the developer's posts at https://github.com/simon-anders/htseq/issues/37 but I still couldn't figure out how to improve the process.

rna-seq samtools rna htseq • 164 views

As @Devon suggested in an earlier question, you should use featureCounts instead. It is much faster, can sort files automatically as needed, and creates an analysis-ready count matrix from the set of BAM files you provide, which makes downstream import easy.
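A typical paired-end featureCounts call might look like the one assembled below. Treat it as a rough sketch under assumptions: the BAM and GTF names, the output name `counts.txt`, the thread count, and the helper `featurecounts_command` are all made up for illustration, and nothing is executed. The `--countReadPairs` flag is, as far as I know, only needed (and only accepted) in newer Subread releases; with older ones `-p` alone counts fragments, so check `featureCounts -v` and the help text for your version.

```python
import shlex


def featurecounts_command(bams, gtf, out="counts.txt", paired=True,
                          threads=4, count_read_pairs=True):
    """Build (not run) a featureCounts command for a set of BAM files."""
    cmd = ["featureCounts", "-T", str(threads), "-a", gtf, "-o", out]
    if paired:
        cmd.append("-p")  # input contains paired-end reads
        if count_read_pairs:
            # Assumption: newer Subread versions; drop for older releases
            cmd.append("--countReadPairs")
    return cmd + list(bams)  # all BAMs go into one count matrix


print(shlex.join(featurecounts_command(["a.bam", "b.bam"], "genes.gtf")))
```

Passing all BAM files in one call is what produces the single count matrix with one column per sample.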

modified 5 months ago • written 5 months ago by genomax (85k)

Thanks! In that case, I do not need to sort the BAM file using samtools, right?

written 5 months ago by 2822462298 (50)

The BAM file still needs to be sorted, and AFAIK there are slightly different requirements for paired-end (fragment) and single-end (read) quantification. Basically, featureCounts will try to fix the mate pairs if it detects inconsistencies, but that repair step is much slower than the actual read counting, so it's best to make sure your files are sorted correctly. Samtools has options to fix unpaired mate reads or remove unpaired reads altogether.
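To see whether a file really contains reads whose mate is absent, a check along these lines can help. It is a toy sketch, not what htseq-count or featureCounts do internally: it works on plain `(read_name, mate_number)` tuples that you would have to extract from the BAM yourself, the function name `find_orphans` is made up, and it ignores secondary and supplementary alignments.

```python
from collections import defaultdict


def find_orphans(records):
    """Return sorted read names that do not have both mate 1 and mate 2.

    records: iterable of (read_name, mate_number), mate_number being 1 or 2.
    """
    seen = defaultdict(set)
    for name, mate in records:
        seen[name].add(mate)
    return sorted(name for name, mates in seen.items() if mates != {1, 2})


# Toy input in arbitrary (e.g. position) order: r2 has lost its second mate
records = [("r1", 1), ("r2", 1), ("r1", 2), ("r3", 2), ("r3", 1)]
print(find_orphans(records))  # ['r2']
```

If reads like `r2` exist in the file, no sort order will silence the warnings; the orphans have to be repaired or filtered out first. Note that `samtools fixmate` expects name-sorted (or collated) input.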

written 5 months ago by predeus (1.4k)

Thanks! Actually, I have tried name-sorted, position-sorted, and unsorted BAM files with featureCounts. The outputs were pretty much the same, with only minor differences.

modified 5 months ago • written 5 months ago by 2822462298 (50)