Question: Should I filter my reads before running bedtools multicov?
lin.pei26 (China) wrote 3.6 years ago:

Hi all:

I am working on paired-end RNA-seq data and want to get read counts by running "bedtools multicov" on the BAM file generated by TopHat.

Should I filter the reads in some way before running bedtools (for example, removing duplicate reads with samtools or removing poor-quality reads)?

Your opinions would be most valuable.

Thanks in advance!

Best,

 

rna-seq • 1.9k views
Devon Ryan (Freiburg, Germany) wrote 3.6 years ago:

Do not use bedtools multicov to get counts if you plan to use them for statistics. Use featureCounts or htseq-count instead.

You generally do not have to do any filtering; the defaults for featureCounts or htseq-count typically suffice.


Thanks Devon.

But could you explain in more detail why I should not use bedtools multicov?

 

written 3.6 years ago by lin.pei26

It gives incorrect counts for statistics. The statistics you're trying to do need unique counts, which multicov won't produce, since that isn't its purpose.

written 3.6 years ago by Devon Ryan

Hi Devon,

Can you please tell me what the purpose of bedtools multicov is? And what kind of statistics are you talking about? I have used bedtools multicov many times to get read counts for downstream differential expression analysis.

Thanks a lot

Chirag

written 3.6 years ago by Chirag Parsania

Say you have DNA-seq samples and want to know what the coverage in a region is; then multicov would be quite useful. If you're trying to get counts for DESeq2, edgeR, or something similar, then your counts will tend to be wrong.
 

written 3.6 years ago by Devon Ryan

Hi Devon,

Thanks for your response. Can you please tell me why bedtools multicov would be wrong for DESeq2 and edgeR? What about samtools idxstats? Is it useful for DESeq and edgeR?

Thanks

Chirag

written 3.6 years ago by Chirag Parsania

This boils down to how one should deal with the following:

  1. Reads that multimap within the genome
  2. Reads that don't multimap, but overlap more than one feature (typically a gene, but you could use DESeq2/edgeR/etc. for other things).

In the case of #1, multicov will generally include these, though there's likely a flag to prevent that. These should never be included in the counts given to DESeq2 and the others, since it breaks the statistical assumptions.

In the case of #2, my understanding was always that multicov will count a read for all features that it overlaps. This also violates the statistical assumptions used in DESeq2 and the others.

For these reasons I think most people use featureCounts these days (it's MUCH faster than htseq-count).
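
To make the two cases above concrete, here is a minimal toy sketch in Python. It is not how bedtools, featureCounts, or htseq-count are actually implemented; the feature coordinates, read records, and function names are all hypothetical, and the "discard multimappers and ambiguous reads" rule is only an assumption about what a typical default counting mode does.

```python
from collections import Counter

# Hypothetical features as (name, start, end), half-open intervals.
FEATURES = [("geneA", 100, 200), ("geneB", 180, 300)]

# Hypothetical reads as (read_id, start, end, number_of_alignments).
READS = [
    ("r1", 110, 160, 1),  # uniquely mapped, overlaps geneA only
    ("r2", 170, 220, 1),  # uniquely mapped, overlaps geneA and geneB (case #2)
    ("r3", 250, 290, 3),  # multimapper, overlaps geneB (case #1)
]


def overlapping(start, end):
    """Return the names of all features that the interval overlaps."""
    return [name for name, fs, fe in FEATURES if start < fe and end > fs]


def overlap_counts(reads):
    """Coverage-style counting: every read counts for every feature it overlaps."""
    counts = Counter()
    for _read_id, start, end, _n_alignments in reads:
        for name in overlapping(start, end):
            counts[name] += 1
    return counts


def unique_counts(reads):
    """Count a read only if it maps once and overlaps exactly one feature."""
    counts = Counter()
    for _read_id, start, end, n_alignments in reads:
        hits = overlapping(start, end)
        if n_alignments == 1 and len(hits) == 1:
            counts[hits[0]] += 1
    return counts


if __name__ == "__main__":
    print("overlap-style:", dict(overlap_counts(READS)))
    print("unique-style: ", dict(unique_counts(READS)))
```

With these three reads, the overlap-style count assigns four counts in total even though there are only three reads, because r2 is counted twice. The unique-style count keeps only r1.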

If you've aligned to the transcriptome and are trying to get transcript level counts, then yes you can use samtools idxstats, but note that you first need to remove multimappers. In this case I would strongly encourage you to instead use the BAM file with Salmon and then to use Sleuth instead of edgeR/DESeq2/etc. It's very likely that you'll get better results (in fact, you'll probably see this becoming "the standard method" that everyone uses over the next couple years).
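
As a rough sketch of the idxstats route, the snippet below turns `samtools idxstats` text output into a per-transcript count table. It assumes the usual four tab-separated columns (reference name, sequence length, mapped reads, unmapped reads) and that multimappers were already removed from the BAM file beforehand; the function name and the example text are hypothetical.

```python
def parse_idxstats(text):
    """Map each reference name to its mapped-read count.

    Expects tab-separated lines of the form:
    name <TAB> length <TAB> mapped <TAB> unmapped
    """
    counts = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, _length, mapped, _unmapped = line.split("\t")
        if name == "*":
            # The "*" row holds reads with no reference; skip it.
            continue
        counts[name] = int(mapped)
    return counts


# Hypothetical idxstats output for two transcripts.
EXAMPLE = "tx1\t1500\t42\t0\ntx2\t900\t7\t0\n*\t0\t0\t13\n"

if __name__ == "__main__":
    for transcript, n in parse_idxstats(EXAMPLE).items():
        print(transcript, n)
```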

written 3.6 years ago by Devon Ryan