Should I filter my reads before running bedtools multicov ?
1
1
Entering edit mode
9.1 years ago
Pei ▴ 220

Hi all:

I am working on pair-end RNA-seq data to get read counts by running "bedtools multicov" on the bam file generated by Tophat.

Should I perform some kinds of filtering of reads before bedtools? (such as remove duplicated reads by samtools or remove poor-quality reads)

Your opinion must be most valuable.

Thanks in advance!

Best

RNA-Seq • 4.4k views
ADD COMMENT
2
Entering edit mode
9.1 years ago

Do not use bedtools multicov to get counts if you plan to use them for statistics. Use featureCounts or htseq-count instead.

You generally do not have to do any filtering, the defaults for featureCounts or htseq-count typically suffice.

ADD COMMENT
0
Entering edit mode

Thanks Devon.

But could you give some more explanation for not using bedtools multicov?

ADD REPLY
1
Entering edit mode

It gives incorrect counts for statistics. The statistics you're trying to do need unique counts, which multicov won't produce, since that isn't its purpose.

ADD REPLY
0
Entering edit mode

Hi Devon,

Can you please tell what is the purpose of the bedtools multivcov ? And what kind of statistics you are talking.? I have used bedtools multicov many times to get read counts for downstream differential expression analysis

Thanks a lot

Chirag

ADD REPLY
0
Entering edit mode

Say you have DNAseq samples and want to know what the coverage in an area is, then multicov would be quite useful. If you're trying to get counts for DESeq2 or edgeR or something similar then your counts will tend to be wrong.

ADD REPLY
0
Entering edit mode

Hi Devon,

Thanks for your response. Can you please tell me why bedtools multicov would be wrong for DESeq2 and edgeR? What about samtools idxstats? is it useful for DESeq and edgeR?

Thanks

Chirag

ADD REPLY
1
Entering edit mode

This boils two to how one should deal with the following:

  1. Reads that multimap within the genome
  2. Reads that don't multimap, but overlap more than one feature (typically a gene, but you could use DESeq2/edgeR/etc. for other things).

In the case of #1, multicov will generally include these, though there's likely a flag to prevent that. These should never be included in the counts given to DESeq2 and the others since it's breaks the statistical assumptions.

In the case of #2, my understanding was always that multicov will count a read for all features that it overlaps. This also violates the statistical assumptions used in DEseq2 and the others.

For these reasons I think most people use featureCounts these days (it's MUCH faster than htseq-count).

If you've aligned to the transcriptome and are trying to get transcript level counts, then yes you can use samtools idxstats, but note that you first need to remove multimappers. In this case I would strongly encourage you to instead use the BAM file with Salmon and then to use Sleuth instead of edgeR/DESeq2/etc. It's very likely that you'll get better results (in fact, you'll probably see this becoming "the standard method" that everyone uses over the next couple years).

ADD REPLY

Login before adding your answer.

Traffic: 1552 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6