Question: What is the Trinity setting max_pct_stdev?
Ekarl2 wrote (3.4 years ago):

In the Trinity manual, I read the following:

--max_pct_stdev <int> :maximum pct of mean for stdev of kmer coverage across read (default: 200)

What does this mean in more detail? I found an older discussion:

http://ivory.idyll.org/blog/trinity-in-silico-normalize.html

where they explain it as:

Here, the per-read pct_dev is defined as the deviation in k-mer coverage divided by the average k-mer coverage, times 100 (to make it a percent). If the deviation is high, that indicates that the read is likely to contain many errors, since high-coverage reads with low-coverage k-mers shouldn't happen. Trinity sets a cutoff of 100: if the deviation is as big as the average, the read should go away

So is max_pct_stdev just std kmer coverage / avg kmer coverage * 100?

Does a high value for this statistic mean that some k-mers have really low coverage compared with the average (thus increasing the standard deviation by a lot), and that, assuming a generally large sequence coverage, these reads are probably bad? Or where do we get the "high-coverage read" from?

Tags: max_pct_stdev, rna-seq, trinity

RamRS wrote (3.4 years ago):

From what I understand:

For a 100 bp read with coverage = 30, all 25-mers from it should ideally be covered at around 30X (this is just an example with arbitrary values).

I am not entirely sure why that assumption is made, but it seems rare to encounter k-mers with coverage orders of magnitude higher than that of the read they come from. It is, however, entirely possible for a k-mer to have much lower coverage than its read, owing to sequencing errors. If a read has several such erroneous k-mers distributed across it, the standard deviation of its k-mer coverage values increases, yet the read itself may still pass QC. Such a read can be considered suboptimal and discarded without significant loss to the assembly process.
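
To make the statistic concrete, here is a minimal Python sketch of the per-read calculation as described in this thread (stdev of k-mer coverage divided by mean k-mer coverage, times 100). It is not Trinity's code: the function name `pct_stdev`, the use of the population standard deviation, and the toy coverage values (a 100 bp read, 76 25-mers at 30X, one error dropping 25 of them to 1X) are all assumptions for illustration.

```python
from statistics import mean, pstdev


def pct_stdev(kmer_coverages):
    """Return 100 * stdev / mean of the k-mer coverage values of one read.

    Assumption: population standard deviation; Trinity may compute it differently.
    """
    avg = mean(kmer_coverages)
    if avg == 0:
        return float("inf")  # no coverage at all: treat as maximally uneven
    return 100.0 * pstdev(kmer_coverages) / avg


# A 100 bp read contains 100 - 25 + 1 = 76 overlapping 25-mers.
clean_read = [30] * 76  # every k-mer covered at 30X

# Hypothetical read with one sequencing error in the middle: the 25 k-mers
# overlapping the error are seen only once, the other 51 stay at 30X.
error_read = [30] * 26 + [1] * 25 + [30] * 25

print(f"clean read: {pct_stdev(clean_read):.1f}%")
print(f"error read: {pct_stdev(error_read):.1f}%")
```

With these toy numbers the clean read scores 0% and the read with one error scores roughly 67%, so under this reading a single error alone would not push a read past a cutoff of 100 or 200.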


Thank you for your detailed explanation. Have I understood the equation ("std kmer coverage / avg kmer coverage * 100") correctly?

written 3.4 years ago by Ekarl2

That seems right. Think of it as "what percent of the mean k-mer coverage can the k-mer coverage sd be, at max?" 
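
Read that way, the option acts as a per-read filter. The sketch below is only an illustration of that idea, not Trinity's implementation: the helper `passes_max_pct_stdev`, the choice of "less than or equal" at the boundary, the population standard deviation, and the three toy reads are all assumptions.

```python
from statistics import mean, pstdev


def passes_max_pct_stdev(kmer_coverages, max_pct_stdev=200):
    """Keep a read if its k-mer coverage stdev is at most max_pct_stdev % of the mean.

    Assumptions: population stdev, and a read exactly at the cutoff is kept.
    """
    avg = mean(kmer_coverages)
    if avg == 0:
        return False  # no coverage information, drop the read
    return 100.0 * pstdev(kmer_coverages) / avg <= max_pct_stdev


# Hypothetical per-read k-mer coverage profiles (76 25-mers per 100 bp read).
reads = {
    "uniform": [30] * 76,                # even coverage
    "mostly_low": [30] * 20 + [1] * 56,  # most k-mers seen only once
    "spiky": [1] * 70 + [500] * 6,       # a few k-mers with very high coverage
}

kept_200 = [name for name, cov in reads.items() if passes_max_pct_stdev(cov, 200)]
kept_100 = [name for name, cov in reads.items() if passes_max_pct_stdev(cov, 100)]

print("kept at cutoff 200:", kept_200)
print("kept at cutoff 100:", kept_100)
```

In this toy data the "mostly_low" read (about 148%) survives the manual's default of 200 but not the cutoff of 100 quoted from the older blog post, which shows how much the chosen value matters.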

written 3.4 years ago by RamRS