Question

Removing contigs with improbable coverage using R?

0

Entering edit mode

8.9 years ago

Lesley Sitter ▴ 600

Hi,

I have a question involving assemblies and checking the quality of the contigs in those assemblies.

The data I have consists of Pair ended reads, I merged them using PEAR. The reads that were left unassembled by PEAR because they did not have enough overlap were fed to SOAP denovo for assembling. The two assemblies were merged. This gave me a large set of contigs and to check their quality I mapped my reads back to them. The data then looks something like this;

contig_name    contig_coverage
1                     14
2                     9
3                     8
4                     15
5                     14
6                     13
7                     3.000

The distribution looks something like this

distribution

On the left is the distribution with the contig that has 3000x coverage, and on the right I used boxplot to find outliers and removed them. It is pretty apparent that contig 7 is an outlier, but the way I removed it is not a proper way because the boxplot method is only useful for a normally distributed dataset and this looks more like log normally distributed.

I have tried using fitdistr() to find the exact type of distribution but it does not produce a definitive answer.

Our statistician also mentioned using generalized linear models, then take the deviance and divide by degrees of freedom to get the distribution. and the closest to 1 is the distribution which I can use to determine the type of outlier detection I need... but I can't get this to work properly (see example down below)

glm1 <- glm(contig_coverage~1,data=data_set, family = gaussian(log))
glm2 <- glm(contig_coverage~1,data=data_set, family = poisson(log))
glm3 <- glm(contig_coverage~1,data=data_set, family = Gamma(log))

Can anybody help me with a rule/method/function with which I can find out which contigs have improbable coverage in R so that I can remove them from the final assembly?

outlier assembly coverage-statistics read-mapping • 2.6k views

ADD COMMENT • link updated 16 months ago by Ram 43k • written 8.9 years ago by Lesley Sitter ▴ 600

0

Entering edit mode

Some doubts:

Is it at all a valid approach to make a decision based only on the coverage? you'd remove the contig with the largest amount of evidence in that case.
This contig might contain a large stretch of repeats, should you check for that?
I guess it's genomic dna and reads are also genomic dna? If not then other concerns would come up.
If you don't know the distribution, you could go for non-parametric trimming e.g. trim based on percentile (highest 5% and lowest?)
Did you try to fit negative binomial, that seems to be the standard approach today?

ADD REPLY • link updated 16 months ago by Ram 43k • written 8.9 years ago by Michael 54k

0

Entering edit mode

Thanks for reading my question :D to answer your points

The goal is a pipeline that can do a complete assembly, quality filtering, and eventually SNP calling so that the output gets standardized and the laboratory analysts don't have to worry about the data handling. We would rather sacrifice a small number of possibly problematic contigs than having to do a manual check each time we want to run this pipeline.
I have checked the top ten outliers and they contain contaminations like TE's so it's not valuable in this test and we expect these types of contigs coming up regularly that is why we want to remove them.
You are correct, the reads are genomic DNA and our assemblies therefore are also genomic DNA
We have tried some non-parametric approaches, but the problem is that in this case the 3000x coverage is highly improbable, but if for example we had a perfect run and the max coverage we have is for example 250, this would fall within our predicted distribution but when trimming for example 5% we would remove our best contigs even though they would be considered to be accurate according to the distribution.
Have not yet tried that, thank you for suggesting it, I will look into it right away :D

ADD REPLY • link updated 16 months ago by Ram 43k • written 8.9 years ago by Lesley Sitter ▴ 600

0

Entering edit mode

How did you merge the reads? If I understood you correctly, you assembled using only the unmerged reads, is this correct? Is this genome or transcriptome? Why do you want to remove them?

Anyway, if your outliers are like contig 7, finding them should be straightforward.

ADD REPLY • link 8.9 years ago by h.mon 35k

0

Entering edit mode

Sorry I wasn't clear on that part. We merged reads using PEAR (Pair Ended reAd mergeR), that way we end up with lots of contigs. The read pairs that couldn't be merged by PEAR were fed to SOAP and assembled. Both assemblies combined results in final assembly.

Contig 7 is straightforward, but this is just and example, as you can see in the graph there isn't a definitive threshold where correct contigs end and improbable ones start so we want some function that can set that threshold for us, preferabbly one that takes into account the distribution so that it not just removes contigs based on a predetermined static value.

ADD REPLY • link updated 16 months ago by Ram 43k • written 8.9 years ago by Lesley Sitter ▴ 600

0

Entering edit mode

PEAR is not an assembler as in "genome assembler", I would rather call it "read merger".

ADD REPLY • link 8.9 years ago by h.mon 35k

Ram · Answer 1 · 2015-06-10

I think you shouldn't base the decision on coverage alone, because high coverage is an indication of presence of repetitive elements as I suspected, but they might be surrounded by 'good' sequence. Instead, I would try to mask repeats using RepeatMasker and attempt to align reads to non-masked regions only. Imagine you have a region containing a long tandem repeat, this region would yield abnormal coverage as a peak, while you couldn't see the TR in the assembly, however the surrounding regions might be perfectly fine.

Also, variant calling pipelines might have a parameter for maximum local coverage to avoid similar problems.

Does this make sense?

Wrt your response:

We have tried some non-parametric approaches, but the problem is that in this case the 3000x coverage is highly improbable, but if for example we had a perfect run and the max coverage we have is for example 250, this would fall within our predicted distribution but when trimming for example 5% we would remove our best contigs even though they would be considered to be accurate according to the distribution.

In this case fitting a neg. binomial using fitdistr and using a p value of 0.05 or 0.01 as the cutoff, - or using negative binomial regression - should yield better results, meaning that trimming would be done conditionally only.