Question: How do you remove coverage outliers in sequencing count data?
biohack92 (United States) wrote, 3.7 years ago:

I've recently looked at methylation coverage (bisulfite seq data) in IGV, and there are obvious coverage outliers which I'm interpreting as mismappings of repetitive sequences. How do you identify these outliers from methylation calls/count data (I don't have BAM/SAM files) and remove them?

dariober (WCIP | Glasgow | UK) wrote, 3.7 years ago:

The quantile function in R is handy in these cases. To discard datapoints in the top 1% (and keep the bottom 99%) do:

xv <- rnbinom(10000, 1, 0.1)          ## Test data
qqcut <- quantile(xv, probs = 0.99)   ## Cutoff to use in R or awk
xvcut <- xv[xv < qqcut]               ## Keep only points in lower 99%
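
Applied to methylation calls rather than a bare vector, the same cutoff might look like the sketch below. The data frame and its column names (chrom, pos, meth, unmeth) are hypothetical stand-ins for whatever the call file contains, and coverage is assumed to be methylated plus unmethylated counts.

```r
## Hypothetical per-site methylation calls; column names are assumptions
set.seed(1)
calls <- data.frame(
  chrom  = "chr1",
  pos    = seq_len(10000),
  meth   = rnbinom(10000, 1, 0.1),
  unmeth = rnbinom(10000, 1, 0.1)
)
calls$coverage <- calls$meth + calls$unmeth       ## Total reads at each site
cutoff <- quantile(calls$coverage, probs = 0.99)  ## 99th percentile of coverage
filtered <- calls[calls$coverage < cutoff, ]      ## Drop sites at or above the cutoff
```

Because counts are discrete and the comparison is strict, sites whose coverage equals the cutoff are dropped too, so slightly more than 1% of sites can be removed.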
Devon Ryan (Freiburg, Germany) wrote, 3.7 years ago:

Load the coverage into R, plot the distribution, and then use awk to filter the files with a reasonable threshold chosen from that distribution.
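
A minimal sketch of that workflow, done entirely in R instead of handing the filter to awk. The coverage vector is simulated, and the threshold of 60 is a hypothetical value picked by eye for this simulated data, not a recommendation for real data.

```r
## Simulated per-site coverage standing in for real counts (assumption)
set.seed(1)
coverage <- rnbinom(10000, 1, 0.1)

## Plot the distribution; outliers show up as a long right tail
hist(coverage, breaks = 100, main = "Coverage distribution",
     xlab = "Reads per site", ylab = "Number of sites")

threshold <- 60                          ## Hypothetical cutoff chosen from the plot
abline(v = threshold, lty = 2)           ## Mark the cutoff on the histogram
kept <- coverage[coverage <= threshold]  ## Keep sites at or below the cutoff
```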


Thanks @Devon Ryan. I followed your advice and this is the plot I created. X-axis shows the # of reads/counts and Y-axis is the frequency of each count. Is there a way to determine which threshold is 'reasonable'?

Reply written 3.7 years ago by biohack92

You could make a qq-plot, but it's generally OK to just eyeball things. From the distribution, it looks like a threshold around 200 would be reasonable.
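
For the qq-plot route, one possible sketch follows. The reply does not say which reference distribution to compare against; a negative binomial fitted by method of moments is an assumption made here, and the data (including the 20 artificial outliers at 500 reads) are simulated.

```r
## Simulated coverage with a few artificial outliers (assumption, not real data)
set.seed(1)
coverage <- c(rnbinom(10000, size = 5, mu = 30), rep(500, 20))

## Method-of-moments negative binomial fit on the bulk of the data;
## the choice of reference distribution is an assumption
bulk <- coverage[coverage < quantile(coverage, probs = 0.99)]
mu   <- mean(bulk)
size <- mu^2 / (var(bulk) - mu)   ## Only valid when var > mean (overdispersion)

## Theoretical vs observed quantiles; outliers bend away from the y = x line
theo <- qnbinom(ppoints(length(coverage)), size = size, mu = mu)
qqplot(theo, coverage, xlab = "Negative binomial quantiles",
       ylab = "Observed coverage")
abline(0, 1)
```

The coverage value at which the points leave the line is a candidate threshold.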

Reply written 3.7 years ago by Devon Ryan