CNV by read_depth approach
9.7 years ago
Anon • 0

Hey guys!

I'm working on a CNV project for the first time and need your help.

I mapped 44 Illumina read sets from 44 different individuals to a reference genome, extracted the number of reads mapped to each genic feature, and am now trying to normalize all of this.

I applied the following formula, found in a publication:

NormalizedData = (Number of reads in feature * genome size) / (Total number of read in sample * read size)

They then say:

< 0.05 = gene isn't present
< 0.5 = deletion
> 1.5 = duplication
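The formula and thresholds above can be sketched in a few lines (a minimal illustration only; the function names and the "normal" label for the in-between case are mine, not from the publication):

```python
def normalized_depth(reads_in_feature, genome_size, total_reads, read_length):
    """The quoted formula: divide the raw read count in a feature by the
    sample's mean per-base genome coverage (total_reads * read_length / genome_size)."""
    return (reads_in_feature * genome_size) / (total_reads * read_length)

def classify(value):
    """Apply the thresholds quoted from the publication."""
    if value < 0.05:
        return "gene isn't present"
    if value < 0.5:
        return "deletion"
    if value > 1.5:
        return "duplication"
    return "normal"

# Example: 100 reads of length 100 in a feature, 1000 total reads,
# genome of 1000 bp -> normalized value of 1.0, i.e. "normal".
value = normalized_depth(100, 1000, 1000, 100)
```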

What do you guys think of this technique? I've read about a lot of possibilities online, but what do you usually do with this kind of data? I heard a few things about LDA and tried it in R with the MASS package, but I ran into some issues, and I'm not very good at statistics.

I want to maximize the differences and "group" my values around the thresholds, but I think my current approach is weak.

Thanks a lot

R CNV normalization depth • 2.8k views

There are some problems with this approach. Why not avoid reinventing the wheel by using a preexisting package? If you search PubMed you'll find more than a few.


QDNAseq was published last year and might be of interest for CNV detection.

http://genome.cshlp.org/content/24/12/2022.full


I searched a lot and found different normalization formulas with different thresholds.

I should have mentioned that I work on a really large genome and wasn't able to keep all my BAM files; I only extracted what I wanted (mainly the number of reads mapped to each gene).

I have indeed heard of that kind of software, but I'm looking for a formula to normalize this depth of coverage instead of remapping everything and piping it through samtools with a BED file. I know I should do that, but I'm just starting out.


You'd be better off to just remap things. A word of advice: when starting a new type of analysis that you've never done before, research what the current best practices are beforehand. This will save you some time.

Also, your genome isn't going to be much larger than what the rest of us work on. Even if you have some high-ploidy plant with high coverage, storage is cheap.


Hello!

It appears that your post has been cross-posted to another site: http://seqanswers.com/forums/showthread.php?t=51110

This is typically not recommended as it runs the risk of annoying people in both communities.


So it annoys people now to look for help in two different communities? OK then, noted for next time.

Thanks zhangchao3, I'll have a look at your first link; it seems to accept pileup as input!


The "annoy" part is from an autogenerated template. For questions like this that end up being about best-practices and methods comparisons, cross-posting isn't a problem. But it's always best to have a link to the other posts so people who see the question later and are interested in all of the replies can find them.

9.7 years ago
Czh3 ▴ 190

This normalization is not enough, because GC content and mappability also influence the number of reads in a feature. I think you should try Control-FREEC (http://bioinfo-out.curie.fr/projects/freec/tutorial.html) or CNVnator (http://sv.gersteinlab.org/cnvnator/).
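To illustrate why GC content matters: a simple correction is to divide each window's read count by the median count of windows with similar GC fraction. This is only a rough sketch of the idea; Control-FREEC and CNVnator use more sophisticated models internally, and the function and parameter names here are mine:

```python
import statistics
from collections import defaultdict

def gc_correct(counts, gc_fractions, bin_width=0.01):
    """Divide each window's read count by the median count of all
    windows falling in the same GC bin (a basic GC-bias correction)."""
    bins = defaultdict(list)
    for count, gc in zip(counts, gc_fractions):
        bins[round(gc / bin_width)].append(count)
    medians = {b: statistics.median(v) for b, v in bins.items()}
    return [c / medians[round(gc / bin_width)] if medians[round(gc / bin_width)] else 0.0
            for c, gc in zip(counts, gc_fractions)]

# Windows with 60% GC systematically attract twice the reads here;
# after correction, all windows look like single-copy (ratio ~1.0).
corrected = gc_correct([10, 20, 10, 20], [0.4, 0.6, 0.4, 0.6])
```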
