Hey guys!
I'm currently working on a CNV project for the first time and need your help.
I mapped 44 Illumina read sets, coming from 44 different individuals, onto a reference genome. I then extracted the number of reads mapped to each genic feature and am trying to normalize all of this mess.
I applied the following formula, found in a publication:
NormalizedData = (Number of reads in feature * genome size) / (Total number of reads in sample * read size)
They then state:
< 0.05 = gene isn't present
< 0.5 = deletion
> 1.5 = duplication
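For reference, here is a minimal Python sketch of the formula and cutoffs exactly as quoted above. The function names, the "normal" label for values between 0.5 and 1.5, and the handling of values that land exactly on a cutoff are my own assumptions; the publication is not named here, so none of this is confirmed against it. Note that the formula as quoted has no feature-length term, so this is a literal transcription rather than a recommended method.

```python
def normalize(reads_in_feature, total_reads, genome_size, read_size):
    """Normalized value, transcribed literally from the quoted formula."""
    if total_reads <= 0 or read_size <= 0:
        raise ValueError("total_reads and read_size must be positive")
    return (reads_in_feature * genome_size) / (total_reads * read_size)


def classify(value):
    """Apply the quoted cutoffs to a normalized value.

    The label "normal" for everything that is not below 0.5 or above 1.5
    is an assumption; the post only lists the three cutoffs.
    """
    if value < 0.05:
        return "absent"
    if value < 0.5:
        return "deletion"
    if value > 1.5:
        return "duplication"
    return "normal"


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    value = normalize(reads_in_feature=10, total_reads=1000,
                      genome_size=1_000_000, read_size=100)
    print(value, classify(value))
```

With a per-gene count table for the 44 samples, the same two functions could be applied to each count using that sample's own total read count.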
What do you guys think of this technique? I've read about a lot of possibilities online, but what do you usually do with this kind of data? I heard a few words about LDA and tried it in R with the MASS package, but I ran into some issues... And I'm not so good at statistics, so...
I want to maximize the differences and "group" my values around thresholds, but I think the way I'm currently doing it is weak.
Thanks a lot
There are some problems with this approach. Why not avoid reinventing the wheel by using a preexisting package? If you search PubMed you'll find more than a few.
QDNAseq was published last year and might be of interest for CNV detection.
http://genome.cshlp.org/content/24/12/2022.full
I searched a lot and found different normalization formulas with different thresholds...
And I should have told you that I work on a really large genome and wasn't able to keep all my BAM files; I just extracted what I wanted (especially the number of reads mapped to each gene).
For sure, I've heard about this kind of software, but I'm looking for a formula that can normalize this depth of coverage instead of remapping everything, piped through samtools with a BED file. I know I should do that, but... I'm just beginning.
You'd be better off to just remap things. A word of advice: when starting a new type of analysis that you've never done before, research what the current best practices are beforehand. This will save you some time.
Also, your genome isn't going to be much larger than what the rest of us work on. Even if you have some high-ploidy plant with high coverage, storage is cheap.
Hello!
It appears that your post has been cross-posted to another site: http://seqanswers.com/forums/showthread.php?t=51110
This is typically not recommended as it runs the risk of annoying people in both communities.
So it now annoys people when you look for help in 2 different communities? OK then, noted for next time.
Thanks zhangchao3, I'll have a look at your first link, it seems to accept pileup as input!
The "annoy" part is from an autogenerated template. For questions like this that end up being about best-practices and methods comparisons, cross-posting isn't a problem. But it's always best to have a link to the other posts so people who see the question later and are interested in all of the replies can find them.