Question

Normalizing reads between samples for aneuploidy detection

9.3 years ago

Paul ★ 1.5k

Dear biostar users,

I would like to detect aneuploidy between samples. I have a reference (a healthy sample) and aneuploid samples.

After running my workflow, the output has these columns:

bins no_reads GC_content

So I have split each chromosome into bins and calculated the number of reads and the GC content for each bin. Do you have any idea how to normalize the read counts so that I can apply a statistical test to compare the samples with each other?

I was thinking of applying a normalization used for RNA-seq data: sequencing depth (RPGC) = (total number of mapped reads * fragment length) / effective genome size (analogous to RPKM), but it is probably not a good idea.
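A minimal sketch of that formula in Python, to make the arithmetic concrete. The function names and the example numbers (10 million mapped reads, 35 bp fragments, an effective genome size of 2.7 Gb) are hypothetical and only illustrate the calculation.

```python
def sequencing_depth(mapped_reads, fragment_length, effective_genome_size):
    """Average coverage: (mapped reads * fragment length) / effective genome size."""
    return mapped_reads * fragment_length / effective_genome_size


def scale_to_1x(bin_counts, mapped_reads, fragment_length, effective_genome_size):
    """Divide every bin count by the sample's depth so samples share one scale."""
    depth = sequencing_depth(mapped_reads, fragment_length, effective_genome_size)
    return [count / depth for count in bin_counts]


if __name__ == "__main__":
    # Hypothetical sample: 10 million mapped reads of 35 bp.
    depth = sequencing_depth(10_000_000, 35, 2_700_000_000)
    print(f"depth = {depth:.4f}x")
    print(scale_to_1x([13, 26], 10_000_000, 35, 2_700_000_000))
```

Note that this divides every bin in a sample by the same constant, so it only adjusts for library size. It does nothing about GC bias, which varies from bin to bin.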

Thank you for any comments and for sharing your experience.

reads gc content normalization aneuploidy • 3.0k views

ADD COMMENT • link updated 2.1 years ago by Ram 43k • written 9.3 years ago by Paul ★ 1.5k

Ram · Answer 1 · 2015-01-12

9.3 years ago

Irsan ★ 7.8k

So you want to identify somatic copy number alterations? (Maybe somatic is not the right word for your application; I mean case vs. matched control.) Is it for exome sequencing, whole genome sequencing, or even the RNA-seq data you mentioned? In the latter case I would use normalized expression estimates from edgeR and calculate log(expression ratio case/control). Then visualize these numbers along the genome to see if it is going anywhere. If not, I would use 200 kb bins instead of per-gene data and see if that is better. It is very important that you blacklist parts of the reference genome where the ploidy cannot be accurately quantified by depth of coverage. For genome/exome you can use the GEM library for that. If you are using exome sequencing, it is important to identify baits that perform poorly. Then you have to correct for GC content by loess regression. Instead of providing you with the code to do so, I recommend using tools that do this for you, like Control-FREEC or QDNAseq.
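For anyone who still wants to see the idea behind GC correction in code, here is a rough sketch. It is not loess: it uses a simpler stand-in, GC-stratum median scaling, where bins with similar GC content are grouped and each count is rescaled by the ratio of the global median to the median of its group. The function name and the `gc_step` parameter are hypothetical, and GC content is assumed to be a fraction between 0 and 1.

```python
from collections import defaultdict
from statistics import median


def gc_correct(counts, gc_fractions, gc_step=0.01):
    """Rescale each bin count by (global median / median of its GC stratum)."""
    strata = defaultdict(list)
    keys = []
    for count, gc in zip(counts, gc_fractions):
        key = round(gc / gc_step)  # bins with similar GC share one stratum
        keys.append(key)
        strata[key].append(count)

    global_median = median(counts)
    stratum_median = {key: median(vals) for key, vals in strata.items()}

    corrected = []
    for count, key in zip(counts, keys):
        m = stratum_median[key]
        # A stratum whose median is zero carries no usable signal; report 0.
        corrected.append(count * global_median / m if m > 0 else 0.0)
    return corrected


if __name__ == "__main__":
    # Toy data: the high-GC bins get twice as many reads as the low-GC bins.
    counts = [100, 110, 200, 220]
    gc = [0.40, 0.40, 0.50, 0.50]
    print(gc_correct(counts, gc))
```

A loess fit does the same job with a smooth curve of count against GC instead of discrete strata, which behaves better when some GC values are covered by only a few bins.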

Thank you, Irsan, for your comment. I have whole genome sequencing data (single-end, 35 bp read length). I am creating my own pipeline for aneuploidy detection. I have the bins prepared and am calculating the GC content and number of reads per bin. Yes, I was thinking of using loess regression to normalize my data and then a Z-score to find differences between the reference and the aneuploid sample.
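One possible reading of the Z-score step, sketched in Python: take the ratio of sample to reference in each bin (after dividing each by its library total) and score every ratio against the mean and standard deviation of all ratios in the same sample. This is an assumption about what the Z-score is computed over, and the function name is hypothetical. The counts are assumed to be GC-corrected already.

```python
from statistics import mean, stdev


def ratio_z_scores(sample_counts, reference_counts):
    """Z-score of each bin's sample/reference ratio against all bins in the sample."""
    sample_total = sum(sample_counts)
    reference_total = sum(reference_counts)

    ratios = {}
    for i, (s, r) in enumerate(zip(sample_counts, reference_counts)):
        if r == 0:
            continue  # no reference signal in this bin, ratio is undefined
        ratios[i] = (s / sample_total) / (r / reference_total)

    mu = mean(ratios.values())
    sd = stdev(ratios.values())
    if sd == 0:
        raise ValueError("all ratios are identical; z-scores are undefined")
    # Keys are the original bin indices, so skipped bins are simply absent.
    return {i: (ratio - mu) / sd for i, ratio in ratios.items()}


if __name__ == "__main__":
    # Toy data: the last bin has 1.5x the reads of the reference.
    sample = [100, 100, 100, 100, 150]
    reference = [100, 100, 100, 100, 100]
    for i, z in ratio_z_scores(sample, reference).items():
        print(i, round(z, 3))
```

With a single reference sample this only flags bins that stand out from the rest of the same genome. Scoring a bin or chromosome against the same region in several healthy samples would need a panel of references instead.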

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 9.3 years ago by Paul ★ 1.5k

Hi Paul, OK, so you have decided to implement copy number detection yourself. Keeping things in your own hands instead of relying on tools gives you flexibility. I still think it is very useful to define low-mappability regions with the GEM library and to omit those from your analyses. What do you mean by z-scores? Will you use a z-score of a particular bin compared to all bins within a sample, or a z-score of your bin compared to the same bin in other samples? The standard statistic for copy number estimates is log2(signal case/signal control). I am unsure whether all statistical assumptions are valid when using z-scores for your purpose. BTW, if you have sufficient coverage (at polymorphic positions) you can do allele-specific copy number calling/LOH analysis. Good luck!
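A small sketch of the log2(signal case/signal control) statistic per bin, summarized per chromosome. The function names, the total-count scaling and the pseudocount of 0.5 are assumptions made for illustration, not something prescribed in this thread. The input is assumed to be GC-corrected counts with low-mappability bins already removed.

```python
import math
from collections import defaultdict
from statistics import median


def log2_ratios(case_counts, control_counts, pseudocount=0.5):
    """Per-bin log2(case / control) after scaling case to the control's total."""
    scale = sum(control_counts) / sum(case_counts)
    return [
        # The pseudocount keeps empty bins from producing log2(0) or division by zero.
        math.log2((case * scale + pseudocount) / (control + pseudocount))
        for case, control in zip(case_counts, control_counts)
    ]


def chromosome_medians(chromosomes, ratios):
    """Median log2 ratio of the bins on each chromosome."""
    grouped = defaultdict(list)
    for chrom, ratio in zip(chromosomes, ratios):
        grouped[chrom].append(ratio)
    return {chrom: median(values) for chrom, values in grouped.items()}


if __name__ == "__main__":
    # Toy data: chr21 bins carry 1.5x the control signal.
    chroms = ["chr1", "chr1", "chr21", "chr21"]
    case = [100, 100, 150, 150]
    control = [100, 100, 100, 100]
    print(chromosome_medians(chroms, log2_ratios(case, control)))
```

In a pure sample, a single-copy gain on a diploid background is expected near log2(3/2), about 0.58, and a single-copy loss near log2(1/2) = -1. Scaling by total counts pulls all values slightly toward zero when a large part of the genome is gained or lost, as the toy data show.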

ADD REPLY • link 9.3 years ago by Irsan ★ 7.8k