Question: Coefficient of variation
2.2 years ago by
nicoles10 wrote:

I am a newbie coming from a wet-lab background. We recently started doing RNA-Seq. Originally, our bioinformatics core was going to handle the analysis, but then that person went on sabbatical, so I started using Galaxy to analyze our data. My PI has set criteria (based on the literature) to apply before proceeding to GO terms. One of them is to include only genes with a CV less than or equal to 0.5. Can I do this in Galaxy? If not, could someone please tell me how to do it manually?

I went through TopHat, Cufflinks, Cuffcompare, and Cuffdiff based on a colleague's recommendation. I also have a separate workflow of htseq-count followed by DESeq2.

Any help will be greatly appreciated.


modified 2.2 years ago by Renesh1.8k • written 2.2 years ago by nicoles10
2.2 years ago by
United States
Renesh1.8k wrote:

CV calculations are useful if you want to select stably and consistently expressed genes from your RNA-seq datasets. The calculation is straightforward and involves only the standard deviation and the mean: CV = SD/Mean. The CV tells you the extent of variability, relative to the mean, in your gene expression dataset. Your PI is telling you to include only the genes that are stably expressed across replicates/experiments, i.e. those with a low CV (0.5 or less).

I am not sure whether Galaxy can do basic statistical calculations on tabular data. To calculate the CV, you can use a database like PostgreSQL (psql) or a spreadsheet like Excel. You can calculate the CV on the htseq-count raw data and then proceed to the DESeq package. Most gene expression packages estimate a dispersion, which accounts for the CV.
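If R is an option, the per-gene calculation could look roughly like the sketch below. The count matrix, gene names, and replicate names are made up for illustration, and the sketch assumes that all columns are replicates of the same condition; with real data you would read in your htseq-count table instead.

```r
# Hypothetical raw count matrix: rows are genes, columns are replicates.
counts <- matrix(
  c(100, 110,  90, 105,   # geneA: stable across replicates
     10,  80,   5,  60,   # geneB: highly variable
      0,   0,   0,   0),  # geneC: not expressed
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), paste0("rep", 1:4))
)

gene_mean <- rowMeans(counts)
gene_sd   <- apply(counts, 1, sd)

# CV = SD / mean. Genes with a mean of 0 give NaN and are dropped below.
cv <- gene_sd / gene_mean

# Keep only genes with a defined CV of 0.5 or less.
keep <- !is.na(cv) & cv <= 0.5
stable_counts <- counts[keep, , drop = FALSE]

print(round(cv, 3))
print(rownames(stable_counts))
```

In this toy example only geneA passes the filter. Note that raw counts are not corrected for library size, so it may be preferable to apply the same calculation to normalized values.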

written 2.2 years ago by Renesh1.8k

Thank you. I'll calculate it from the htseq-count data. For my own understanding, and so that I can explain it to my PI: is it also acceptable to calculate the SD and mean from the Cufflinks FPKM values?

written 2.2 years ago by nicoles10

Yes, you can also calculate the CV from FPKM values. FPKM is a normalized count.

written 2.2 years ago by Renesh1.8k

I want to extract unstable/inconsistently expressed genes from gene expression data, and I calculated the CV as follows:

SD <- apply(eset_HTA20,1, sd)
CV <- base::sqrt(exp(SD^2)-1)

but I got an unusual range of CV values:

> summary(CV)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
 0.04753  0.12946  0.16494  0.20181  0.22925 15.00777

I think the CV should not be more than 1; please correct me if I am wrong. Also, how can I retain the genes that show a high amount of variation in expression level? Any ideas?
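For reference, the formula sqrt(exp(SD^2) - 1) is the CV of log-normally distributed data, where SD is the standard deviation on the natural-log scale; it has no upper limit of 1. The sketch below illustrates this with a made-up matrix. It assumes the values are on a log2 scale (common for normalized array data, but an assumption here) and converts them to the natural-log scale first. Keeping the top 2 genes is an arbitrary choice for the example.

```r
# Hypothetical log2 expression matrix: rows are genes, columns are samples.
expr_log2 <- matrix(
  c(8.0, 8.1, 7.9, 8.0,    # geneA: nearly constant
    4.0, 6.5, 3.0, 7.5,    # geneB: highly variable
    9.0, 9.6, 8.7, 9.3),   # geneC: moderately variable
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), paste0("s", 1:4))
)

# SD on the natural-log scale (log2 value * ln(2) = ln value).
sd_ln <- apply(expr_log2 * log(2), 1, sd)

# Log-normal CV: sqrt(exp(sigma^2) - 1). It exceeds 1 once sigma^2 > ln(2).
cv <- sqrt(exp(sd_ln^2) - 1)

# Retain the most variable genes, here the top 2 by CV.
top_variable <- names(sort(cv, decreasing = TRUE))[1:2]

print(round(cv, 3))
print(top_variable)
```

Here geneB gets a CV well above 1, and sorting by CV in decreasing order is one simple way to retain the most variable genes.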

written 5 months ago by Jurat Shahidin80
2.2 years ago by
nicoles10 wrote:

Thank you for replying, Kevin. I am trying to learn bioinformatics for myself and our lab. It is definitely an essential skill to have. After obtaining the raw counts for my RNA-Seq samples from Kallisto, can I then determine differentially expressed genes with DESeq2? Could I use DESeq2 through Galaxy after I obtain the counts from Kallisto? Thanks!

written 2.2 years ago by nicoles10

I hope that a tool like Galaxy accepts Kallisto-derived counts, or at least a custom matrix of counts. However, if the HTSeq option is already built into Galaxy, then you should stick with HTSeq. As far as I recall, you'll therefore have to align the reads to produce a BAM file, from which HTSeq counts transcript abundances (Kallisto and other modern tools don't require a BAM alignment).

There is a great thread here for RNA-seq and Galaxy, which you may have already seen:

modified 2.2 years ago • written 2.2 years ago by Kevin Blighe52k

Yes, I did need the BAM files for htseq-count. As there will be more RNA-seq data coming, I would like to know quicker methods of quantification. In the near future I'll find out whether Galaxy accepts the Kallisto counts. The tutorial has helped greatly.

written 2.2 years ago by nicoles10

Instead of relying on Galaxy, I would suggest using an HPC cluster or a workstation for quicker and more customized analysis.

modified 2.2 years ago • written 2.2 years ago by Renesh1.8k