Question

Question on GFOLD input

0

Entering edit mode

6.9 years ago

mmccarthy781 ▴ 10

Hey all,

I have RNA-seq data from an experiment done with no replicates. The lab I'm with would like me to perform differential expression analysis with this data. I realize that this is not ideal, but I'm trying to make the best out of a bad situation. I've already told the lab that any output I produce isn't fit for publication, and really can only be used within the lab to base future projects on. From what I've read online, it seems like my best option is to put my read counts through GFOLD, and base everything off of the GFOLD values. The issues is that I am quantifying my reads using alignment free quantification using Kallisot, thus I don't have a SAM/BED file to use as input with gfold count.

Without using gfold count, I wish to make my own input for gfold diff, but I don't know which values are required for accurate calculation of GFOLD values. gfold diff normally takes in a tab delimited csv file with five columns: "GeneSymbol", "GeneName", "Read Count", "Gene exon length", and "RPKM". I have gene level counts from putting Kallisto's read counts through tximport. I can easily create a text file of the correct format which includes both the "GeneSymbol" and "Read Count" columns while leaving the others as NA; furthermore, this input will produce output with GFOLD values. However, I'm currently unsure if I can trust these GFOLD values as being accurate having left the "Gene exon length" and "RPKM" columns empty. Are these other colums required for accurate GFOLD value calculation?

Gfold RNA-Seq • 3.1k views

ADD COMMENT • link 6.9 years ago by mmccarthy781 ▴ 10

score 1 · Answer 1 · 2017-07-03

Not sure how helpful my answer will be to you - I've done the same ugly thing using Trinity and edgeR, there it was pretty straightforward.

First, make a matrix of expression values based on your kallisto output:

https://github.com/trinityrnaseq/trinityrnaseq/wiki/Trinity-Transcript-Quantification#build-transcript-and-gene-expression-matrices

$TRINITY_HOME/util/abundance_estimates_to_matrix.pl --est_method kallisto \
    --out_prefix trans_counts \
    --name_sample_by_basedir \
     sampleA/abundance.tsv \
     sampleB/abundance.tsv

Replace the abundance.tsv files with your kallisto outputs. It assumes that each sample has its own folder, so sampleA is a sample name here.

This will generate several matrices, most importantly trans_counts.counts.matrix, the raw counts. You can use that table via Trinity in edgeR:

https://github.com/trinityrnaseq/trinityrnaseq/wiki/Trinity-Differential-Expression#running-differential-expression-analysis

  $TRINITY_HOME/Analysis/DifferentialExpression/run_DE_analysis.pl \
      --matrix counts.matrix --method edgeR --dispersion 0.1

Play around with a variety of dispersion values, I'd say 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, I bet at around 0.4 or 0.5 the majority of significant p-values will disappear, use the value before everything becomes insignificant for your final dispersion value.

No clue how to do that with GFOLD. I agree with you that it's not ideal and it's good that you raised this issue with your lab, no reviewer will accept this.

score 1 · Answer 2 · 2017-07-03

So for anyone with the same issues as me, I think I've found the answer. I created a csv file with simulated exon lengths and RPKM values for each gene, and compared the returned results when running GFOLD on the new csv file to the old one with missing data. Checking the returned results, it is seen that the GFOLD, E-FDR, and log2fdc values are identical between the two outputs, so you only need a gene identifier and read counts for accurate calculation.