Filtering genes from cuffdiff results
19 months ago
sujaypatil • 0

I have run cuffdiff (with statistics turned ON) to compare two groups of samples: Control group and Late AD group.

This is the command I ran to be precise:

cuffdiff -L Control,AD_Late_Braak -p 8 --total-hits-norm --frag-bias-correct ../References/ensembl.GRCh38.99.fa --multi-read-correct --library-norm-method quartile ../References/Homo_sapiens.GRCh38.99.chr.gtf Early_Braak_Control_1,Early_Braak_Control_2,Early_Braak_Control_3 Late_Braak_Sample_1,Late_Braak_Sample_2,Late_Braak_Sample_3


I'm looking at the output from cuffdiff and I see a gene_exp.diff file, which contains the results of the differential expression testing. I want to know the best way to filter the results in gene_exp.diff so that the list of up-regulated and down-regulated genes is restricted to roughly 50-850 genes.

P.S. My first thought is to tweak the p-value and log2 fold change cutoffs, but that seems like a trial-and-error method, so I was wondering if there is a more "formal" approach?

Thanks!

RNA-Seq tophat2 cuffdiff • 518 views
Hello sujaypatil!

It appears that your post has been cross-posted to another site: https://bioinformatics.stackexchange.com/questions/11741/filtering-genes-from-cuffdiff-results

This is typically not recommended as it runs the risk of annoying people in both communities.

19 months ago
ATpoint 55k

First of all, I would abandon TopHat and Cufflinks, since both tools are now considered deprecated. I would switch to a quantification tool such as salmon or kallisto, followed by differential analysis with something like DESeq2 or edgeR. Both of the latter can test against a chosen fold-change threshold (instead of the default test against a log fold change of 0), which lets you reduce the number of DGEs to those that are probably the most biologically meaningful. The edgeR authors recommend this if you feel that you have "too many genes" and want to filter in a data-driven fashion without tweaking the p-value cutoffs too much. Effectively, it means that only genes with larger FCs are retained. In edgeR the function is called glmTreat; for the DESeq2 equivalent, please check the documentation.

Still, I think there is no formal way of obtaining exactly <int> DGEs, since this is not how DGE analysis works. It only tells you how many genes, at the given sequencing depth and number of replicates, are significantly different from the expectation, which in turn is based on the underlying model.
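For anyone who has to stay with the cuffdiff output, here is a rough sketch of what post-hoc filtering of gene_exp.diff could look like. It is not equivalent to a formal test against a fold-change threshold such as glmTreat; it simply keeps rows with status OK, a q_value at or below a cutoff, and an absolute log2 fold change at or above a cutoff. The function name filter_gene_exp, the default cutoffs (0.05 and 1.0), and the toy rows are all assumptions made for illustration; the column names are the ones cuffdiff writes to gene_exp.diff.

```python
import csv
import io
import math

# Column names as written by cuffdiff in gene_exp.diff.
HEADER = ["test_id", "gene_id", "gene", "locus", "sample_1", "sample_2",
          "status", "value_1", "value_2", "log2(fold_change)", "test_stat",
          "p_value", "q_value", "significant"]

# Made-up rows for illustration only; these are NOT real results.
ROWS = [
    ["XLOC_1", "XLOC_1", "GENE_A", "1:1-100", "Control", "AD_Late_Braak",
     "OK", "10", "40", "2.0", "3.1", "0.0005", "0.01", "yes"],
    ["XLOC_2", "XLOC_2", "GENE_B", "1:200-300", "Control", "AD_Late_Braak",
     "OK", "40", "10", "-2.0", "-3.0", "0.001", "0.02", "yes"],
    ["XLOC_3", "XLOC_3", "GENE_C", "2:1-100", "Control", "AD_Late_Braak",
     "OK", "10", "12", "0.263", "0.9", "0.002", "0.03", "yes"],
    ["XLOC_4", "XLOC_4", "GENE_D", "2:200-300", "Control", "AD_Late_Braak",
     "NOTEST", "0", "0", "0", "0", "1", "1", "no"],
    ["XLOC_5", "XLOC_5", "GENE_E", "3:1-100", "Control", "AD_Late_Braak",
     "OK", "10", "30", "1.585", "1.5", "0.05", "0.20", "no"],
    ["XLOC_6", "XLOC_6", "GENE_F", "3:200-300", "Control", "AD_Late_Braak",
     "OK", "0", "25", "inf", "0", "0.0005", "0.01", "yes"],
]
EXAMPLE = "\n".join("\t".join(r) for r in [HEADER] + ROWS) + "\n"


def filter_gene_exp(handle, max_q=0.05, min_abs_log2fc=1.0):
    """Return (up, down) lists of (gene, log2fc, q) tuples passing both cutoffs.

    Hypothetical helper. "up" means a positive log2 fold change, i.e. higher
    in sample_2 than in sample_1 as reported by cuffdiff.
    """
    up, down = [], []
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["status"] != "OK":  # skip NOTEST, LOWDATA, HIDATA, FAIL
            continue
        q = float(row["q_value"])
        lfc = float(row["log2(fold_change)"])  # float() also parses inf/-inf
        if math.isnan(lfc) or q > max_q or abs(lfc) < min_abs_log2fc:
            continue
        (up if lfc > 0 else down).append((row["gene"], lfc, q))
    up.sort(key=lambda r: r[1], reverse=True)  # strongest increase first
    down.sort(key=lambda r: r[1])              # strongest decrease first
    return up, down


up, down = filter_gene_exp(io.StringIO(EXAMPLE))
print(len(up), "up;", len(down), "down")

# With a real file (the path is an assumption):
# with open("gene_exp.diff", newline="") as fh:
#     up, down = filter_gene_exp(fh)
```

Note that genes with zero expression in one group get an infinite log2 fold change, so they always pass the fold-change cutoff here; whether to keep them is a judgment call. The cutoffs remain arbitrary choices, which is exactly the limitation described above.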

Thanks a tonne for the recommendation! I will keep the salmon / kallisto + DESeq2 / edgeR stack in mind for future analyses. However, as part of a college assignment we have had to use the Tuxedo suite of tools for DE analysis, so I have gone ahead and run the analysis using cuffdiff for now. I understand that there is a package called CummeRbund which also helps filter out the most significant DGEs. Thanks for the help! Per your recommendation I will experiment with edgeR.

I've moved ATpoint's comment to an answer. If it was helpful, you should upvote it; if it resolved your question, you should mark it as accepted. You can accept more than one answer if they all work.