I am a newbie to RNA-seq data analysis. I have to identify differentially expressed genes (DEGs) between human and chimpanzee in one tissue type. I have comparable RNA-seq data (reads in FASTQ format) for the two species. Each species has 2 biological replicates (each with three technical replicates), so six runs per species.
I understand that identifying DEGs with the Cufflinks package (cuffdiff) is meant for two conditions that share the same reference genome. To identify DEGs between different species, I have to use edgeR or DESeq.
I intend to obtain FPKM values for all genes in all 12 runs (6 runs per species) and then use this FPKM dataset to identify DEGs with an R package (edgeR or DESeq). Is this approach okay?
Second, my main question is about the FPKM values I am getting in the Cufflinks output. To run Cufflinks, I am following the step-by-step protocol in the Cufflinks protocol paper (https://www.nature.com/articles/nprot.2012.016).
First I ran TopHat with the following command:
tophat -p 8 -G hg38.ncbiRefSeq.gtf -o Human_B1_T1 hg38 SRRxxx_1.fastq SRRxxx_2.fastq
Then I ran Cufflinks as below:
cufflinks -p 8 -o Clout_Human_B1_T1 Human_B1_T1/accepted_hits.bam
The 'genes.fpkm_tracking' file in the Cufflinks output has these first few lines:
tracking_id class_code nearest_ref_id gene_id gene_short_name tss_id locus length coverage FPKM FPKM_conf_lo FPKM_conf_hi FPKM_status
CUFF.1 - - CUFF.1 - - chr1:151793-152723 - - 1.57259 0.924969 2.22021 OK
CUFF.2 - - CUFF.2 - - chr1:153030-158982 - - 0.924186 0 6.23538e+06 OK
CUFF.3 - - CUFF.3 - - chr1:633736-634228 - - 12.1477 9.07784 15.2175 OK
Could someone please tell me what CUFF.1, CUFF.2 (and so on) mean? Apart from the first (tracking_id) column, the same value also appears in the fourth (gene_id) column. How can I get FPKM values along with gene names? There are no gene names in this file.
I found this (https://biostar.usegalaxy.org/p/17760/) relevant post but couldn't find a clear answer there.
PS: For the hg38 genes.gtf file, I used the file 'hg38.ncbiRefSeq.gtf' downloaded from the UCSC portal.
Correct. See even the latest release notes from the TopHat team themselves: https://ccb.jhu.edu/software/tophat/index.shtml
Thank you. Cufflinks was already installed on the system, so I proceeded with that. I will look into the newer programs as well. Could you please also tell me what these codes in the Cufflinks output mean, and how to get Cufflinks output that shows gene names alongside the FPKM values?
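On the CUFF.N question: these are placeholder IDs that Cufflinks assigns to loci it assembles itself when it is run without a reference annotation, as in the cufflinks command above. The -G given to TopHat only guides the alignment. Cufflinks has its own -G/--GTF option for quantifying against an annotation, in which case the IDs and, where the GTF provides them, the gene names should come from that file. As a rough illustration, here is a minimal Python sketch that reads a genes.fpkm_tracking table and lists the loci that have no gene name. It assumes the file is tab-separated with the header shown above; the function names are made up for this example, and the rows are the three from the post.

```python
import csv
import io

# Column names as they appear in the genes.fpkm_tracking header quoted above.
HEADER = ["tracking_id", "class_code", "nearest_ref_id", "gene_id",
          "gene_short_name", "tss_id", "locus", "length", "coverage",
          "FPKM", "FPKM_conf_lo", "FPKM_conf_hi", "FPKM_status"]

# The three rows quoted in the post.
ROWS = [
    ["CUFF.1", "-", "-", "CUFF.1", "-", "-", "chr1:151793-152723",
     "-", "-", "1.57259", "0.924969", "2.22021", "OK"],
    ["CUFF.2", "-", "-", "CUFF.2", "-", "-", "chr1:153030-158982",
     "-", "-", "0.924186", "0", "6.23538e+06", "OK"],
    ["CUFF.3", "-", "-", "CUFF.3", "-", "-", "chr1:633736-634228",
     "-", "-", "12.1477", "9.07784", "15.2175", "OK"],
]


def read_gene_fpkm(handle):
    """Map tracking_id -> (gene_short_name, FPKM) for a tab-separated table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["tracking_id"]: (row["gene_short_name"], float(row["FPKM"]))
            for row in reader}


def unnamed_loci(table):
    """Return the tracking IDs whose gene_short_name is the '-' placeholder."""
    return sorted(tid for tid, (name, _fpkm) in table.items() if name == "-")


# Build an in-memory stand-in for the real file (assumed tab-separated).
sample = "\n".join("\t".join(fields) for fields in [HEADER] + ROWS) + "\n"
table = read_gene_fpkm(io.StringIO(sample))
print(unnamed_loci(table))
```

On a real run you would open Clout_Human_B1_T1/genes.fpkm_tracking instead of the in-memory sample. If every row comes back unnamed, that suggests the run was a de novo assembly rather than a quantification against the annotation.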
Cufflinks is no longer maintained and people rarely use it anymore, so it's difficult to find someone to help you. Furthermore, you cannot use the FPKM output of Cufflinks with edgeR/DESeq2, which model raw read counts with negative binomial regression to test for differential expression (FPKMs are normalized values and don't follow a negative binomial distribution). So, stop using Cufflinks (the only useful thing about it now is managing assemblies [e.g. from Trinity]). For what it's worth, I actually currently work in the lab where Cufflinks was first developed. All of us use kallisto now.
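To make the point about FPKMs concrete, here is a small Python sketch. It assumes the textbook definition of FPKM (fragments per kilobase of transcript per million mapped fragments); Cufflinks' own estimate is model-based, so this only approximates what it reports. The gene names and numbers are invented for illustration.

```python
def fpkm(fragments, length_bp, total_mapped_fragments):
    """Textbook FPKM: fragments per kilobase per million mapped fragments."""
    return fragments * 1e9 / (length_bp * total_mapped_fragments)


# Two hypothetical genes in a library of 10 million mapped fragments.
TOTAL = 10_000_000
short_gene = {"fragments": 100, "length_bp": 1_000}
long_gene = {"fragments": 1_000, "length_bp": 10_000}

fpkm_short = fpkm(short_gene["fragments"], short_gene["length_bp"], TOTAL)
fpkm_long = fpkm(long_gene["fragments"], long_gene["length_bp"], TOTAL)

# Both genes get the same FPKM although one is backed by ten times
# as many fragments as the other.
print(fpkm_short, fpkm_long)
```

The two genes end up with the same FPKM even though their underlying counts differ tenfold. Count-based tools use the size of the count to judge how noisy a measurement is, and that information is lost once only the FPKM is kept.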
Just install kallisto (or similar software) on your system (installing locally is fine, if you don't have superuser access).
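A hedged sketch of what a kallisto run could look like, written as a Python helper that only builds and prints the command lines without executing them. It assumes kallisto is on your PATH and uses its index and quant subcommands with the -i, -o and -t options. The file names transcripts.fa, transcripts.idx and kallisto_Human_B1_T1 are placeholders, not files from this thread.

```python
import shlex


def kallisto_index_cmd(transcriptome_fasta, index_path):
    """Command that builds a kallisto index from a transcriptome FASTA."""
    return ["kallisto", "index", "-i", index_path, transcriptome_fasta]


def kallisto_quant_cmd(index_path, out_dir, fastq_1, fastq_2, threads=8):
    """Command that quantifies one paired-end run against an existing index."""
    return ["kallisto", "quant", "-i", index_path, "-o", out_dir,
            "-t", str(threads), fastq_1, fastq_2]


def as_shell(cmd):
    """Render an argument list as a copy-pasteable shell line."""
    return " ".join(shlex.quote(arg) for arg in cmd)


# Placeholder file names; replace with your own transcriptome and reads.
index_cmd = kallisto_index_cmd("transcripts.fa", "transcripts.idx")
quant_cmd = kallisto_quant_cmd("transcripts.idx", "kallisto_Human_B1_T1",
                               "SRRxxx_1.fastq", "SRRxxx_2.fastq")

print(as_shell(index_cmd))
print(as_shell(quant_cmd))
```

The quant step writes an abundance.tsv file into the output directory with an est_counts column, which is the kind of count-level value that edgeR and DESeq2 are built around.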
Got it. Thank you so much for the very helpful and detailed reply.