Question: Cufflinks transcriptome (To build or not to build?)
Morris_Chair wrote:

Dear All,

I’m starting to use Cufflinks after the TopHat2 alignment. Looking at the manual, I can see that there are several options for this tool, but what is not clear to me is whether I have to use the transcriptome or the genome as the reference. I ran the program using this command line:

 cufflinks -p 4 -N -g Homo_sapiens.GRCh37.75.gtf file.bam -o cuffresults

As you can see, I used the reference genome annotation. Is that correct? I’m also wondering whether I can use a transcriptome annotation downloaded from Ensembl (Homo_sapiens.GRCh38.cdna.all.fa.gz), or whether I have to build it from the reference genome.

Thank you

rna-seq • 212 views
modified 5 months ago by WouterDeCoster • written 5 months ago by Morris_Chair

Please use the "question" post type if you are asking questions, and not "Forum". I have converted your post for you this time (and your previous posts...), but please keep this in mind for further posts.

written 5 months ago by WouterDeCoster

You should probably not be using TopHat2; see this tweet from Lior Pachter, one of the creators of TopHat and Cufflinks.

modified 5 months ago • written 5 months ago by kristoffer.vittingseerup

Hi Kristoffer, thank you for the recommendation. At the moment my aim is to become more familiar with the code and commands used in tools for RNA-seq analysis. I'm also considering HISAT, and to be honest featureCounts is currently at the top of my list.

written 5 months ago by Morris_Chair

Even if you just want gene expression, I would recommend doing transcript-level quantification: it gives more accurate gene-level estimates (see this blog). For considerations regarding transcript-level quantification, check this section of my vignette.

Btw, even if you decide on using featureCounts you still need files mapped to the genome, so the HISAT run is still needed :-)

written 5 months ago by kristoffer.vittingseerup

Kevin Blighe wrote:

For Cufflinks, you do not have to supply a FASTA reference genome or transcriptome. The GTF file is an annotation file that details the co-ordinates of exons/CDS, UTRs, etc.

Also, Homo_sapiens.GRCh37.75.gtf and Homo_sapiens.GRCh38.cdna.all.fa.gz relate to different genome builds; so, definitely do not use these together.
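
A quick way to catch this kind of mix-up before a long run is to compare the assembly names embedded in the file names. The sketch below assumes Ensembl-style names that contain a token such as GRCh37 or GRCh38; the helper names build_of and same_build are hypothetical and are not part of any tool.

```bash
# Hypothetical helper: print the assembly token (e.g. GRCh37) found in a
# file name, or nothing if the name contains no such token.
build_of() {
    printf '%s\n' "$1" | grep -oE 'GRCh[0-9]+' | head -n 1
}

# Hypothetical helper: succeed only if both names carry the same token.
same_build() {
    a=$(build_of "$1")
    b=$(build_of "$2")
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# The two files named in the question.
gtf=Homo_sapiens.GRCh37.75.gtf
fasta=Homo_sapiens.GRCh38.cdna.all.fa.gz

if same_build "$gtf" "$fasta"; then
    echo "OK: both files are $(build_of "$gtf")"
else
    echo "MISMATCH: $gtf is $(build_of "$gtf"), $fasta is $(build_of "$fasta")"
fi
```

This only inspects file names, so it will not notice a renamed file; it is a sanity check, not a guarantee.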

Perhaps a simple check of the help page will enlighten you - I have pulled out the parameters that may be of interest to you:

  -G/--GTF                     quantitate against reference transcript annotations                      
  -g/--GTF-guide               use reference transcript annotation to guide assembly                   
  -b/--frag-bias-correct       use bias correction - reference fasta required        [ default:   NULL ]
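
To make the difference between these three parameters concrete, here is a minimal sketch of how each might be used with the files from the question. It is a dry run by default: the hypothetical run wrapper only prints the commands, and setting DRY_RUN=0 would execute them (which needs Cufflinks installed). The name genome.fa and the output directory names are assumptions.

```bash
# Hypothetical dry-run wrapper: print the command unless DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

gtf=Homo_sapiens.GRCh37.75.gtf   # annotation from the question
bam=file.bam                     # TopHat2 alignments from the question
genome=genome.fa                 # assumption: the FASTA the TopHat2 index was built from

# 1) -G: quantify the annotated transcripts only, no novel assembly.
run cufflinks -p 4 -G "$gtf" -o cuff_quant "$bam"

# 2) -g: assemble transcripts, using the annotation as a guide
#    (this is the mode used in the question).
run cufflinks -p 4 -g "$gtf" -o cuff_guided "$bam"

# 3) -b: as in (1), plus bias correction, which needs the genome FASTA.
run cufflinks -p 4 -G "$gtf" -b "$genome" -o cuff_quant_bias "$bam"
```

In all three cases the GTF supplies coordinates only; the genome sequence comes into play only when -b is given.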
modified 5 months ago • written 5 months ago by Kevin Blighe

Hi Kevin, I have a couple more questions. What would you use as the reference transcript annotation? And where did you get the reference FASTA required for the bias correction (-b)?

Thank you for your help

written 5 months ago by Morris_Chair

On the reference genome, I imagine it should be the same as the one to which you originally aligned your reads via TopHat2 / Bowtie2. If you have already performed the TopHat2 step, then you should have a reference genome FASTA file?

The GTF can really be anything. I have used custom GTFs in the past where I had, for example, a suspected novel long non-coding RNA in breast cancer. I edited the GTF and added my own entry for it. Most would take the GTFs from GENCODE: https://www.gencodegenes.org/
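
As an illustration of adding a custom entry, the sketch below prints one exon record in the nine-column, tab-separated GTF layout with gene_id and transcript_id attributes. The function name gtf_exon is hypothetical, and the coordinates and IDs are placeholders, not a real lncRNA. The sequence name is assumed to follow the Ensembl style ("17" rather than "chr17") and should match the names used in the BAM file.

```bash
# Hypothetical helper: print one exon record in GTF format.
# Arguments: seqname start end strand gene_id transcript_id
gtf_exon() {
    printf '%s\tcustom\texon\t%s\t%s\t.\t%s\t.\tgene_id "%s"; transcript_id "%s";\n' \
        "$1" "$2" "$3" "$4" "$5" "$6"
}

# Placeholder values only: a two-exon transcript on the plus strand.
gtf_exon 17 1000 1500 + LNC_HYPO_G1 LNC_HYPO_T1
gtf_exon 17 2000 2600 + LNC_HYPO_G1 LNC_HYPO_T1
```

To extend an existing annotation, the output could be appended to a copy of it, for example with `>> custom.gtf`, leaving the original GTF untouched.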

written 5 months ago by Kevin Blighe

Hello Kevin, I have another question. Once Cufflinks has created the transcripts.gtf for each of my files on the supercomputer, I'll have to use cuffmerge to merge the assemblies.

The command line is

cuffmerge -g gene.gtf -s genome.fa -p 8 assemblies.txt

I don't understand why I have to supply both the reference annotation (GTF) and the genome in FASTA format (genome.fa). It looks to me as if I'm using the same thing twice.

thank you

written 5 months ago by Morris_Chair

Hey, you have the option of choosing whichever GTF you wish. The TopHat2 / Cufflinks pipeline can involve some going back and forth. For example, you may want to first create a transcriptome GTF from all of your samples (using cuffmerge), guided by the current reference transcript annotation (e.g., from GENCODE). Once you have defined your custom transcriptome GTF, you can go back and estimate abundance over its transcripts with cufflinks via the -G/--GTF command line parameter.
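
The back-and-forth described here can be sketched as three steps. Note that the GTF and the FASTA are not the same thing twice: the GTF holds only coordinates of known transcripts, while the FASTA holds the genome sequence itself, which (as far as the Cufflinks manual describes -s) cuffmerge uses to help classify the assembled transfrags. The sample names, output directories and the run wrapper below are hypothetical, and the commands are only printed unless DRY_RUN=0 is set.

```bash
# Hypothetical dry-run wrapper: print the command unless DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

annotation=gene.gtf        # reference annotation, as in the question
genome=genome.fa           # genome sequence, as in the question
samples="sampleA sampleB"  # hypothetical sample names

# 1) List the per-sample assemblies, one transcripts.gtf path per line.
: > assemblies.txt
for s in $samples; do
    printf 'cuff_%s/transcripts.gtf\n' "$s" >> assemblies.txt
done

# 2) Merge them into one transcriptome, guided by the reference annotation.
run cuffmerge -g "$annotation" -s "$genome" -p 8 -o merged_asm assemblies.txt

# 3) Go back and quantify every sample against the merged transcriptome.
for s in $samples; do
    run cufflinks -p 8 -G merged_asm/merged.gtf -o "quant_$s" "$s.bam"
done
```

Step 3 uses -G rather than -g because the transcript set is now fixed and only abundances are wanted.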

By the way, the successors to TopHat2 / Cufflinks are HISAT2 / StringTie.
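
For orientation, a rough HISAT2 / StringTie counterpart of the TopHat2 / Cufflinks steps might look like the sketch below. It assumes paired-end reads, a prebuilt HISAT2 index called genome_index, and a mergelist.txt holding one per-sample GTF path per line; all file names and the run wrapper are hypothetical. Commands are only printed unless DRY_RUN=0 is set.

```bash
# Hypothetical dry-run wrapper: print the command unless DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

gtf=Homo_sapiens.GRCh37.75.gtf   # same annotation as in the question

# Align to the genome; --dta tailors the alignments for transcript assemblers.
run hisat2 -p 4 --dta -x genome_index -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz -S sample.sam

# StringTie expects coordinate-sorted alignments.
run samtools sort -@ 4 -o sample.bam sample.sam

# Assemble transcripts, guided by the reference annotation.
run stringtie sample.bam -p 4 -G "$gtf" -o sample.gtf

# Merge the per-sample assemblies listed in mergelist.txt.
run stringtie --merge -G "$gtf" -o merged.gtf mergelist.txt

# Re-estimate abundances against the merged transcripts only (-e).
run stringtie sample.bam -p 4 -e -G merged.gtf -o sample_quant.gtf
```

The sorted BAM from the second step is also the kind of genome-mapped file that featureCounts takes as input.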

modified 5 months ago • written 5 months ago by Kevin Blighe

Thank you Kevin, I think I will stop the run and start working with HISAT... I'm sorry, because I was almost halfway there after running for two days...

thanks again

written 5 months ago by Morris_Chair

Not to worry. Nothing that we do in life is ever time wasted.

written 5 months ago by Kevin Blighe