How to extract CDS and protein sequences from a Cufflinks transcripts.gtf file?
10.8 years ago
Rahul Sharma ▴ 660

Hi,

I am using TopHat2 and Cufflinks for gene/transcript identification. I mapped the RNA-Seq reads to the reference genome and then ran Cufflinks to generate the transcripts.gtf file. I generated the transcript sequences using the following command:

gffread -w transcripts.fa -g Masked_for_Tophat.fa transcripts.gtf
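For readers unfamiliar with what this step produces: `gffread -w` writes one spliced sequence per transcript by joining its exons. The Python below is only a rough sketch of that idea, not gffread's actual implementation. The function name `spliced_transcripts` and the in-memory `genome` dictionary are illustrative assumptions, and all exons of a transcript are assumed to lie on one chromosome and one strand.

```python
import re
from collections import defaultdict

# Complement table used when a transcript lies on the minus strand.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def spliced_transcripts(gtf_lines, genome):
    """Join exon sequences per transcript_id (hypothetical helper).

    gtf_lines: iterable of GTF lines (tab-separated, 9 columns).
    genome: dict mapping chromosome name -> sequence string.
    """
    exons = defaultdict(list)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match:
            # GTF coordinates are 1-based and inclusive.
            exons[match.group(1)].append(
                (fields[0], int(fields[3]), int(fields[4]), fields[6])
            )

    sequences = {}
    for transcript_id, parts in exons.items():
        parts.sort(key=lambda p: p[1])  # order exons by genomic start
        seq = "".join(genome[chrom][start - 1:end] for chrom, start, end, _ in parts)
        if parts[0][3] == "-":
            seq = seq.translate(COMPLEMENT)[::-1]  # reverse complement
        sequences[transcript_id] = seq
    return sequences
```

In practice gffread itself remains the better choice; the sketch only shows why the output contains exon sequence without introns.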

Since the Cufflinks transcripts.gtf file contains no CDS information, it is not possible to extract CDS sequences from it directly. I found one tool, TransDecoder, which can predict CDS from input transcripts. Does anyone know how to generate CDS/protein sequences from a Cufflinks transcripts.gtf file?
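A quick way to confirm which feature types a GTF file actually contains is to tally its third column. This is a small sketch; `feature_counts` is a made-up helper name, and the file path in the usage note is just the file from this question.

```python
from collections import Counter


def feature_counts(gtf_lines):
    """Count how often each feature type (GTF column 3) occurs."""
    counts = Counter()
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 9:
            counts[fields[2]] += 1
    return counts
```

Usage would look like `with open("transcripts.gtf") as handle: print(feature_counts(handle))`. If the result has no `CDS` key, the file carries only transcript structure and the coding regions have to be predicted separately.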

In another analysis, I want to train Augustus using this mapping information. For training Augustus, I need CDS/protein sequences. So far I have only used Augustus for gene prediction with intron/exon hints, as mentioned here. I would appreciate your suggestions on this.

Best

cufflinks rna-seq cds • 14k views

Hi @R@hul, the TransDecoder page has a separate section that deals with your exact situation: converting a Cufflinks GTF file into GFF3, extracting the transcripts, finding the longest ORFs (reported both as CDS and PEP sequences), and then generating a new GFF3 that reports these coding regions in the context of the genome.

Here is a link to the relevant section: Starting from a genome-based transcript structure GTF file
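To make "finding the longest ORFs" concrete, here is a deliberately simplified Python sketch. It is not TransDecoder's algorithm: it scans only the three forward frames, requires an ATG start and an in-frame stop, and does no coding-potential scoring or minimum-length filtering. The function names are illustrative, and the codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(cds):
    """Translate a CDS codon by codon; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, len(cds) - 2, 3)
    )


def longest_orf(seq):
    """Return the longest ATG...stop ORF on the forward strand ('' if none)."""
    seq = seq.upper()
    best = ""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon in this open stretch
            elif start is not None and CODON_TABLE.get(codon) == "*":
                if i + 3 - start > len(best):
                    best = seq[start:i + 3]  # keep the stop codon in the CDS
                start = None
    return best


cds = longest_orf("ccATGAAATTTTAGgg")
protein = translate(cds).rstrip("*")
print(cds, protein)
```

Because gffread and the TransDecoder utilities already write transcripts in their sense orientation when strand is known, a forward-only scan is a reasonable first look, but unstranded Cufflinks transcripts would also need the reverse complement checked.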


Hi, I would like to know whether you have figured out how to annotate a transcripts.gtf file generated by Cufflinks.

9.4 years ago
wrf ▴ 70

I'm not sure there is a one-step solution to that. The PASA pipeline includes a script to extract transcripts from cufflinks.gtf, called "cufflinks_gtf_genome_to_cdna_fasta.pl"

http://pasapipeline.github.io

CDS/peptides can be generated from the transcripts as suggested above with TransDecoder.


Thanks, this answer helped me a lot even though my problem was slightly different. Just as a note: the output from this script includes both the transcript_id (TCONS) and the gene_id (XLOC) from the Cufflinks .gtf file together in the FASTA header.
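If both IDs need to be pulled back out of such headers, a pattern match on the usual `TCONS_<digits>` and `XLOC_<digits>` naming is one option. This is a sketch only: the exact header layout written by the script is not shown in this thread, so the example header in the usage note is hypothetical and the function does not depend on field order.

```python
import re


def cufflinks_ids(header):
    """Return (transcript_id, gene_id) found in a FASTA header, or None for each if absent."""
    tcons = re.search(r"TCONS_\d+", header)
    xloc = re.search(r"XLOC_\d+", header)
    return (
        tcons.group(0) if tcons else None,
        xloc.group(0) if xloc else None,
    )
```

For example, `cufflinks_ids(">TCONS_00000001 XLOC_000001")` returns `("TCONS_00000001", "XLOC_000001")`.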

9.4 years ago
wanziyi89 ▴ 60

Hi,

Can TransDecoder annotate the 5' UTR and 3' UTR as well?

regards,
