Question

How to had CDS feature with cufflinks (cuffmerge) using custom GTF?

0

Entering edit mode

9.3 years ago

A. Domingues ★ 2.7k

I am trying to do differential gene expression analysis, and analysis of 3'UTR usage, in a knock-down experiment in Zebrafish (RNA-seq, PE). Since I am interested in 3'UTR processing, I am using the GTF with improved 3'UTR annotation provided in Junker et al. Here is what it looks like (it also contains "regular" chr seqnames):

track name="tomo-seq extended gene models"
Zv9_NA250       Ensembl exon    3       727     .       -       .       gene_id "ENSDARG00000091021"; transcript_id "ENSDART00000124805"; exon_number "2"; gene_biotype "protein_coding"; gene_name "CABZ01001466.1"; transcript_name "CABZ01001466.1-201"
Zv9_NA250       Ensembl exon    5774    5858    .       -       .       gene_id "ENSDARG00000091021"; transcript_id "ENSDART00000124805"; exon_number "1"; gene_biotype "protein_coding"; gene_name "CABZ01001466.1"; transcript_name "CABZ01001466.1-201"
Zv9_NA250       Ensembl transcript      3       5858    .       -       .       gene_id "ENSDARG00000091021"; transcript_id "ENSDART00000124805"; gene_biotype "protein_coding"; gene_name "CABZ01001466.1"; transcript_name "CABZ01001466.1-201"
Zv9_NA619       Ensembl exon    2       678     .       -       .       gene_id "ENSDARG00000073818"; transcript_id "ENSDART00000111264"; exon_number "2"; gene_biotype "protein_coding"; gene_name "CABZ01031998.1"; transcript_name "CABZ01031998.1-201"
Zv9_NA619       Ensembl exon    4863    4947    .       -       .       gene_id "ENSDARG00000073818"; transcript_id "ENSDART00000111264"; exon_number "1"; gene_biotype "protein_coding"; gene_name "CABZ01031998.1"; transcript_name "CABZ01031998.1-201"
Zv9_NA619       Ensembl transcript      2       4947    .       -       .       gene_id "ENSDARG00000073818"; transcript_id "ENSDART00000111264"; gene_biotype "protein_coding"; gene_name "CABZ01031998.1"; transcript_name "CABZ01031998.1-201"
Zv9_NA979       Ensembl exon    353     499     .       -       .       gene_id "ENSDARG00000089935"; transcript_id "ENSDART00000123414"; exon_number "2"; gene_biotype "protein_coding"; gene_name ""; transcript_name ""
Zv9_NA979       Ensembl exon    720     854     .       -       .       gene_id "ENSDARG00000089935"; transcript_id "ENSDART00000123414"; exon_number "1"; gene_biotype "protein_coding"; gene_name ""; transcript_name ""
Zv9_NA979       Ensembl transcript      21      854     .       -       .       gene_id "ENSDARG00000089935"; transcript_id "ENSDART00000123414"; gene_biotype "protein_coding"; gene_name ""; transcript_name ""

for the sake of consistency, the Tuxedo pipeline was used:

mapping to Zv9 with Tophat;
followed by RABT cufflinks2 on each sample to find new transcripts;
cuffmerge to generated a new reference from all the transcripts.gtf and the Junker.gtf
cuddfiff2 for differential expression.

The problem is that the results of differential CDS use are missing form the cuffdiff output (empty file). Tracked down the issue, and it stems from the lack of "p_id" in the merged.gtf from cuffmerge:

chr1    Cufflinks       exon    1       635     .       +       .       gene_id "XLOC_001204"; transcript_id "TCONS_00001298"; exon_number "1"; oId "CUFF.38.1"; class_code "u"; tss_id "TSS1231";
chr1    Cufflinks       exon    3762    3930    .       +       .       gene_id "XLOC_001205"; transcript_id "TCONS_00001300"; exon_number "1"; gene_name "F7 (3 of 3)"; oId "CUFF.40.2"; nearest_ref "ENSDART00000109479"; class_code "j"; tss_id "TSS1232";
chr1    Cufflinks       exon    4646    4809    .       +       .       gene_id "XLOC_001205"; transcript_id "TCONS_00001300"; exon_number "2"; gene_name "F7 (3 of 3)"; oId "CUFF.40.2"; nearest_ref "ENSDART00000109479"; class_code "j"; tss_id "TSS1232";
chr1    Cufflinks       exon    5178    5202    .       +       .       gene_id "XLOC_001205"; transcript_id "TCONS_00001300"; exon_number "3"; gene_name "F7 (3 of 3)"; oId "CUFF.40.2"; nearest_ref "ENSDART00000109479"; class_code "j"; tss_id "TSS1232";
chr1    Cufflinks       exon    5533    5646    .       +       .       gene_id "XLOC_001205"; transcript_id "TCONS_00001300"; exon_number "4"; gene_name "F7 (3 of 3)"; oId "CUFF.40.2"; nearest_ref "ENSDART00000109479"; class_code "j"; tss_id "TSS1232";
chr1    Cufflinks       exon    6247    6393    .       +       .       gene_id "XLOC_001205"; transcript_id "TCONS_00001300"; exon_number "5"; gene_name "F7 (3 of 3)"; oId "CUFF.40.2"; nearest_ref "ENSDART00000109479"; class_code "j"; tss_id "TSS1232";
chr1    Cufflinks       exon    11050   11141   .       +       .       gene_id "XLOC_001205"; transcript_id "TCONS_00001300"; exon_number "6"; gene_name "F7 (3 of 3)"; oId "CUFF.40.2"; nearest_ref "ENSDART00000109479"; class_code "j"; tss_id "TSS1232";
chr1    Cufflinks       exon    11240   11360   .       +       .       gene_id "XLOC_001205"; transcript_id "TCONS_00001300"; exon_number "7"; gene_name "F7 (3 of 3)"; oId "CUFF.40.2"; nearest_ref "ENSDART00000109479"; class_code "j"; tss_id "TSS1232";
chr1    Cufflinks       exon    13073   15178   .       +       .       gene_id "XLOC_001205"; transcript_id "TCONS_00001300"; exon_number "8"; gene_name "F7 (3 of 3)"; oId "CUFF.40.2"; nearest_ref "ENSDART00000109479"; class_code "j"; tss_id "TSS1232";
chr1    Cufflinks       exon    3762    3930    .       +       .       gene_id "XLOC_001205"; transcript_id "TCONS_00001299"; exon_number "1"; gene_name "F7 (3 of 3)"; oId "CUFF.40.1"; nearest_ref "ENSDART00000109479"; class_code "j"; tss_id "TSS1232";

This, despite the fact that cuffmerge was ran with the option -s fasta. Also tried these cuffdiff's instructions:

Note: If an arbitrary GTF/GFF3 file is used as input (instead of the .combined.gtf file produced by Cuffcompare), these attributes will not be present, but Cuffcompare can still be used to obtain these attributes with a command like this:
cuffcompare -s /path/to/genome_seqs.fa -CG -r annotation.gtf annotation.gtf

But p_id is still missing. This is an issue because the CDS information is quite critical for my analysis.

Does anyone have an idea of how to solve this? Is this because I am using a non UCSC/Ensembl annotation?

tophat tomoseq cuffcompare cuffmerge cufflinks • 4.4k views

ADD COMMENT • link updated 2.1 years ago by Ram 43k • written 9.3 years ago by A. Domingues ★ 2.7k

0

Entering edit mode

If you don't have annotated coding sequences in the annotation file then it's impossible to assign protein IDs (p_id). You need to add those entries first and then you'll be able to proceed. I'm not familiar with any tools to do that, though they presumably exist (maybe in GMOD?). If nothing else, you could always write something (use biopython/bioperl and find the longest ORF in each transcript), though that wouldn't be a trivial thing to write.

N.B., posting as a comment to not dissuade someone that knows of a simple solution from answering.

ADD REPLY • link 9.3 years ago by Devon Ryan 104k

0

Entering edit mode

That is what I feared. Just occurred to me that maybe an Ensembl GTF could be used during the merge: it contains the p_id (or cds info); and Junker et al used one to base their gene models on. Cheers.

ADD REPLY • link 9.3 years ago by A. Domingues ★ 2.7k

1

Entering edit mode

Just make sure to change the chromosome names, since Ensembl doesn't use the "chr" prefix.

ADD REPLY • link 9.3 years ago by Devon Ryan 104k

Ram · Accepted Answer · 2015-01-21

I managed to do it in a round about way:

1. Using cuffcompare, add cufflink 'tags' to the reference GTF, from ensembl (wiht added chr and UCSC compatible chromosome names):

cuffcompare -s $fasta -CG -r $gtfFile $gtfFile
mv cuffcmp.combined.gtf Annotation/Genes/Danio_rerio.Zv9.74.chrom.TR.cuffcmp.combined.gtf

This adds the p_id and tss_ids to the gtf, plus some other information requested for full cufflinks analysis.

2. With cuffmerge, create a new gtf, merging ensembl with Junker. This, since the ensembl GTF has the p_id, the output also has it:

cuffmerge -p 12 --ref-gtf /home/adomingu/imb-kettinggr/genomes/Danio_rerio/Ensembl/Zv9/Annotation/Genes/Danio_rerio.Zv9.74.chrom.TR.gtf -s /home/adomingu/imb-kettinggr/genomes/Danio_rerio/Ensembl/Zv9/Sequence/chr_sequences/chr.clean.fa -o cuffmerge_junker assembly_list.txt

Basically, all it is needed is an annotation that has the p_id, and cuffmerge will take of the rest.