Question

how to add a pseduo gene into a GTF file?

0

Entering edit mode

8.3 years ago

super ▴ 60

Hi all

When do RNA-Seq analysis, after I have finished the alignment by Tophat and get the accepted_hits.bam. Now I want to looked at if it is has differential expression at the corresponding area (We called it 'pseudogene' which haven't been annotated by reference genome). How could I do it?

I think the best way is to add this pseudogene to reference genome GTF file. And then I could play HTSeq-count or Cuffdiff to look at the differential expression of this gene .

Here is the effort I made so far . I have add the Chr21:100000-200000 as 'exon' (a pseudo gene) to the genes.gtf of Ensembl, rename the GTF file.

$ head genes_hello.gtf
29       protein_coding  exon    28820825   28834944   .       +       .       exon_id "ENSECAE00000000001"; exon_number "1"; gene_biotype "protein_coding"; gene_id "ENSECAG00000099999"; gene_name "HELLO"; gene_source "ensembl"; p_id "P99999"; transcript_id "ENSECAT00000099999"; transcript_name "HELLO-001"; transcript_source "ensembl"; tss_id "TSS9999";**
1       protein_coding  UTR     11193   11209   .       +       .       gene_biotype "protein_coding"; gene_id "ENSECAG00000012421"; gene_name "SYCE1"; gene_source "ensembl"; p_id "P20975"; transcript_id "ENSECAT00000013004"; transcript_name "SYCE1-201"; transcript_source "ensembl"; tss_id "TSS1013";
1       protein_coding  exon    11193   11261   .       +       .       exon_id "ENSECAE00000079002"; exon_number "1"; gene_biotype "protein_coding"; gene_id "ENSECAG00000012421"; gene_name "SYCE1"; gene_source "ensembl"; p_id "P20975"; transcript_id "ENSECAT00000013004"; transcript_name "SYCE1-201"; transcript_source "ensembl"; tss_id "TSS1013";
1       protein_coding  transcript      11193   15975   .       +       .       gene_biotype "protein_coding"; gene_id "ENSECAG00000012421"; gene_name "SYCE1"; gene_source "ensembl"; p_id "P20975"; transcript_id "ENSECAT00000013004"; transcript_name "SYCE1-201"; transcript_source "ensembl"; tss_id "TSS1013";
1       protein_coding  CDS     11210   11261   .       +       0       exon_number "1"; gene_biotype "protein_coding"; gene_id "ENSECAG00000012421"; gene_name "SYCE1"; gene_source "ensembl"; p_id "P20975"; protein_id "ENSECAP00000010287"; transcript_id "ENSECAT00000013004"; transcript_name "SYCE1-201"; transcript_source "ensembl"; tss_id "TSS1013";

However, if I run the Cuffdiff directly, the output (i.e. gene_exp.diff) didn't contain any my pseudo gene information. And if I ran HTSeq-count, it gave me the error:

$ htseq-count -s yes aligned_sorted.sam genes_hello.gtf>gene_count.txt

Error occured when processing GFF file (line 1 of file genes.gtf):
  need more than 1 value to unpack
  [Exception type: ValueError, raised in __init__.py:207]

It seems that I didn't add the pseudogene properly. Does anyone know how to do it?

Thank you!

RNA-Seq gtf genome • 4.7k views

ADD COMMENT • link updated 20 months ago by Ram 43k • written 8.3 years ago by super ▴ 60

0

Entering edit mode

hi,

A GTF contains for each gene a hierarchy of information as: gene, transcript, exon. Probably you need to enter two more lines stating as "gene" and a "transcript" in Col 3 of GTF for your "pseudo-gene".

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 8.3 years ago by Amitm ★ 2.2k

0

Entering edit mode

Cheers, I remember I have added not only exon but also CDS,UTR, sth. but I failed aw well.

So I only keep the exon in the gtf file and post this question.

But the problem must be the GTF file format.

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 8.3 years ago by super ▴ 60

Ram · Answer 1 · 2016-01-12

0

Entering edit mode

8.3 years ago

BioProg • 0

Do you mean that your GTF file contains the pseudogene annotations but the cufflinks output does not have the "biotype/function" of the transcript and gene IDs? if so, I can share with you a script to extract that. I had the same problem last week.

ADD COMMENT • link updated 4.3 years ago by Ram 43k • written 8.3 years ago by BioProg • 0

0

Entering edit mode

I wrote the pseudo gene annotation into the GTF file (as I pasted in the question post).

I want to look at the differential expression of this pseudo gene.

But when I ran Cuffdiff and it didn't contain this pseudo gene in output file (gene_exp.diff)

and I ran HTSEQ-count, and I got the error.

Does your script work? Thank you so much!

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 8.3 years ago by super ▴ 60

Ram · Answer 2 · 2016-01-12

0

Entering edit mode

8.3 years ago

Martombo ★ 3.1k

is your GFF tab-separated? I think that's a compulsory requirement.

line 207 (in my version 0.6.1) is the following:

( seqname, source, feature, start, end, score, strand, frame, attributeStr ) = line.split( "\t", 8 )

which means that the program cannot split your string with a tab delimiter in 9 fields

ADD COMMENT • link 8.3 years ago by Martombo ★ 3.1k

0

Entering edit mode

Hi Martombo

I think so. I have edited my post and pasted the original GTF file.

Do you have any ideas? Thank you!

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 8.3 years ago by super ▴ 60

score 0 · Answer 3 · 2017-05-05

0

Entering edit mode

7.0 years ago

olneykimberly • 0

was this resolved? I am also trying to add a pseudogene to my GTF file but when I run cuffdiff the pseudogene is not in the gene_exp.diff output.

ADD COMMENT • link 7.0 years ago by olneykimberly • 0

0

Entering edit mode

When you add a new gene, make sure you add an entry for gene, transcript, and exon features (column 3). In the original post, there was only an exon.

ADD REPLY • link 7.0 years ago by igor 13k