how to add a pseduo gene into a GTF file?
3
0
Entering edit mode
6.9 years ago
super ▴ 60

Hi all

When do RNA-Seq analysis, after I have finished the alignment by Tophat and get the accepted_hits.bam. Now I want to looked at if it is has differential expression at the corresponding area (We called it 'pseudogene' which haven't been annotated by reference genome). How could I do it?

I think the best way is to add this pseudogene to reference genome GTF file. And then I could play HTSeq-count or Cuffdiff to look at the differential expression of this gene .

Here is the effort I made so far . I have add the Chr21:100000-200000 as 'exon' (a pseudo gene) to the genes.gtf of Ensembl, rename the GTF file.

$ head genes_hello.gtf
29       protein_coding  exon    28820825   28834944   .       +       .       exon_id "ENSECAE00000000001"; exon_number "1"; gene_biotype "protein_coding"; gene_id "ENSECAG00000099999"; gene_name "HELLO"; gene_source "ensembl"; p_id "P99999"; transcript_id "ENSECAT00000099999"; transcript_name "HELLO-001"; transcript_source "ensembl"; tss_id "TSS9999";**
1       protein_coding  UTR     11193   11209   .       +       .       gene_biotype "protein_coding"; gene_id "ENSECAG00000012421"; gene_name "SYCE1"; gene_source "ensembl"; p_id "P20975"; transcript_id "ENSECAT00000013004"; transcript_name "SYCE1-201"; transcript_source "ensembl"; tss_id "TSS1013";
1       protein_coding  exon    11193   11261   .       +       .       exon_id "ENSECAE00000079002"; exon_number "1"; gene_biotype "protein_coding"; gene_id "ENSECAG00000012421"; gene_name "SYCE1"; gene_source "ensembl"; p_id "P20975"; transcript_id "ENSECAT00000013004"; transcript_name "SYCE1-201"; transcript_source "ensembl"; tss_id "TSS1013";
1       protein_coding  transcript      11193   15975   .       +       .       gene_biotype "protein_coding"; gene_id "ENSECAG00000012421"; gene_name "SYCE1"; gene_source "ensembl"; p_id "P20975"; transcript_id "ENSECAT00000013004"; transcript_name "SYCE1-201"; transcript_source "ensembl"; tss_id "TSS1013";
1       protein_coding  CDS     11210   11261   .       +       0       exon_number "1"; gene_biotype "protein_coding"; gene_id "ENSECAG00000012421"; gene_name "SYCE1"; gene_source "ensembl"; p_id "P20975"; protein_id "ENSECAP00000010287"; transcript_id "ENSECAT00000013004"; transcript_name "SYCE1-201"; transcript_source "ensembl"; tss_id "TSS1013";

However, if I run the Cuffdiff directly, the output (i.e. gene_exp.diff) didn't contain any my pseudo gene information. And if I ran HTSeq-count, it gave me the error:

$ htseq-count -s yes aligned_sorted.sam genes_hello.gtf>gene_count.txt

Error occured when processing GFF file (line 1 of file genes.gtf):
  need more than 1 value to unpack
  [Exception type: ValueError, raised in __init__.py:207]

It seems that I didn't add the pseudogene properly. Does anyone know how to do it?

Thank you!

RNA-Seq gtf genome • 4.0k views
ADD COMMENT
0
Entering edit mode

hi,

A GTF contains for each gene a hierarchy of information as: gene, transcript, exon. Probably you need to enter two more lines stating as "gene" and a "transcript" in Col 3 of GTF for your "pseudo-gene".

ADD REPLY
0
Entering edit mode

Cheers, I remember I have added not only exon but also CDS,UTR, sth. but I failed aw well.

So I only keep the exon in the gtf file and post this question.

But the problem must be the GTF file format.

ADD REPLY
0
Entering edit mode
6.9 years ago
BioProg • 0

Do you mean that your GTF file contains the pseudogene annotations but the cufflinks output does not have the "biotype/function" of the transcript and gene IDs? if so, I can share with you a script to extract that. I had the same problem last week.

ADD COMMENT
0
Entering edit mode

I wrote the pseudo gene annotation into the GTF file (as I pasted in the question post).

I want to look at the differential expression of this pseudo gene.

But when I ran Cuffdiff and it didn't contain this pseudo gene in output file (gene_exp.diff)

and I ran HTSEQ-count, and I got the error.

Does your script work? Thank you so much!

ADD REPLY
0
Entering edit mode
6.9 years ago
Martombo ★ 3.0k

is your GFF tab-separated? I think that's a compulsory requirement.

line 207 (in my version 0.6.1) is the following:

( seqname, source, feature, start, end, score, strand, frame, attributeStr ) = line.split( "\t", 8 )

which means that the program cannot split your string with a tab delimiter in 9 fields

ADD COMMENT
0
Entering edit mode

Hi Martombo

I think so. I have edited my post and pasted the original GTF file.

Do you have any ideas? Thank you!

ADD REPLY
0
Entering edit mode
5.6 years ago

was this resolved? I am also trying to add a pseudogene to my GTF file but when I run cuffdiff the pseudogene is not in the gene_exp.diff output.

ADD COMMENT
0
Entering edit mode

When you add a new gene, make sure you add an entry for gene, transcript, and exon features (column 3). In the original post, there was only an exon.

ADD REPLY

Login before adding your answer.

Traffic: 1551 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6