how to add a pseduo gene into a GTF file?
3
0
Entering edit mode
6.9 years ago
super ▴ 60

Hi all

When do RNA-Seq analysis, after I have finished the alignment by Tophat and get the accepted_hits.bam. Now I want to looked at if it is has differential expression at the corresponding area (We called it 'pseudogene' which haven't been annotated by reference genome). How could I do it?

I think the best way is to add this pseudogene to reference genome GTF file. And then I could play HTSeq-count or Cuffdiff to look at the differential expression of this gene .

Here is the effort I made so far . I have add the Chr21:100000-200000 as 'exon' (a pseudo gene) to the genes.gtf of Ensembl, rename the GTF file.

$head genes_hello.gtf 29 protein_coding exon 28820825 28834944 . + . exon_id "ENSECAE00000000001"; exon_number "1"; gene_biotype "protein_coding"; gene_id "ENSECAG00000099999"; gene_name "HELLO"; gene_source "ensembl"; p_id "P99999"; transcript_id "ENSECAT00000099999"; transcript_name "HELLO-001"; transcript_source "ensembl"; tss_id "TSS9999";** 1 protein_coding UTR 11193 11209 . + . gene_biotype "protein_coding"; gene_id "ENSECAG00000012421"; gene_name "SYCE1"; gene_source "ensembl"; p_id "P20975"; transcript_id "ENSECAT00000013004"; transcript_name "SYCE1-201"; transcript_source "ensembl"; tss_id "TSS1013"; 1 protein_coding exon 11193 11261 . + . exon_id "ENSECAE00000079002"; exon_number "1"; gene_biotype "protein_coding"; gene_id "ENSECAG00000012421"; gene_name "SYCE1"; gene_source "ensembl"; p_id "P20975"; transcript_id "ENSECAT00000013004"; transcript_name "SYCE1-201"; transcript_source "ensembl"; tss_id "TSS1013"; 1 protein_coding transcript 11193 15975 . + . gene_biotype "protein_coding"; gene_id "ENSECAG00000012421"; gene_name "SYCE1"; gene_source "ensembl"; p_id "P20975"; transcript_id "ENSECAT00000013004"; transcript_name "SYCE1-201"; transcript_source "ensembl"; tss_id "TSS1013"; 1 protein_coding CDS 11210 11261 . + 0 exon_number "1"; gene_biotype "protein_coding"; gene_id "ENSECAG00000012421"; gene_name "SYCE1"; gene_source "ensembl"; p_id "P20975"; protein_id "ENSECAP00000010287"; transcript_id "ENSECAT00000013004"; transcript_name "SYCE1-201"; transcript_source "ensembl"; tss_id "TSS1013";  However, if I run the Cuffdiff directly, the output (i.e. gene_exp.diff) didn't contain any my pseudo gene information. And if I ran HTSeq-count, it gave me the error: $ htseq-count -s yes aligned_sorted.sam genes_hello.gtf>gene_count.txt

Error occured when processing GFF file (line 1 of file genes.gtf):
need more than 1 value to unpack
[Exception type: ValueError, raised in __init__.py:207]


It seems that I didn't add the pseudogene properly. Does anyone know how to do it?

Thank you!

RNA-Seq gtf genome • 4.0k views
0
Entering edit mode

hi,

A GTF contains for each gene a hierarchy of information as: gene, transcript, exon. Probably you need to enter two more lines stating as "gene" and a "transcript" in Col 3 of GTF for your "pseudo-gene".

0
Entering edit mode

Cheers, I remember I have added not only exon but also CDS,UTR, sth. but I failed aw well.

So I only keep the exon in the gtf file and post this question.

But the problem must be the GTF file format.

0
Entering edit mode
6.9 years ago
BioProg • 0

Do you mean that your GTF file contains the pseudogene annotations but the cufflinks output does not have the "biotype/function" of the transcript and gene IDs? if so, I can share with you a script to extract that. I had the same problem last week.

0
Entering edit mode

I wrote the pseudo gene annotation into the GTF file (as I pasted in the question post).

I want to look at the differential expression of this pseudo gene.

But when I ran Cuffdiff and it didn't contain this pseudo gene in output file (gene_exp.diff)

and I ran HTSEQ-count, and I got the error.

Does your script work? Thank you so much!

0
Entering edit mode
6.9 years ago
Martombo ★ 3.0k

is your GFF tab-separated? I think that's a compulsory requirement.

line 207 (in my version 0.6.1) is the following:

( seqname, source, feature, start, end, score, strand, frame, attributeStr ) = line.split( "\t", 8 )

which means that the program cannot split your string with a tab delimiter in 9 fields

0
Entering edit mode

Hi Martombo

I think so. I have edited my post and pasted the original GTF file.

Do you have any ideas? Thank you!

0
Entering edit mode
5.6 years ago

was this resolved? I am also trying to add a pseudogene to my GTF file but when I run cuffdiff the pseudogene is not in the gene_exp.diff output.

0
Entering edit mode

When you add a new gene, make sure you add an entry for gene, transcript, and exon features (column 3). In the original post, there was only an exon.