Question

How can I get Transcript ID from the gene ID?

6.7 years ago

Carlos Caicedo ▴ 210

Dear all

I have a list of gene IDs in tabular format. How can I extract the transcript IDs for these gene IDs from a GFF file?

Thank you so much.

Carlos

RNA-Seq • 4.8k views

updated 6.7 years ago by Jeffin Rockey ★ 1.3k • written 6.7 years ago by Carlos Caicedo ▴ 210

It depends on which genome this is, but you could try the BioMart tool from Ensembl.

6.7 years ago by GenoMax 141k

I have data from a bacterial species, so I don't think BioMart will work in this case.

6.7 years ago by Carlos Caicedo ▴ 210

If this is a bacterium, then you should have a single transcript for each gene, since there is no alternative splicing, right?

6.7 years ago by GenoMax 141k

Of course, you are right. Let me try to explain my question better.

The GFF file looks something like this:

chromosome  ena gene    661 1041    .   -   .   ID=gene:SCLAV_0001;biotype=protein_coding;description=Hypothetical protein;gene_id=SCLAV_0001;logic_name=ena;version=1
chromosome  ena transcript  661 1041    .   -   .   ID=transcript:EFG05077;Parent=gene:SCLAV_0001;biotype=protein_coding;transcript_id=EFG05077;version=1
chromosome  ena exon    661 1041    .   -   .   Parent=transcript:EFG05077;Name=EFG05077-1;constitutive=1;ensembl_end_phase=0;ensembl_phase=0;exon_id=EFG05077-1;rank=1;version=1
chromosome  ena CDS 661 1041    .   -   0   ID=CDS:EFG05077;Parent=transcript:EFG05077;protein_id=EFG05077

I have a list of gene IDs:

SCLAV_0001
SCLAV_0002

For each gene in the list, I need to get the transcript ID.

For instance:

SCLAV_0001   EFG05077
SCLAV_0002  EFG0XXY

And so on.

updated 6.7 years ago by GenoMax 141k • written 6.7 years ago by Carlos Caicedo ▴ 210

score 1 · Answer 1 · 2017-08-25

I hope the one-liner below helps, or at least points the way:

awk -F'\t' '$3=="transcript" {print $9}' yourGeneModel.gff3 | sed -e 's|ID=transcript:\([^;]*\).*Parent=gene:\([^;]*\).*|\2\t\1|'

If the gene model has exactly one transcript per gene, this should be enough. Otherwise, one more script would be required to combine the multiple transcripts of each gene.
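
To restrict the output to the genes in your list, the same idea can be done in a single awk pass over both files. This is only a sketch. It assumes the GFF3 is tab-separated, that the attributes follow the `ID=transcript:...;Parent=gene:...` pattern shown in the example above, and that the gene list is non-empty with one gene ID in its first column. The function name `gene2tx` and the file names are made up for illustration.

```shell
# Hypothetical helper: gene2tx GENE_LIST GFF3
# Prints "gene_id<TAB>transcript_id" for every transcript whose parent
# gene appears in GENE_LIST (gene ID in the first tab-separated column).
gene2tx() {
    awk -F'\t' '
        # First file (gene list): remember each wanted gene ID.
        NR == FNR { if ($1 != "") wanted[$1] = 1; next }

        # Second file (GFF3): look only at transcript features.
        $3 == "transcript" {
            tx = ""; gene = ""
            n = split($9, attr, ";")
            for (i = 1; i <= n; i++) {
                if (attr[i] ~ /^ID=transcript:/) {
                    tx = attr[i]; sub(/^ID=transcript:/, "", tx)
                }
                if (attr[i] ~ /^Parent=gene:/) {
                    gene = attr[i]; sub(/^Parent=gene:/, "", gene)
                }
            }
            # One output line per transcript of a wanted gene.
            if ((gene in wanted) && tx != "") print gene "\t" tx
        }
    ' "$1" "$2"
}
```

Usage would be something like `gene2tx genes.txt yourGeneModel.gff3 > gene_to_transcript.tsv`. A gene with several transcripts simply yields several lines, so no extra combining step is needed unless you want them on one row.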