Transcripts - coding or ncRNA?
10.7 years ago
HBR ▴ 10

Hi, first of all, thanks for looking at my post. I have transcript IDs extracted from the Cufflinks output (transcripts.gtf). Is it possible to find out whether these transcripts are coding or non-coding and, for the non-coding ones, whether they are pseudogenes, snoRNAs, miRNAs or other types of ncRNA?

Is there any online bioinformatics tool available to detect the biotype of transcripts?

These are a few of my sample transcript IDs:

ENST00000492842
ENST00000492842
ENST00000410691
ENST00000410691
ENST00000493797
ENST00000493797
ENST00000493797
ENST00000496488
ENST00000496488
ENST00000496488
ENST00000458203
ENST00000458203

Any help in this regard will be highly appreciated. Thanks in advance.

10.7 years ago
Rm 8.3k

The GTF file used for Cufflinks should provide you with the "biotype" information.

For example, from the Ensembl Homo_sapiens GRCh37 release 71 GTF file:

11      snRNA   exon    10420739        10420864        .       -       .        gene_id "ENSG00000221574"; transcript_id "ENST00000408647"; exon_number "1"; gene_name "U6atac"; gene_biotype "snRNA"; transcript_name "U6atac.23-201"; exon_id "ENSE00001565282";
11      snoRNA  exon    10823014        10823155        .       -       .        gene_id "ENSG00000238622"; transcript_id "ENST00000459187"; exon_number "1"; gene_name "SNORD97"; gene_biotype "snoRNA"; transcript_name "SNORD97-201"; exon_id "ENSE00001806941";

Then map each transcript or gene to its biotype.

A quick awk script; save it as get.gtf.ensg.biotypes.awk:

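# Print gene ID, transcript ID and gene biotype for every GTF feature line.
# Assumes gene_id is the first attribute and transcript_id the second,
# as in the Ensembl GTF shown above.
BEGIN { OFS = FS = "\t" }
# Skip lines whose first or second character is "#" (headers/comments).
$1 !~ /^.?#/ {
        n = split($9, attr, ";");
        for (i = 1; i <= n; i++) {
                # Ensembl calls the attribute gene_biotype; GENCODE calls it gene_type.
                if (attr[i] ~ /gene_biotype|gene_type/) {
                        # Strip the key, surrounding spaces and quotes from each value.
                        gsub(/ *(gene_biotype|gene_type) /, "", attr[i]);
                        gsub(/"/, "", attr[i]);
                        gsub(/gene_id "/, "", attr[1]);
                        gsub(/ *transcript_id "/, "", attr[2]);
                        gsub(/"/, "", attr[1]);
                        gsub(/"/, "", attr[2]);
                        #print attr[1], attr[i];   # gene ID and biotype only
                        print attr[1], attr[2], attr[i];
                }
        }
}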

Run it:

awk -f get.gtf.ensg.biotypes.awk Homo_sapiens.GRCh37.71.gtf

Sample output:

 ENSG00000210049         ENST00000387314         Mt_tRNA
 ENSG00000211459         ENST00000389680         Mt_rRNA
 ENSG00000210077         ENST00000387342         Mt_tRNA
 ENSG00000210082         ENST00000387347         Mt_rRNA
 ENSG00000209082         ENST00000386347         Mt_tRNA
 ENSG00000198888         ENST00000361390         protein_coding
 ENSG00000198888         ENST00000361390         protein_coding
 ENSG00000210100         ENST00000387365         Mt_tRNA
 ENSG00000210107         ENST00000387372         Mt_tRNA
 ENSG00000210112         ENST00000387377         Mt_tRNA
 ENSG00000198763         ENST00000361453         protein_coding
 ENSG00000198763         ENST00000361453         protein_coding
.....
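
To get from that table back to the transcript IDs in the question, the three-column output could be filtered by ID. What follows is only a minimal Python sketch, not part of the original answer: the function name `biotypes_for` is made up for illustration, and it assumes the awk output is tab-separated with the transcript ID in the second column and the gene biotype in the third. IDs that are absent from the table are simply left out of the result.

```python
import csv
import io


def biotypes_for(ids, mapping_tsv):
    """Return {transcript_id: gene_biotype} for the requested IDs.

    mapping_tsv is any iterable of tab-separated lines in the form
    gene_id <TAB> transcript_id <TAB> biotype (the awk output above).
    """
    wanted = set(ids)
    found = {}
    for row in csv.reader(mapping_tsv, delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or truncated lines
        gene_id, transcript_id, biotype = (field.strip() for field in row[:3])
        if transcript_id in wanted:
            found[transcript_id] = biotype  # repeated exon rows just overwrite
    return found


# In-memory stand-in for the awk output (rows copied from the sample output).
sample = io.StringIO(
    "ENSG00000210049\tENST00000387314\tMt_tRNA\n"
    "ENSG00000198888\tENST00000361390\tprotein_coding\n"
    "ENSG00000198888\tENST00000361390\tprotein_coding\n"
)
print(biotypes_for(["ENST00000361390", "ENST00000492842"], sample))
```

With a real file, `open("biotypes.tsv")` (a hypothetical file holding the redirected awk output) can be passed in place of `sample`.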
10.7 years ago
jxchong ▴ 160

Those are all Ensembl transcript IDs. You can look up all the information about them in the Ensembl genome browser. Example: http://uswest.ensembl.org/Homo_sapiens/Search/Details?db=core;end=1;idx=Gene;q=ENST00000492842;species=Homo_sapiens
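
For more than a handful of IDs, the same lookup could probably be scripted against the Ensembl REST service instead of the browser. The Python sketch below is an assumption-laden illustration, not something from this thread: it relies on the `lookup/id` endpoint at rest.ensembl.org returning a JSON record with a `biotype` field, which should be checked against the current REST documentation. The helper names are made up, and IDs retired since the GRCh37 annotation may not resolve on the current server.

```python
import json
import urllib.request

# Assumed base URL of the public Ensembl REST service.
REST_SERVER = "https://rest.ensembl.org"


def lookup_url(stable_id):
    """Build the lookup URL for one Ensembl stable ID (gene or transcript)."""
    return f"{REST_SERVER}/lookup/id/{stable_id}?content-type=application/json"


def biotype_from_record(record):
    """Pull the biotype out of a decoded lookup record, or None if absent."""
    return record.get("biotype")


def fetch_biotype(stable_id, timeout=30):
    """Query the REST service for one ID and return its biotype."""
    with urllib.request.urlopen(lookup_url(stable_id), timeout=timeout) as response:
        return biotype_from_record(json.load(response))


# Example call (needs network access, so it is left commented out):
# print(fetch_biotype("ENST00000492842"))
```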

10.7 years ago
HBR ▴ 10

Thanks Rm and jxchong for your replies. I really appreciate it.

The awk script is awesome, though I have a question: does the 9th column of the GTF file give the gene biotype or the transcript biotype?

Pardon my ignorance, but is it possible for a transcript to have a different biotype from its corresponding gene?

For example, if gene abc is a miRNA gene and has three transcripts, can the biotypes of those three transcripts be miRNA, protein_coding and snoRNA, or will all three transcripts be miRNA only? Please advise.

Thanks once again for your time on this.


Yes, a gene can have both coding and non-coding transcripts. It's unlikely that there will be three different "active" transcript types from a single gene, but you can certainly get, for example, protein_coding and processed_transcript. For example: http://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000121879;r=3:178865902-178957881

This gene has four transcripts, three of which are coding and one of which is a retained intron.
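
Genes like this one can also be found directly in a GTF by grouping transcripts per gene and comparing their biotypes. The Python sketch below is illustrative only and rests on one assumption: the transcript biotype is taken from a `transcript_biotype` attribute when present, and otherwise from column 2, which is where the GRCh37.71 lines quoted earlier in this thread appear to carry it. The function names and the demo IDs (`GENE_A`, `TX_1`, ...) are hypothetical.

```python
import re
from collections import defaultdict

# Matches key "value" pairs in the GTF attribute column.
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def transcript_biotypes_by_gene(gtf_lines):
    """Return {gene_id: {transcript_id: transcript_biotype}}."""
    genes = defaultdict(dict)
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header/comment line
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete GTF record
        attrs = dict(ATTR_RE.findall(cols[8]))
        if "gene_id" not in attrs or "transcript_id" not in attrs:
            continue  # e.g. gene-level lines without a transcript
        # Assumption: fall back to column 2 when no transcript_biotype attribute exists.
        biotype = attrs.get("transcript_biotype", cols[1])
        genes[attrs["gene_id"]][attrs["transcript_id"]] = biotype
    return genes


def mixed_biotype_genes(genes):
    """Keep only genes whose transcripts have more than one biotype."""
    return {g: t for g, t in genes.items() if len(set(t.values())) > 1}


# Made-up demo records, not real annotation.
demo = [
    '1\tprotein_coding\texon\t1\t100\t.\t+\t.\tgene_id "GENE_A"; transcript_id "TX_1";',
    '1\tretained_intron\texon\t1\t100\t.\t+\t.\tgene_id "GENE_A"; transcript_id "TX_2";',
    '1\tsnoRNA\texon\t200\t300\t.\t-\t.\tgene_id "GENE_B"; transcript_id "TX_3";',
]
print(mixed_biotype_genes(transcript_biotypes_by_gene(demo)))
```

On the demo records only GENE_A is reported, since its two transcripts differ in biotype. For a real file, pass an open file handle instead of `demo`.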


Okay, I got it. Thanks so much Emily_Ensembl.


@HBR: the biotype presented corresponds to the feature type on that line; it can be that of the gene or of the transcript. FYI: please reply as a comment, NOT as an answer, unless you are answering your own question...


Rm - sorry about that, I will make sure to reply as a comment. Thanks
