How Can One Separate Noncoding From Coding UCSC And Ensembl Transcripts?
7.9 years ago
biorepine ★ 1.5k

Dear biostars,

Do you know how one can separate noncoding from coding UCSC and Ensembl transcripts? In general, I use NR_* to identify noncoding genes and NM_* to identify protein-coding genes in the RefSeq database.

Thanks in advance.

code ucsc ensembl refseq • 5.2k views

What is your input? A list of knownGene identifiers? A list of ENSGxxxxxxx?


Yes: ENS* in the case of Ensembl and ucsc.* in the case of UCSC.


ucsc.*? Can you give one example, please?

7.9 years ago

For UCSC knownGene, you can select the noncoding transcripts as those having cdsStart == cdsEnd:

$ mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -e 'select name,chrom,cdsStart,cdsEnd from knownGene where cdsStart=cdsEnd limit 10'
+------------+-------+----------+--------+
| name       | chrom | cdsStart | cdsEnd |
+------------+-------+----------+--------+
| uc001aaa.3 | chr1  |    11873 |  11873 |
| uc010nxr.1 | chr1  |    11873 |  11873 |
| uc009vis.3 | chr1  |    14361 |  14361 |
| uc009vit.3 | chr1  |    14361 |  14361 |
| uc009viu.3 | chr1  |    14361 |  14361 |
| uc001aae.4 | chr1  |    14361 |  14361 |
| uc001aah.4 | chr1  |    14361 |  14361 |
| uc009vir.3 | chr1  |    14361 |  14361 |
| uc009viq.3 | chr1  |    14361 |  14361 |
| uc001aac.4 | chr1  |    14361 |  14361 |
+------------+-------+----------+--------+
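The same test can be applied offline to a table you have already downloaded. Below is a minimal sketch, assuming a tab-separated export with the columns `name`, `chrom`, `cdsStart` and `cdsEnd`; the row `ucExample.1` is a made-up coding transcript added only for illustration, and `split_by_cds` is a hypothetical helper name.

```python
import csv
import io


def split_by_cds(rows):
    """Partition knownGene-style rows into (coding, noncoding) name lists."""
    coding, noncoding = [], []
    for row in rows:
        # A transcript without a CDS has cdsStart equal to cdsEnd.
        if int(row["cdsStart"]) == int(row["cdsEnd"]):
            noncoding.append(row["name"])
        else:
            coding.append(row["name"])
    return coding, noncoding


# Two rows from the query output above plus one invented coding row.
EXAMPLE_TSV = (
    "name\tchrom\tcdsStart\tcdsEnd\n"
    "uc001aaa.3\tchr1\t11873\t11873\n"
    "uc009vis.3\tchr1\t14361\t14361\n"
    "ucExample.1\tchr1\t1000\t2000\n"  # hypothetical, not a real identifier
)

rows = list(csv.DictReader(io.StringIO(EXAMPLE_TSV), delimiter="\t"))
coding, noncoding = split_by_cds(rows)
print("coding:", coding)
print("noncoding:", noncoding)
```

Everything that fails the equality test lands in the coding list, which is the offline equivalent of running the query with `cdsStart != cdsEnd`.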

So if I change the condition to cdsStart != cdsEnd, does it print only coding genes? Thanks.

7.9 years ago

You can download the Ensembl annotation from BioMart (http://useast.ensembl.org/biomart/martview/). There you can select the Gene Biotype attribute, which will tell you whether a given transcript is protein-coding or noncoding.
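Once the table is exported, splitting it by biotype takes a few lines. This is a minimal sketch, assuming a tab-separated export whose header uses the column names `Transcript stable ID` and `Transcript type` (adjust them to whatever your export contains). The identifier `ENSMUST_EXAMPLE` is invented for illustration, and treating every biotype other than `protein_coding` as noncoding is a simplification, since pseudogenes and similar classes would also fall into that bin.

```python
import csv
import io


def split_by_biotype(tsv_text,
                     id_col="Transcript stable ID",
                     type_col="Transcript type"):
    """Return (coding, noncoding) ID lists from a BioMart-style TSV export.

    The column names are assumptions about the export header.
    """
    coding, noncoding = [], []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        # Simplification: anything that is not protein_coding counts as noncoding.
        target = coding if row[type_col] == "protein_coding" else noncoding
        target.append(row[id_col])
    return coding, noncoding


EXAMPLE_EXPORT = (
    "Transcript stable ID\tTranscript type\n"
    "ENSMUST00000132101\tlincRNA\n"
    "ENSMUST00000181417\tlincRNA\n"
    "ENSMUST_EXAMPLE\tprotein_coding\n"  # hypothetical identifier
)

biotype_coding, biotype_noncoding = split_by_biotype(EXAMPLE_EXPORT)
print("coding:", biotype_coding)
print("noncoding:", biotype_noncoding)
```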


Thanks, but any idea regarding UCSC transcripts?


You can input UCSC IDs into BioMart.

There's also a help video on BioMart (embedded in the original post).


Using Ensembl BioMart, is it possible to find the gene biotype on the opposite (antisense) strand, in particular whether it is coding or noncoding?


I'm afraid I don't understand your question. Are you looking to find out if there's a gene on the opposite strand of your gene of interest and find out what its biotype is? If so, there isn't a way to do that using BioMart. That would be a job for the Perl API.


I am sorry that my question was not clear, but you still got it right: yes, I am indeed interested in looking at the noncoding genes on the opposite strand of my gene of interest. I will look into the Perl API. Thanks again.
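If the annotation is already available locally as a table of gene coordinates, the opposite-strand lookup can also be done without the Perl API. This is a rough sketch of that alternative, not the Perl API approach suggested above. The `Gene` record, the `antisense_overlaps` helper, and all gene names and coordinates below are hypothetical, and the intervals are assumed to be half-open with strands given as `+` or `-`.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int   # half-open interval: start inclusive
    end: int     # end exclusive
    strand: str  # "+" or "-"
    biotype: str


def antisense_overlaps(query, genes):
    """Return genes that overlap `query` on the opposite strand."""
    return [
        g for g in genes
        if g.chrom == query.chrom
        and g.strand != query.strand
        and g.start < query.end and query.start < g.end
    ]


# Invented records for illustration only.
query_gene = Gene("geneA", "chr8", 1000, 5000, "+", "protein_coding")
annotation = [
    Gene("geneB", "chr8", 4000, 6000, "-", "lincRNA"),         # antisense, overlaps
    Gene("geneC", "chr8", 2000, 3000, "+", "protein_coding"),  # same strand
    Gene("geneD", "chr8", 7000, 8000, "-", "lincRNA"),         # no overlap
    Gene("geneE", "chr9", 1000, 5000, "-", "lincRNA"),         # other chromosome
]

hits = antisense_overlaps(query_gene, annotation)
for g in hits:
    print(g.name, g.biotype)
```

Filtering the hits on `biotype` then separates coding from noncoding antisense genes.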


Great!!! Thanks so much. Currently I am trying to get the reverse-strand information from the BLAT output; if I can't, I will have to switch to the Perl API.


I think hundreds of Ensembl lincRNA annotations are wrong. They should be intergenic, so in principle they should not overlap any known coding transcript, irrespective of strand.

Examples:

chr8    33998976    34060498    NM_001177589_Gm3985    0    -    chr8    33998977    34060498    lincRNA_ENSMUSG00000079070_ENSMUST00000132101_Gm3985    0    -
chr8    33998976    34060498    NM_001177589_Gm3985    0    -    chr8    34000947    34052954    lincRNA_ENSMUSG00000079070_ENSMUST00000180220_Gm3985    0    -
chr8    48265402    48437702    proteinCoding_ENSMUSG00000038143_8_Stox2    0    -    chr8    48379626    48531716    lincRNA_ENSMUSG00000097922_ENSMUST00000181417_AC102862.2    0    -

I'm afraid you've got that wrong. lincRNAs can be anywhere in the genome and can overlap coding genes in both directions.

See the Wikipedia article on lincRNAs.


Please see the wiki again.

Long intergenic non-coding RNAs (lincRNA): "'Intergenic' refers to long non-coding RNAs that are transcribed from non-coding DNA sequences between protein-coding genes."


Whoops yes. I googled lincRNA for a definition and didn't notice that the wiki page wasn't actually called lincRNA.

The Ensembl definition can be found here:

http://www.ensembl.org/info/docs/genebuild/ncrna.html

We include RNAs that overlap other genes by <35%.
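To check a threshold like this against your own intervals, an overlap fraction can be computed directly. This is a minimal sketch under a loud assumption: it measures overlap as the fraction of the lincRNA's genomic span covered by the other gene, which may not be how the Ensembl pipeline measures it (for example, it ignores exon structure). The `overlap_fraction` helper is hypothetical. The coordinates are taken from the Stox2 example row earlier in this thread, and both intervals are assumed to lie on the same chromosome.

```python
def overlap_fraction(a_start, a_end, b_start, b_end):
    """Fraction of interval A covered by interval B (same chromosome assumed)."""
    overlap = max(0, min(a_end, b_end) - max(a_start, b_start))
    return overlap / (a_end - a_start)


# A: lincRNA ENSMUST00000181417 (chr8:48379626-48531716)
# B: Stox2 (chr8:48265402-48437702)
frac = overlap_fraction(48379626, 48531716, 48265402, 48437702)
print(f"{frac:.1%} of the lincRNA span overlaps Stox2")
```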


The wiki is right. The original definition came from here: http://www.ncbi.nlm.nih.gov/pubmed/19182780. Maybe you Ensembl guys need to change the name from lincRNA to lncRNA. :)
