Question

From gene symbol list to their coordinates

0

9.4 years ago

cfarmeri ▴ 210

Hi, Biostars. I have a list of gene symbols (a plain text file), as in the following example.

=============

Nanog

Sox2

Dnmt3l

Rex2

・

=============

From this, I would like to get each gene's coordinates, as follows.

=============

Nanog,Chr6:122707489-122714633,+

Sox2,Chr3:34650405-34652461,+

Dnmt3l,Chr10:78041947-78063622,+

Rex2,Chr4:147021850-147060794,+

・

=============

What is the simplest way you know to do this?

Any and all suggestions are welcome!

gene symbol coordinate • 9.0k views

ADD COMMENT • link updated 9.3 years ago by MAPK ★ 2.1k • written 9.4 years ago by cfarmeri ▴ 210

Emily · Answer 1 · 2016-03-07

4

9.4 years ago

venu 7.1k

One way is to use the biomaRt package from Bioconductor. If you don't want to use that method, the other way is:

1. Download the appropriate reference genome GTF file from Ensembl.
2. Convert your gene symbols to Ensembl IDs (ID conversion online).
3. Extract the gene coordinates for your genes from the GTF file.
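The extraction step can be sketched with awk. This is only a rough illustration, not a tested pipeline: it assumes an uncompressed Ensembl-style GTF in which gene lines have "gene" in column 3 and carry a gene_name "..." attribute in column 9 (in which case the symbols can be matched directly, without converting to Ensembl IDs first). The function name gtf_gene_coords and the file names are made up for this example, and "chr" is prepended because Ensembl names chromosomes without that prefix.

```shell
# Print "symbol,chrN:start-end,strand" for every listed gene symbol.
# $1 = text file with one gene symbol per line (placeholder name: genes.txt)
# $2 = uncompressed Ensembl-style GTF (placeholder name: annotation.gtf)
gtf_gene_coords() {
    awk -F '\t' '
        # First file: remember the wanted symbols
        NR == FNR { if ($1 != "") want[$1] = 1; next }
        # Second file: skip comment lines and anything that is not a gene feature
        /^#/ || $3 != "gene" { next }
        {
            # Pull the value out of the gene_name "..." attribute in column 9
            if (match($9, /gene_name "[^"]+"/)) {
                name = substr($9, RSTART + 11, RLENGTH - 12)
                if (name in want) print name ",chr" $1 ":" $4 "-" $5 "," $7
            }
        }
    ' "$1" "$2"
}

# Usage (file names are placeholders):
#   gtf_gene_coords genes.txt annotation.gtf > coordinates.csv
```

Symbols are matched exactly and case-sensitively, so a symbol that is missing from the output is worth checking for an alias or a different spelling in the annotation.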

Both methods require a little programming knowledge (R for the first, some simple Unix commands for the second). I think the online Ensembl BioMart tool also provides this information, though I've never used it.

ADD COMMENT • link updated 9.4 years ago by Emily 24k • written 9.4 years ago by venu 7.1k

3

The BioMart online tool is probably the easiest way to do it. There's a video tutorial to get you started. Just filter by your list of gene names (ID list: Associated gene name) and get the coordinates as attributes.

ADD REPLY • link 9.4 years ago by Emily 24k

0

Thanks Emily, that is indeed very simple. Do you know how the gene start and end positions are defined? Transcription start/end, or perhaps coding start/end?

ADD REPLY • link 9.4 years ago by rbagnall ★ 1.8k

1

The 5' transcription start of the most 5' transcript and the 3' transcription end of the most 3' transcript.

ADD REPLY • link 9.4 years ago by Emily 24k

0

Thanks venu. I solved my problem with the biomaRt package! The other approaches you suggested also look very useful.

ADD REPLY • link 9.3 years ago by cfarmeri ▴ 210

score 0 · Answer 2 · 2016-03-07

Using the UCSC Table Browser, select the following options:

genome: mouse
assembly: Dec. 2011 (GRCm38/mm10)
group: Genes and Gene predictions
track: RefSeq Genes
table: refGene
region: genome
Identifiers (names/accessions): paste list {then paste the list of gene names}
output format: selected fields from primary and related tables
output file: results.txt

Then click 'get output' and in the following window select:

name
chrom
strand
txStart
txEnd
name2

This gives the following output in results.txt:

chrom strand txStart txEnd name2
chr3 + 34649994 34652460 Sox2
chr4 + 147021849 147060799 Rex2
chr6 + 122707564 122714633 Nanog
chr6 + 122707564 122714633 Nanog
chr6 + 122707488 122714633 Nanog
chr6 + 122707564 122714633 Nanog
chr10 + 78042286 78063622 Dnmt3l
chr10 + 78049958 78063622 Dnmt3l
chr10 + 78055334 78063622 Dnmt3l
chr10 + 78055334 78063622 Dnmt3l
chr10 + 78055334 78063622 Dnmt3l
chr10 + 78049841 78063622 Dnmt3l

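Change to the required format with the following command (uniq removes only adjacent duplicates, so a repeated row can survive; replacing it with sort -u drops them all):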

awk '{print$5","$1":"$3"-"$4","$2}' results.txt | uniq

This gives:

Sox2,chr3:34649994-34652460,+
Rex2,chr4:147021849-147060799,+
Nanog,chr6:122707564-122714633,+
Nanog,chr6:122707488-122714633,+
Nanog,chr6:122707564-122714633,+
Dnmt3l,chr10:78042286-78063622,+
Dnmt3l,chr10:78049958-78063622,+
Dnmt3l,chr10:78055334-78063622,+
Dnmt3l,chr10:78049841-78063622,+

Why are there multiple rows for some genes? Because these genes have more than one transcript, and refGene lists one row per transcript.
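To get one line per gene instead of one per transcript, the rows can be collapsed by taking the smallest txStart and the largest txEnd for each gene. The sketch below assumes the five-column layout of results.txt shown above (chrom, strand, txStart, txEnd, name2); the function name collapse_transcripts is made up for this example. It also adds 1 to the start, because UCSC tables store 0-based start positions, whereas the coordinates in the question look 1-based.

```shell
# Collapse per-transcript rows into one "symbol,chr:start-end,strand" line per gene.
# Input columns (whitespace separated): chrom strand txStart txEnd name2
collapse_transcripts() {
    awk '
        # Skip the header and any line whose third column is not a number
        $3 !~ /^[0-9]+$/ { next }
        {
            key = $5 SUBSEP $1 SUBSEP $2      # gene symbol + chromosome + strand
            if (!(key in lo) || $3 + 0 < lo[key]) lo[key] = $3 + 0   # smallest txStart
            if (!(key in hi) || $4 + 0 > hi[key]) hi[key] = $4 + 0   # largest txEnd
        }
        END {
            for (key in lo) {
                split(key, f, SUBSEP)
                # txStart is 0-based in UCSC tables, so add 1 for a 1-based start
                print f[1] "," f[2] ":" (lo[key] + 1) "-" hi[key] "," f[3]
            }
        }
    ' "$@" | sort
}

# Usage:
#   collapse_transcripts results.txt
```

With the rows shown above, Nanog collapses to Nanog,chr6:122707489-122714633,+, which matches the coordinates given in the question. Grouping by symbol, chromosome, and strand together keeps a symbol that maps to more than one locus on separate lines rather than merging them into one oversized span.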

score 0 · Answer 3 · 2016-03-14

The R package org.Hs.eg.db provides a mapping object called org.Hs.egREFSEQ2EG (https://www.bioconductor.org/packages/3.3/data/annotation/manuals/org.Hs.eg.db/man/org.Hs.eg.db.pdf).

I always use this for various conversions. Here is a snippet of R code for you to play around with:

library(org.Hs.eg.db)  # human annotation package; org.Mm.eg.db is the mouse counterpart

x <- org.Hs.egREFSEQ2EG
# Get the RefSeq identifiers that are mapped to an Entrez gene ID
mapped_seqs <- mappedkeys(x)
# Convert to a list
xx <- as.list(x[mapped_seqs])
if (length(xx) > 0) {
  # Show the Entrez gene IDs for the first five RefSeq identifiers
  print(xx[1:5])
  # Show the first one
  print(xx[[1]])
}

# Now convert the mapping to a data frame
mydb.refseq <- as.data.frame(x)