Question: From gene symbol list to their coordinates
cfarmeri (Japan) wrote, 3.4 years ago:

Hi, Biostars. I have a list of gene symbols (a plain text file), as in the following example.

=============
Nanog
Sox2
Dnmt3l
Rex2
=============

From this, I would like to get each gene's coordinates, as follows.

=============
Nanog,Chr6:122707489-122714633,+
Sox2,Chr3:34650405-34652461,+
Dnmt3l,Chr10:78041947-78063622,+
Rex2,Chr4:147021850-147060794,+
=============

What is the simplest way you know of to do this?

Any and all suggestions are welcome!

venu (Germany) wrote, 3.4 years ago:

One way is to use the biomaRt package from Bioconductor. If you don't want to use this method, the other way is

Both methods require a little programming knowledge (R for the first method and some simple Unix commands for the second). I think the online Ensembl BioMart tool provides this information, though I've never used it.


The BioMart online tool is probably the easiest way to do it. There's a video tutorial to get you started. Just filter by your list of gene names (ID list: Associated gene name) and get the coordinates as attributes.

(reply written 3.4 years ago by Emily_Ensembl)

Thanks Emily, that is indeed very simple. Do you know how the gene start and end positions are defined? Transcription start/end, or perhaps coding start/end?

(reply written 3.4 years ago by rbagnall)

The gene runs from the 5' transcription start of the most 5' transcript to the 3' transcription end of the most 3' transcript.

(reply written 3.4 years ago by Emily_Ensembl)

Thanks venu. I solved my problem with the biomaRt package! The other approaches you suggested also look very useful.

(reply written 3.4 years ago by cfarmeri)
rbagnall (Australia) wrote, 3.4 years ago:

Using the UCSC Table Browser, select the following options:

genome: mouse
assembly: Dec. 2011 (GRCm38/mm10)
group: Genes and Gene predictions
track: RefSeq Genes
table: refGene
region: genome
Identifiers (names/accessions): paste list {then paste the list of gene names}
output format: selected fields from primary and related tables
output file: results.txt

Then click 'get output' and in the following window select:

name
chrom
strand
txStart
txEnd
name2

This gives the following output in results.txt:

chrom strand txStart txEnd name2
chr3 + 34649994 34652460 Sox2
chr4 + 147021849 147060799 Rex2
chr6 + 122707564 122714633 Nanog
chr6 + 122707564 122714633 Nanog
chr6 + 122707488 122714633 Nanog
chr6 + 122707564 122714633 Nanog
chr10 + 78042286 78063622 Dnmt3l
chr10 + 78049958 78063622 Dnmt3l
chr10 + 78055334 78063622 Dnmt3l
chr10 + 78055334 78063622 Dnmt3l
chr10 + 78055334 78063622 Dnmt3l
chr10 + 78049841 78063622 Dnmt3l

Change to the required format with:

awk '{print$5","$1":"$3"-"$4","$2}' results.txt | uniq

This gives:

Sox2,chr3:34649994-34652460,+
Rex2,chr4:147021849-147060799,+
Nanog,chr6:122707564-122714633,+
Nanog,chr6:122707488-122714633,+
Nanog,chr6:122707564-122714633,+
Dnmt3l,chr10:78042286-78063622,+
Dnmt3l,chr10:78049958-78063622,+
Dnmt3l,chr10:78055334-78063622,+
Dnmt3l,chr10:78049841-78063622,+
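
The repeated Nanog row above survives because uniq only collapses adjacent duplicate lines. A minimal sketch of an order-preserving alternative follows. It assumes results.txt is whitespace-delimited with the columns chrom, strand, txStart, txEnd, name2 as shown above; dedup_coords is a hypothetical helper name, not part of any tool.

```shell
# Hypothetical helper: format rows and drop repeats, keeping first-seen order.
# Assumed columns: chrom strand txStart txEnd name2 (whitespace-delimited).
dedup_coords() {
  awk '$3 ~ /^[0-9]+$/ {   # skip header lines (txStart is not a number there)
         line = $5 "," $1 ":" $3 "-" $4 "," $2
         if (!seen[line]++) print line   # print each formatted row only once
       }' "$@"
}

# Demo on the four Nanog rows from the table above
dedup_coords <<'EOF'
chrom strand txStart txEnd name2
chr6 + 122707564 122714633 Nanog
chr6 + 122707564 122714633 Nanog
chr6 + 122707488 122714633 Nanog
chr6 + 122707564 122714633 Nanog
EOF
```

On your own file, call it as `dedup_coords results.txt`. Piping the original awk output through `sort -u` would also remove the repeats, at the cost of reordering the rows.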

Why are there multiple rows for some genes? Because these genes have more than one transcript, and refGene has one row per transcript.
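
If one row per gene is wanted, the transcript rows can be collapsed into a single span running from the smallest txStart to the largest txEnd, which mirrors the gene start/end definition Emily_Ensembl gives earlier in this thread. This is only a sketch under the same column assumptions as the table above (chrom, strand, txStart, txEnd, name2); gene_spans is a hypothetical helper name.

```shell
# Hypothetical helper: one line per gene, spanning all of its transcripts.
# Assumed columns: chrom strand txStart txEnd name2 (whitespace-delimited).
gene_spans() {
  awk '$3 ~ /^[0-9]+$/ {   # skip header lines (txStart is not a number there)
         key = $5 SUBSEP $1 SUBSEP $2   # gene name + chromosome + strand
         if (!(key in lo)) { order[++n] = key; lo[key] = $3; hi[key] = $4 }
         if ($3 + 0 < lo[key] + 0) lo[key] = $3   # keep the smallest txStart
         if ($4 + 0 > hi[key] + 0) hi[key] = $4   # keep the largest txEnd
       }
       END {
         # Emit genes in the order they first appeared in the input
         for (i = 1; i <= n; i++) {
           split(order[i], f, SUBSEP)
           print f[1] "," f[2] ":" lo[order[i]] "-" hi[order[i]] "," f[3]
         }
       }' "$@"
}

# Demo on a few Nanog and Dnmt3l rows from the table above
gene_spans <<'EOF'
chrom strand txStart txEnd name2
chr6 + 122707564 122714633 Nanog
chr6 + 122707488 122714633 Nanog
chr10 + 78042286 78063622 Dnmt3l
chr10 + 78055334 78063622 Dnmt3l
EOF
```

On your own file, call it as `gene_spans results.txt`. UCSC tables store txStart as a 0-based position, so add 1 to the start if you need 1-based coordinates.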


rbagnall, thanks. My genes are UCSC gene symbols, so your suggestion suits my data well.

(reply written 3.4 years ago by cfarmeri)
MAPK (United States) wrote, 3.4 years ago:

There is a mapping object called org.Hs.egREFSEQ2EG in the Bioconductor R package org.Hs.eg.db (https://www.bioconductor.org/packages/3.3/data/annotation/manuals/org.Hs.eg.db/man/org.Hs.eg.db.pdf)

I always use this for various conversions. Here is a snippet of R code for you to play around with:

library(org.Hs.eg.db)  # provides the org.Hs.egREFSEQ2EG mapping

x <- org.Hs.egREFSEQ2EG
# Get the RefSeq identifiers that are mapped to an Entrez Gene ID
mapped_seqs <- mappedkeys(x)
# Convert to a list
xx <- as.list(x[mapped_seqs])
if (length(xx) > 0) {
  # Get the Entrez Gene IDs for the first five RefSeqs
  xx[1:5]
  # Get the first one
  xx[[1]]
}

# Now convert this to a data frame
mydb.refseq <- as.data.frame(x)