From gene symbol list to their coordinates
8.6 years ago
cfarmeri ▴ 210

Hi, Biostars. I have a list of gene symbols (a plain text file), as in the following example.

=============

Nanog

Sox2

Dnmt3l

Rex2

=============

From this, I would like to get these genes' coordinates, as follows.

=============

Nanog,Chr6:122707489-122714633,+

Sox2,Chr3:34650405-34652461,+

Dnmt3l,Chr10:78041947-78063622,+

Rex2,Chr4:147021850-147060794,+

=============

What is the simplest way you know of to do this?

Any and all suggestions are welcome!!

gene symbol coordinate • 7.3k views
8.6 years ago
venu 7.1k

One way is to use the biomaRt package from Bioconductor. If you don't want to use this method, the other way is

Both methods require a little programming knowledge (R for the first method and some simple Unix commands for the second). I think the online tool Ensembl BioMart provides this information, though I've never used it.


The BioMart online tool is probably the easiest way to do it. There's a video tutorial to get you started. Just filter by your list of gene names (ID list: Associated gene name) and get the coordinates as attributes.
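Once BioMart has exported the table, it still needs reshaping into the format asked for in the question. The sketch below assumes a tab-separated export with a header row and the attributes in this order: gene name, chromosome, gene start, gene end, strand. That column order is an assumption (it depends on which attributes you tick), and `biomart_to_coords` is a hypothetical helper name. Ensembl reports strand as 1 or -1, so it is mapped to + or -.

```shell
# Hypothetical helper: reformat a BioMart TSV export.
# Assumed column order: name, chromosome, start, end, strand (1 or -1).
biomart_to_coords() {
  awk -F '\t' '
    $3 ~ /^[0-9]+$/ {                      # skip the header row (start is not a number there)
      strand = ($5 + 0 == -1) ? "-" : "+"  # Ensembl strand: 1 is forward, -1 is reverse
      print $1 ",Chr" $2 ":" $3 "-" $4 "," strand
    }' "$1"
}

# Usage (the file name is an example): biomart_to_coords mart_export.txt
```

Note that the current Ensembl release may use a newer mouse assembly than the coordinates quoted in the question, so the numbers can differ.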


Thanks Emily, that is indeed very simple. Do you know how the gene start and end position is defined? Transcription start/end, or perhaps coding start/end?


The gene start is the 5' transcription start of the most 5' transcript, and the gene end is the 3' transcription end of the most 3' transcript.


Thanks venu. I solved my problem with the biomaRt package!! So thanks!!! The other ways you suggested also look very useful.

8.6 years ago
rbagnall ★ 1.8k

Using the UCSC Table Browser, select the following options:

genome: mouse
assembly: Dec. 2011 (GRCm38/mm10)
group: Genes and Gene predictions
track: RefSeq Genes
table: refGene
region: genome
Identifiers (names/accessions): paste list {then paste the list of gene names}
output format: selected fields from primary and related tables
output file: results.txt

Then click 'get output' and in the following window select:

name
chrom
strand
txStart
txEnd
name2

This gives the following output in results.txt:

chrom strand txStart txEnd name2
chr3 + 34649994 34652460 Sox2
chr4 + 147021849 147060799 Rex2
chr6 + 122707564 122714633 Nanog
chr6 + 122707564 122714633 Nanog
chr6 + 122707488 122714633 Nanog
chr6 + 122707564 122714633 Nanog
chr10 + 78042286 78063622 Dnmt3l
chr10 + 78049958 78063622 Dnmt3l
chr10 + 78055334 78063622 Dnmt3l
chr10 + 78055334 78063622 Dnmt3l
chr10 + 78055334 78063622 Dnmt3l
chr10 + 78049841 78063622 Dnmt3l

Change to the required format with:

awk '{print$5","$1":"$3"-"$4","$2}' results.txt | uniq

This gives:

Sox2,chr3:34649994-34652460,+
Rex2,chr4:147021849-147060799,+
Nanog,chr6:122707564-122714633,+
Nanog,chr6:122707488-122714633,+
Nanog,chr6:122707564-122714633,+
Dnmt3l,chr10:78042286-78063622,+
Dnmt3l,chr10:78049958-78063622,+
Dnmt3l,chr10:78055334-78063622,+
Dnmt3l,chr10:78049841-78063622,+
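The Nanog row starting at 122707564 still appears twice above because `uniq` only collapses identical lines that are adjacent. A possible variant, assuming the same five whitespace-separated columns in results.txt, removes duplicates wherever they occur while keeping the original order. `format_unique` is a hypothetical helper name.

```shell
# Hypothetical helper, not part of the original answer.
# Expects whitespace-separated columns: chrom strand txStart txEnd name2.
format_unique() {
  awk '
    $3 ~ /^[0-9]+$/ {                   # skip the header row (txStart is not a number there)
      line = $5 "," $1 ":" $3 "-" $4 "," $2
      if (!seen[line]++) print line     # print each formatted line only the first time
    }' "$1"
}

# Usage: format_unique results.txt
```

Piping the original command through `sort -u` instead of `uniq` would also drop the repeats, at the cost of reordering the lines.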

Why are there multiple rows for some genes? Because these genes have more than one transcript, and each transcript gets its own row.
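If one row per gene is wanted, the transcript rows can be collapsed by taking the smallest txStart and the largest txEnd for each gene symbol. This is only a sketch: `collapse_to_genes` is a hypothetical helper, it assumes the five-column results.txt shown above, and it assumes each symbol occurs on a single chromosome and strand.

```shell
# Hypothetical helper: one output row per gene symbol (column 5),
# spanning from the smallest txStart to the largest txEnd.
collapse_to_genes() {
  awk '
    $3 !~ /^[0-9]+$/ { next }           # skip the header row
    {
      gene = $5
      if (!(gene in lo)) {              # first transcript seen for this gene
        order[++n] = gene               # remember the order of first appearance
        chrom[gene] = $1
        strand[gene] = $2
        lo[gene] = $3 + 0
        hi[gene] = $4 + 0
      }
      if ($3 + 0 < lo[gene]) lo[gene] = $3 + 0
      if ($4 + 0 > hi[gene]) hi[gene] = $4 + 0
    }
    END {
      for (i = 1; i <= n; i++) {
        g = order[i]
        print g "," chrom[g] ":" lo[g] "-" hi[g] "," strand[g]
      }
    }' "$1"
}

# Usage: collapse_to_genes results.txt
```

On the results.txt above this should give:

Sox2,chr3:34649994-34652460,+
Rex2,chr4:147021849-147060799,+
Nanog,chr6:122707488-122714633,+
Dnmt3l,chr10:78042286-78063622,+

UCSC tables store txStart as a 0-based position, which is why the Nanog start here is one less than the 122707489 in the question.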


rbagnall, thanks. My genes are UCSC gene symbols, so your suggestion is very suitable.

8.6 years ago
MAPK ★ 2.1k

There is a mapping object called org.Hs.egREFSEQ2EG in the R package org.Hs.eg.db (https://www.bioconductor.org/packages/3.3/data/annotation/manuals/org.Hs.eg.db/man/org.Hs.eg.db.pdf)

I always use this for various conversions. Here is a snippet of R code for you to play around with:

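    # Load the annotation package that provides org.Hs.egREFSEQ2EG
    # (org.Hs.eg.db covers human; org.Mm.eg.db is the mouse counterpart)
    library(org.Hs.eg.db)

    x <- org.Hs.egREFSEQ2EG
    # Get the RefSeq identifiers that are mapped to an Entrez Gene ID
    mapped_seqs <- mappedkeys(x)
    # Convert to a list
    xx <- as.list(x[mapped_seqs])
    if (length(xx) > 0) {
      # Get the Entrez Gene IDs for the first five RefSeq identifiers
      xx[1:5]
      # Get the first one
      xx[[1]]
    }

    # Now convert the mapping to a data frame
    mydb.refseq <- as.data.frame(x)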