File conversion from coordinates to genes
1
0
Entering edit mode
3.1 years ago
storm1907 ▴ 30

Hi, I need to convert chromosomal location to gene name (preferably, HGNC nomenclature) in a txt file with 6 columns: From

1   chr1:183189 0   183189  G   C
1   chr1:609407 0   609407  0   0
1   chr1:609434 0   609434  0   0
1   chr1:609435 0   609435  G   G

to

1   genename    0   183189  G   C
1   genename    0   609407  0   0
1   genename    0   609434  0   0
1   genename    0   609435  G   G

Files are made in GRCH38. I looked in this forum for similar issue, but could not find appropriate solution. One option was with bedtools, but unfortunately, only bed files are suitable in that case. One more option included some outdated R packages, which I could not install.

Thank you!

conversion • 2.3k views
ADD COMMENT
0
Entering edit mode

but unfortunately, only bed files are suitable in that case.

use awk to convert reformat your input. Hint: split($2,a,/[:]/);printf("%s\t%d\t%s\n",a[1],int(a[2])-1,a[2]);}'

ADD REPLY
0
Entering edit mode

Sorry, I tried that syntax, but get error..

ADD REPLY
0
Entering edit mode

Also, where can I get bed files with HGNC gene names? From UCSC Table Browser I can only download table with transcripts

ADD REPLY
0
Entering edit mode

First part of Alex Reynolds answer below gives you that.

ADD REPLY
0
Entering edit mode

My answer can provide multiple HGNC records for one gene that has multiple transcripts. More information would be needed to filter these for canonical transcripts. And one genomic position could overlap different genes. But bedmap --echo-map-id-uniq will remove duplicate gene names, in either case.

ADD REPLY
0
Entering edit mode
3.1 years ago

You could use the BEDOPS kit to solve this.

$ awk -v FS="\t" -v OFS="\t" '{ n=split($2, a, ":"); print a[1], a[2]-1, a[2]; }' original_table.txt \
   | sort-bed - \
   > positions.bed

Get annotations:

$ wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/refGene.txt.gz \
  | gunzip -c - \
  | awk -v FS="\t" -v OFS="\t" '{ if (!match($13, /.*-[0-9]+/)) { print $3, $5, $6, $13, ".", $4; } }' - \
  | sort-bed - \
  > refGene.hg38.bed

Then map:

$ bedmap --echo --echo-map-id-uniq --delim '\t' --unmapped-val "." positions.bed refGene.hg38.bed > mapped.positions.bed

To put your mapped data into the desired format, you'll need to associate positions with your original file:

#!/usr/bin/env python

import sys

original_table_fn = sys.argv[1]
mapped_positions_fn = sys.argv[2]

''' build position table '''
t = {}
with open(original_table_fn, "r") as ifh:
    for line in ifh:
        elems = line.rstrip().split('\t')
        t[elems[1]] = elems 

''' map genes to position table '''
with open(mapped_positions_fn, "r") as ifh:
    for line in ifh:
        (chrom, start, stop, genes) = line.rstrip().split('\t')
        k = "{}:{}".format(chrom, stop)
        try:
            v = t[k]
            v[1] = genes # swap position key for mapped gene(s)
            sys.stdout.write('{}\n'.format('\t'.join(v)))
        except KeyError:
            sys.stderr.write('Warning: Position {} was not found\n'.format(k))

Usage:

$ ./reverse_map.py original_table.txt mapped.positions.bed > answer.txt
ADD COMMENT
0
Entering edit mode

Thank you! However, when running the last script, I got this error:

 Traceback (most recent call last):
  File "reverse_map.py", line 22, in <module>
    (chrom, start, stop, genes) = line.rstrip().split('\t')
ValueError: not enough values to unpack (expected 4, got 3)

mapped.positions.bed file start is this:

enter image description here

ADD REPLY
0
Entering edit mode

Because you have gene names missing from some of your rows you are getting that error.

ADD REPLY
0
Entering edit mode

Yes, I try to convert empty space to string, but without success:

awk 'BEGIN {OFS=FS="\t"} {gsub(/" "/,"na",$4)}1'  $file > $outfile1
ADD REPLY
0
Entering edit mode

At this moment, this command line helped to me:

awk -F'\t' 'BEGIN{OFS=FS} $4 == "" {$4 = "UNKNOWN"} 1' mapped.positions.bed > outfile
ADD REPLY
0
Entering edit mode

I forgot about the --unmapped-val operator. Using it would save you the awk step in your comment.

So in the current version of bedmap, you could also do the following:

$ bedmap --echo --echo-map-id-uniq --delim '\t' --unmapped-val "." positions.bed refGene.hg38.bed > mapped.positions.bed

You could replace . with UNKNOWN or na or whatever else is convenient.

See bedmap --help for a full listing of options.

ADD REPLY
0
Entering edit mode

Hello, also I was wondering, if it possible to write gene names together with SNP position as well in the 2nd column?

ADD REPLY
0
Entering edit mode

You could pipe the output of bedmap to awk and rewrite columns there:

$ bedmap ... | awk -v FS="\t" -v OFS="\t" '{ ... }' > answer.bed

Replace ... as needed.

Or, modify the Python script above to print out desired columns. It just depends on what you want, where.

ADD REPLY
0
Entering edit mode

Sorry, I am very new to programming. Is there any reference file also with NG positions, not only chromosomal positions? So that output could be:

1   genename:snp_postion_in_gene     0   183189  G   C 
1   genename:snp_postion_in_gene     0   609407  0   0
1   genename:snp_postion_in_gene     0   609434  0   0
1   genename:snp_postion_in_gene     0   609435  G   G
ADD REPLY

Login before adding your answer.

Traffic: 2620 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6