Question

finding function of genes from their coordinates

0

Entering edit mode

6.2 years ago

Ana ▴ 200

Hi everyone,

I made some Manhattan plots and I can see some very tall peaks on chromosome 6 and 12. I already have the coordinates of the gens (chromosome, starting and ending positions), I want to know the function. I am new in this and does anyone have any idea where should I start?

gene annotation • 1.5k views

ADD COMMENT • link 6.2 years ago by Ana ▴ 200

score 1 · Answer 1 · 2018-01-17

One idea is to:

Map genomic positions ("coordinates") to gene name annotations using set operation tools like BEDOPS bedmap
Convert gene name annotations to a format that can be used with gene ontology ("GO") tools, such as EMBL format, in order to work with EMBL's QuickGO tool
Run those converted gene names through the GO tool-of-choice to get functional annotations

As a more concrete example of this workflow, you could convert a sort-bed sorted BED file of coordinates of interest (intervals.bed) to a text file of HGNC symbols/names with BEDOPS convert2bed and bedmap.

This uses Gencode annotations that are hg38; make adjustments depending on your genome:

$ sort-bed intervals.unsorted.bed > intervals.bed

$ wget -qO- ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_21/gencode.v21.annotation.gff3.gz \
    | gunzip --stdout - \
    | awk '$3 == "gene"' \
    | convert2bed -i gff - \
    > gencodeV21.genes.bed

$ paste <(cut -f1-3 gencodeV21.genes.bed) <(grep -Po 'gene_name=\K[^ ;]+' gencodeV21.genes.bed) > gencodeV21.HGNCgenes.bed

$ bedmap --echo-map-id --multidelim '\n' --skip-unmapped intervals.bed gencodeV21.HGNCgenes.bed > genes.txt

(Note: If you're on OS X, use Homebrew to install GNU grep and replace grep with ggrep.)

If you have a text file of HGNC symbols/names on each line, you can convert them to EMBL names with the following Python script:

#!/usr/bin/env python

import sys
import mygene

mg = mygene.MyGeneInfo()

genes = []
for line in sys.stdin:
    genes.append(line.strip())

results = mg.querymany(genes, scopes="symbol", fields=["uniprot"], species="human", verbose=False)

for res in results:
    name = res["query"]
    if "uniprot" in res and "TrEMBL" in res["uniprot"]:
        recs = res["uniprot"]["TrEMBL"]
        if isinstance(recs, basestring):
            sys.stdout.write("%s\t%s\n" % (name, recs))
        else:
            for rec in recs:
                sys.stdout.write("%s\t%s\n" % (name, rec))

For example, if you start with a text file called genes.txt:

DDX26B
CCDC83
MAST3
RPL11
ZDHHC20
LUC7L3
SNORD49A
CTSH
ACOT8

Then you can run the script on the genes to map them to EMBL equivalents:

$ ./hgnc_to_embl.py < genes.txt > genes_mapping.txt

The resulting file would look something like:

CCDC83  H0YDV3
MAST3   V9GYV0
RPL11   Q5VVC8
RPL11   Q5VVC9
RPL11   Q5VVD0
ZDHHC20 A0A0D9SEN4
ZDHHC20 B4DRN8
LUC7L3  A8K3C5
LUC7L3  C9JL41
LUC7L3  D6RDI2
LUC7L3  D6RHH0
LUC7L3  E7EN55
LUC7L3  H0YA81
LUC7L3  H0YAR4
LUC7L3  H0YAX1
LUC7L3  H0YAY6
LUC7L3  H0YBV7
LUC7L3  H7C5U7
LUC7L3  J3KPP4
LUC7L3  Q86Y74
LUC7L3  U3KQT3
CTSH    A0A087X0D5
CTSH    A0A0B4J217
CTSH    E9PKT6
CTSH    E9PN60
CTSH    E9PN84
ACOT8   E9PIS4
ACOT8   E9PJN0
ACOT8   E9PMC4
ACOT8   E9PRD4
ACOT8   F6VBM3
ACOT8   H0Y698
ACOT8   H7C5A7
ACOT8   Q9BR14

(Note that some HGNC names do not map to EMBL names. Consider how this issue might affect any statistical tests you do with GO annotations.)

To get a text file containing just the EMBL names:

$ cut -f2 genes_mapping.txt > embl.txt

Once you have those EMBL names, you can submit them to QuickGO and export the results:

Visit: http://www.ebi.ac.uk/QuickGO/
Click on the "View GO Annotations" panel
Click on the "Gene Product ID" button in the toolbar
Copy and paste the names from embl.txt into the text field (or paste in your result set from running things locally with your genes of interest)
Click the "Add" button
Click the "Apply" button to perform the search
Click the "Export" button to export the search results

score 0 · Answer 2 · 2018-01-17

0

Entering edit mode

6.2 years ago

pfs ▴ 280

I would start here https://www.ncbi.nlm.nih.gov/protein/. If you have a manhattan plot of SNPs then you may want to annotate the SNPs using an annotation program such ANNOVAR to determine the what your SNP may actually be effecting.

ADD COMMENT • link 6.2 years ago by pfs ▴ 280