Question

genes that map to hg38 coordinates

1

Entering edit mode

8.5 years ago

bioguy24 ▴ 230

Given a list of ~55,000 hg38 coordinates is there an easy way to get the gene that maps to it?I guess it is running genome browser in batch. Maybe there is a mysql command that will use the input? Thank you :).

input

 chr1:1013574-1013576
 chr1:1013984-1014478
 chr1:1020163-1020383

desired output

  chr1:1013574-1013576 ISG15
  chr1:1013984-1014478 ISG15
  chr1:1020163-1020383 AGRN

ngs • 5.9k views

ADD COMMENT • link 8.5 years ago by bioguy24 ▴ 230

0

Entering edit mode

Thank you all for the great solutions :).

ADD REPLY • link 8.5 years ago by bioguy24 ▴ 230

2

Entering edit mode

8.5 years ago

Charles Plessy ★ 2.9k

You can use R and Bioconductor's GenomicRanges package as follows:

First, represent your regions of interest as genomic ranges. For real data, you may load a BED file with the function rtracklayer::import.bed(). Here, we create by hand the three regions that you gave as example. The output of the commands is displayed in italics.

# In R, load the library.
library(GenomicRanges)

# Create an GRanges object (here called, 'r'). that contains the regions of interest.
r <- c( GRanges("chr1", IRanges(1013574,1013576))
      , GRanges("chr1", IRanges(1013984,1014478))
      , GRanges("chr1", IRanges(1020163,1020383)))
r

Output:

GRanges object with 3 ranges and 0 metadata columns: seqnames ranges strand <Rle> <IRanges> <Rle> [1] chr1 [1013574, 1013576] * [2] chr1 [1013984, 1014478] * [3] chr1 [1020163, 1020383] * ------- seqinfo: 1 sequence from an unspecified genome; no seqlengths

Then, load genomic ranges for an annotation, for instance GENCODE.

# Get GENCODE from the Bioconductor's Annotation Hub.
library(AnnotationHub)
ah <- AnnotationHub()
query(ah, c("Gencode", "gff", "human"))

Output:

AnnotationHub with 9 records # snapshotDate(): 2016-12-29 # $dataprovider: Gencode # $species: Homo sapiens # $rdataclass: GRanges # additional mcols(): taxonomyid, genome, description, tags, # sourceurl, sourcetype # retrieve records with, e.g., 'object[["AH49554"]]' title AH49554 | gencode.v23.2wayconspseudos.gff3.gz AH49555 | gencode.v23.annotation.gff3.gz AH49556 | gencode.v23.basic.annotation.gff3.gz AH49557 | gencode.v23.chr_patch_hapl_scaff.annotation.gff3.gz AH49558 | gencode.v23.chr_patch_hapl_scaff.basic.annotation.gff3.gz AH49559 | gencode.v23.long_noncoding_RNAs.gff3.gz AH49560 | gencode.v23.polyAs.gff3.gz AH49561 | gencode.v23.primary_assembly.annotation.gff3.gz AH49562 | gencode.v23.tRNAs.gff3.gz

The command above returned that the ID for GENCODE 23 is AH49556 and indicated which command to run next.

ah["AH49556"]
gc <- ah[["AH49556"]]

Then, intersect your regions of interest with the annotation.

overlaps <- findOverlaps(r, gc)

This returns an object telling which region overlaps with which annotation. The last step is to extract the gene names, collate them, and prepare your final output.

genes <- extractList(gc$gene_name, as(overlaps, "List"))
genes <- unstrsplit(unique(genes), ";") # Needed in case more than one gene overlaps.
paste(as.character(r), genes)

[1] "chr1:1013574-1013576 ISG15" "chr1:1013984-1014478 ISG15" [3] "chr1:1020163-1020383 AGRN"

You can find other examples on how to use findOverlaps() in other Biostars posts or in Bioconductor's support forum.

ADD COMMENT • link 8.5 years ago by Charles Plessy ★ 2.9k

1

Entering edit mode

8.5 years ago

newbiebio ▴ 80

Maybe you could try bedtools. First, go to ensembl and download refgenome's gtf file. Then use 'bedtools intersect -wo -a ref.gtf -b you_input.bed"

ADD COMMENT • link 8.5 years ago by newbiebio ▴ 80

score 4 · Accepted Answer · 2017-01-03

Another way is to download the HGNC names to a BED file, and use BEDOPS bedmap to map gene IDs to intervals.

First, download gene annotations and convert them to a sorted, four-column BED file:

$ curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz" | gunzip -c | cut -f3,5,6,13 | sort-bed - > genes.bed

Second, convert the intervals from your format into sorted BED and pipe that to bedmap, along with the genes.bed file created in the first step. The hyphen (-) in the bedmap call is a placeholder for standard input, which is data that is coming out of sort-bed upstream in the pipeline:

$ sed $'s/[:-]/\t/g' intervals.txt | sort-bed - | bedmap --echo --echo-map-id-uniq - genes.bed > answer.bed

The file answer.bed will be BED-formatted, but that can be easily munged back into your desired format by piping the output to awk:

$ sed $'s/[:-]/\t/g' intervals.txt | sort-bed - | bedmap --echo --echo-map-id-uniq - genes.bed | awk '{print $1":"$2"-"$3"\t"$4;}' > answer.txt