Where can I get a comprehensive list of genes with gene start, gene end, and chromosome for build 37?
6 months ago
Star ▴ 50

Hi all,

I am trying to annotate a list of genes with gene start, gene end (build 37), and chromosome. I mapped most of the genes using a list downloaded from BioMart/UCSC, but 25 genes are still missing from that list, for example PRAG1 and CCL4L2. I found a file containing these genes, http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/refGene.txt.gz, but when I checked the b37 positions of some random genes (e.g., , 93854920 - 93954309), they do not match the coordinates shown in UCSC. Is there any workaround? My goal is simple, but it turns out to be more complicated than I expected :(

Any leads would be much appreciated.

genome build37 genes R • 409 views
6 months ago

You could get genes from Gencode: https://www.gencodegenes.org/human/release_39lift37.html

Then convert them from GFF to BED, pulling out the desired gene name:

$ wget -qO- https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_39/GRCh37_mapping/gencode.v39lift37.annotation.gff3.gz | gunzip -c | awk '($3=="gene")' | convert2bed -i gff --attribute-key="gene_name" - > genes.bed


Passing --attribute-key="gene_name" to convert2bed -i gff will retrieve the HGNC symbol from the GFF file (PRAG1, CCL4L2, etc.) where available, and place it in the ID field of the resulting BED file. If the HGNC symbol is not available, the Ensembl gene ID will be used instead.
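
Once genes.bed exists, the remaining step is a lookup of your gene list against its ID column. Here is a rough sketch of that join with awk. To keep it self-contained it uses a toy BED file rather than the real genes.bed; the file names (toy_genes.bed, my_genes.txt, annotated.tsv), the gene symbols, and the coordinates are all made up for illustration. It also assumes you want 1-based starts in the output: BED starts are 0-based, so the sketch adds 1 to the start column, which is one common reason coordinates appear shifted by one when comparing sources.

```shell
# Toy stand-in for genes.bed (hypothetical genes and coordinates, tab-separated):
# chrom, 0-based start, end, ID (gene symbol), score, strand
printf 'chr1\t999\t2000\tGENE_A\t.\t+\n'  > toy_genes.bed
printf 'chr2\t4999\t7500\tGENE_B\t.\t-\n' >> toy_genes.bed

# Hypothetical list of symbols to annotate, one per line
printf 'GENE_B\nGENE_A\nGENE_X\n' > my_genes.txt

# First file (NR == FNR): store chromosome, 1-based start, and end per symbol.
# Second file: print symbol, chromosome, start, end, or NA when the symbol is absent.
awk 'BEGIN { FS = OFS = "\t" }
     NR == FNR { chrom[$4] = $1; gstart[$4] = $2 + 1; gend[$4] = $3; next }
     ($1 in chrom) { print $1, chrom[$1], gstart[$1], gend[$1]; next }
     { print $1, "NA", "NA", "NA" }' toy_genes.bed my_genes.txt > annotated.tsv

cat annotated.tsv
```

With the real data you would swap toy_genes.bed for genes.bed. Note that if a symbol occurs on more than one BED line, this sketch keeps only the last one it reads, and symbols that come back as NA are the ones to check by hand (for instance, for an alias or an outdated name).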

6 months ago
Papyrus ★ 2.1k

Also, because you mention R, after retrieving gene information from a source like Alex's example, you could do:

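library(GenomicFeatures)
# Build a transcript database from the downloaded GENCODE GFF3 file
txdb <- makeTxDbFromGFF(file = "gencode.v39lift37.annotation.gff3.gz")
# One GRanges entry per gene, with chromosome, start, end, and strand
gene_ranges <- genes(txdb)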