How to identify which genes fall within genomic locations
1
0
Entering edit mode
8.2 years ago
BioProg • 0

I have an extended list of gene locations (chr#, start and end site) with respect to hg19 as extracted from not very specific alignments. I would like to find which genes fall in each of the regions (which consists of >20,000). For instance:

chr1:33183334-33190838
chr10:93066718-93371217

I would like to see what genes fall between each.. indeed, if I had just a handful, I could go to genome.browser and just look there and jot down the genes. However, I need this automated as I have > 20,000 of those and not all of them fall perfectly within the exact start to end of a gene. So if it was in the middle of a gene, I still need to account for that. I think if I have a file of the following columns: GeneID (NOT Transcript), Gene Name, chromosome, Start and End (i.e. bed file) but with respect to hg19 and with the GENE id (not transcript ID) as identifier then I could use that and write a python code. But I couldn't find such a file with GENE ID at beginning. If anyone has any suggestions, please let me know.

genomic-location hg19 • 2.4k views
ADD COMMENT
1
Entering edit mode

No need to write code. This will do what you need: http://bedtools.readthedocs.org/en/latest/content/tools/intersect.html

ADD REPLY
0
Entering edit mode

oh nice, but what would my reference to intersect to be? That's the problem I can't find a file with "GeneID" "GeneName" Chromosome location.

ADD REPLY
1
Entering edit mode

You create one. See my answer.

ADD REPLY
0
Entering edit mode

Here is what I did:

I downloaded a reference file from UCSC hg19 and fixed columns to have this structure:

chr1    11868   14409   ENST00000456328 ENSG00000223972 +   3
chr1    11871   14412   ENST00000515242 ENSG00000223972 +   3
chr1    11873   14409   ENST00000518655 ENSG00000223972 +   4
chr1    12009   13670   ENST00000450305 ENSG00000223972 +   6
chr1    14362   29370   ENST00000423562 ENSG00000227232 -   10
chr1    14362   24886   ENST00000541675 ENSG00000227232 -   9
chr1    14362   29370   ENST00000438504 ENSG00000227232 -   12
chr1    14403   29570   ENST00000488147 ENSG00000227232 -   11
chr1    14410   29806   ENST00000538476 ENSG00000227232 -   13
chr1    29553   31097   ENST00000473358 ENSG00000243485 +   3
chr1    30266   31109   ENST00000469289 ENSG00000243485 +   2
chr1    30365   30503   ENST00000607096 ENSG00000243485 +   1

Then, I got my input file (which has the query locations) as such:

chr1    30266   31109   linc|3P|XLOC_000002|TCONS_00000003:0.587988|87AA|36AA|
chr1    139789  140339  linc|3P|XLOC_002022|TCONS_00012934:43.2908|71AA|60AA|
chr1    141424  149707  linc|3P|XLOC_002023|TCONS_00012935:15.2463|111AA|90AA|
chr1    141473  149707  linc|3P|XLOC_002023|TCONS_00012937:24.0251|111AA|90AA|
chr1    142807  146831  linc|3P|XLOC_002023|TCONS_00012938:17.427|88AA|40AA|
chr1    160445  161525  linc|3P|XLOC_000004|TCONS_00000005:0.411134|53AA|33AA|
chr1    230207  259183  linc|3P|XLOC_002021|TCONS_00012944:0.359946|78AA|40AA|

Then I used the command:

bedtools intersect -a QuerryList.bed.xls -b ensGene.txt > Overlap.bed

My understanding is that each row in the querrylist file will be intersected with the ensGene.txt. However, the output file had the same # of lines as ensGene (which is 3 x my query list) and did not get me the output of interest. I tried swathing -a and -b but that did not help.

Any advice?

ADD REPLY
0
Entering edit mode

I double checked. my bad, the result is just a representation of multiple overlaps. I thought it just matches one to one. Thanks.

ADD REPLY
0
Entering edit mode
8.2 years ago

Download annotations for your genome and turn them into a BED file, e.g.:

$ wget -qO- ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_21/gencode.v21.annotation.gff3.gz \
    | gunzip --stdout - \
    | gff2bed - \
    | awk '$8=="gene"' \
    > genes.bed

If you want HGNC gene symbols, you could run the following command (assuming hg19):

$ mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -N -e "SELECT k.chrom, kg.txStart, kg.txEnd, x.geneSymbol FROM knownCanonical k, knownGene kg, kgXref x WHERE k.transcript = x.kgID AND k.transcript = kg.name" hg19 > genes.bed

However you get genes into BED format, you can then map those genes to your ad-hoc region with BEDOPS bedmap, e.g.:

$ echo -e 'chr1\t33183334\t33190838' | bedmap --echo --echo-map-id-uniq --delim '\t' - genes.bed > answer.bed

If you have regions in a sorted BED file:

$ bedmap --echo --echo-map-id-uniq --delim '\t' regions.bed genes.bed > answer.bed
ADD COMMENT
0
Entering edit mode

Hi Alex- I am not an expert in this. I was able to download file by going to the link below (I dont have wget function)

ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_21/gencode.v21.annotation.gff3.gz

I unzipped it, I then installed BEDOPS and tried:

gff2bed --input=unzipd.gencode.v21.annotation.gff3 | awk '$8=="gene"' > genes.bed

However, it doesn't work. it gives me an empty genes.bed file. Further, I dont know how to get the hg19 version (as that's what my coordinates are). I google v21 and it seems to be hg39. I still didn't get to the bedmap step ....

ADD REPLY
0
Entering edit mode

Hi Alex- I am not an expert in this. I was able to download file by going to the link below (I dont have wget function)

ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_21/gencode.v21.annotation.gff3.gz

I unzipped it, I then installed BEDOPS and tried:

gff2bed --input=unzipd.gencode.v21.annotation.gff3 | awk '$8=="gene"' > genes.bed

However, it doesn't work. it gives me an empty genes.bed file. Further, I dont know how to get the hg19 version (as that's what my coordinates are). I google v21 and it seems to be hg39. I still didn't get to the bedmap step ....

ADD REPLY
0
Entering edit mode

Can you note what platform and version of things you are running? It should be pretty easy to get you set up with tools.

There are different releases on Gencode's site. You can look at the reference releases section to get the links for hg19 data.

For example, release 19 includes gene annotations called for hg19.

Add a brief note as to what you're running and we can suggest ways to install wget, curl or other similar tools. Once you have downloaded the annotations, running bedmap gives you output in an easier format.

ADD REPLY

Login before adding your answer.

Traffic: 1960 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6