Question: Get the genes between certain chromosomal positions
3.5 years ago, akij wrote:

I know this is a very common question, but my input list is huge, so it would be nice if I could supply the whole list at once and get the output.

I have a huge list of chromosomal positions in an Excel file that looks like the one below. I want to find the genes between each pair of start and end positions.

chr       start          end        prob_no
1   196792690   196829513       36823
1   248572774   248633821       61047
2   41011303    41024125        12822
2   52516787    52554404        376176
3   189645889   189653199        7310
3   193160173   193165114        4941

How can I get the gene lists? It would also be nice if I could get the output like this:

chr       start          end        prob_no  genes
1   196792690   196829513       36823      gene1, gene2.....

Hi akij,

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted.


Reply written 3.5 years ago by WouterDeCoster

OP has not been seen for >1 year. I'll toggle all upvoted answers as accepted.

Reply written 14 months ago by RamRS
3.5 years ago, Alex Reynolds (Seattle, WA USA) wrote:

BEDOPS bedmap does exactly what you need, outputting results in the format you want:

$ bedmap --echo --echo-map-id-uniq positions.bed genes.bed > answer.bed

If you have gene annotations in GFF or GTF format, you can convert them via BEDOPS gff2bed or gtf2bed, e.g.:

$ bedmap --echo --echo-map-id-uniq positions.bed <(gff2bed < annotations.gff) > answer.bed


$ bedmap --echo --echo-map-id-uniq positions.bed <(gtf2bed < annotations.gtf) > answer.bed

The default criterion for mapping is one or more bases of overlap between reference and annotation intervals. Add --fraction-map 1 if you want to report only the gene annotations that fall entirely within the reference interval.
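bedmap expects both inputs as BED files sorted by chromosome and start position, so the spreadsheet has to be exported and converted first. Here is a minimal sketch, assuming the sheet was saved as tab-delimited text with one header row and the four columns shown in the question. The function name table_to_bed and the file name positions.tsv are placeholders, not part of BEDOPS:

```shell
# Hypothetical helper: convert a text export of the spreadsheet to sorted 4-column BED.
# $1: tab- or space-delimited file with one header row (chr, start, end, prob_no)
table_to_bed() {
    # Strip Windows line endings that Excel exports often carry,
    # skip the header row, and keep rows that have all four columns.
    tr -d '\r' < "$1" |
        awk -v OFS='\t' 'NR > 1 && NF >= 4 { print $1, $2, $3, $4 }' |
        LC_ALL=C sort -k1,1 -k2,2n   # chromosome (byte order), then numeric start
}

# Usage (file names are placeholders):
#   table_to_bed positions.tsv > positions.bed
```

Two things are worth checking before running bedmap. Chromosome names must match those in the annotation (for example 1 versus chr1), otherwise nothing overlaps. BED starts are 0-based, so if your coordinates are 1-based you may want to subtract 1 from the start column; the sketch above passes the values through unchanged.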

14 months ago, arshiaurora (United States) wrote:

You can do this in R with the biomaRt package. See my code below:

library(biomaRt)

# Use Ensembl
ensembl = useMart("ensembl")

# BioMart databases can contain several datasets; for Ensembl, every species is a different dataset.
# Use Homo sapiens
ensembl = useDataset("hsapiens_gene_ensembl", mart=ensembl)

# Say we want to find genes on chromosome 10 between 86000000 and 87863438
getBM(c('ensembl_gene_id','hgnc_symbol','description'), filters=c('chromosome_name','start','end'), values=list(10, 86000000, 87863438), mart=ensembl)[1:5,]

  ensembl_gene_id hgnc_symbol
1 ENSG00000182771       GRID1
2 ENSG00000200487   RNA5SP322
3 ENSG00000199104      MIR346
4 ENSG00000287475
5 ENSG00000286359
                                                                            description
1 glutamate ionotropic receptor delta type subunit 1 [Source:HGNC Symbol;Acc:HGNC:4575]
2                  RNA, 5S ribosomal pseudogene 322 [Source:HGNC Symbol;Acc:HGNC:43222]
3                                      microRNA 346 [Source:HGNC Symbol;Acc:HGNC:31780]
4                                                                      novel transcript
5                                                                      novel transcript

For a list of chromosome positions like yours, process each row in a for loop and pass chrom[i], start[i], and end[i] as the values.


Really nice solution - one quick suggestion: I recommend one of the apply functions instead of a loop.

Reply written 14 months ago by RamRS

3.5 years ago, genomax (United States) wrote:

Take a look at bedtools intersect (see its documentation), which you can use with a GTF file of gene annotations for your genome.
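A GTF often lists transcripts, exons, and other features alongside genes, so intersecting against the whole file can report the same gene many times. One way to prepare the annotation is sketched below, assuming a standard nine-column GTF that contains gene records with a gene_id attribute and, optionally, a gene_name attribute. gtf_genes_to_bed is a hypothetical helper, not part of bedtools:

```shell
# Hypothetical helper: keep only gene-level GTF records and print them as sorted 4-column BED.
# $1: GTF file (tab-delimited; feature type in column 3, attributes in column 9)
gtf_genes_to_bed() {
    awk -F'\t' -v OFS='\t' '
        /^#/ { next }                      # skip header/comment lines
        $3 == "gene" {
            name = "."
            # Prefer gene_name; fall back to gene_id when no name is present.
            if (match($9, /gene_name "[^"]+"/))
                name = substr($9, RSTART + 11, RLENGTH - 12)
            else if (match($9, /gene_id "[^"]+"/))
                name = substr($9, RSTART + 9, RLENGTH - 10)
            # GTF starts are 1-based; BED starts are 0-based.
            print $1, $4 - 1, $5, name
        }' "$1" | LC_ALL=C sort -k1,1 -k2,2n
}

# Usage (file names are placeholders):
#   gtf_genes_to_bed annotations.gtf > genes.bed
#   bedtools intersect -a positions.bed -b genes.bed -wa -wb
```

With -wa -wb, bedtools intersect writes one line per overlapping region and gene pair, so the gene names still need to be collapsed per region if you want a single comma-separated list.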

3.5 years ago, Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote:

akij wrote: "I have a huge list of chromosomal positions in a excel file"

"Huge" and "Excel" are antinomic.

Have a look at the UCSC Table Browser to get the intersection with a gene track.
