Question

Get the genes between certain chromosomal positions

8.2 years ago

akij ▴ 190

I know this is a very common question, but my input list is huge, so it would be nice if I could submit the whole list at once and get the output.

I have a huge list of chromosomal positions in an Excel file that looks like the one below. I want to find the genes between each pair of start and end positions on each chromosome.

chr       start          end        prob_no
1   196792690   196829513       36823
1   248572774   248633821       61047
2   41011303    41024125        12822
2   52516787    52554404        376176
3   189645889   189653199        7310
3   193160173   193165114        4941

How can I get the gene lists? It would also be nice to get the output like this:

chr       start          end        prob_no  genes
1   196792690   196829513       36823      gene1, gene2.....

gene • 2.5k views

updated 5.9 years ago by arshiaurora ▴ 10 • written 8.2 years ago by akij ▴ 190

Hi akij,

If an answer was helpful, you should upvote it; if an answer resolved your question, you should mark it as accepted.

Cheers,
Wouter

8.2 years ago by WouterDeCoster 48k

OP has not been seen for >1 year. I'll toggle all upvoted answers as accepted.

5.9 years ago by Ram 45k

8.2 years ago

GenoMax 152k

Take a look at bedtools intersect (documentation), which you can use with a GTF file of gene annotations for your genome.


8.2 years ago

Pierre Lindenbaum 166k

I have a huge list of chromosomal positions in a excel file

"Huge" and "Excel" are antinomic.

Have a look at the UCSC Table Browser to get the intersection with a gene track: https://genome.ucsc.edu/goldenPath/help/hgTablesHelp.html#SimpleIntersection


score 1 · Accepted Answer · 2017-05-07

BEDOPS bedmap does exactly what you need, outputting results in the format you want:

$ bedmap --echo --echo-map-id-uniq positions.bed genes.bed > answer.bed

If you have gene annotations in GFF or GTF format, you can convert them via BEDOPS gff2bed or gtf2bed, e.g.:

$ bedmap --echo --echo-map-id-uniq positions.bed <(gff2bed < annotations.gff) > answer.bed

Or:

$ bedmap --echo --echo-map-id-uniq positions.bed <(gtf2bed < annotations.gtf) > answer.bed

The default criterion for mapping is one or more bases of overlap between reference and annotation intervals. Add --fraction-map 1 if you want to report only the gene annotations that fall entirely within the reference interval.
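To get from the spreadsheet to a positions.bed that bedmap can read, something like the following may work. It is only a sketch under a few assumptions: positions.tsv is a hypothetical tab-delimited export of the Excel sheet (rebuilt here from three rows of the question so the snippet runs on its own), genes.bed is assumed to be an already sorted BED file of gene annotations, and the --delim and --multidelim options are used as I understand them from the BEDOPS documentation.

```shell
# Hypothetical export of the spreadsheet: header row plus three rows taken
# from the question, deliberately out of order to show the sorting step.
printf '%s\t%s\t%s\t%s\n' \
  chr start end prob_no \
  3 189645889 189653199 7310 \
  1 196792690 196829513 36823 \
  2 41011303 41024125 12822 > positions.tsv

# Drop the header, keep the four columns tab-separated, then sort by
# chromosome (lexicographic) and by start and end (numeric).
# sort-bed from BEDOPS could be used for the sorting step instead.
awk -v OFS='\t' 'NR > 1 && NF >= 4 { print $1, $2, $3, $4 }' positions.tsv \
  | LC_ALL=C sort -k1,1 -k2,2n -k3,3n > positions.bed

# Run bedmap only if BEDOPS is installed and genes.bed (assumed name) exists.
# --delim puts a tab between the interval and its gene list;
# --multidelim separates the gene IDs with ", ".
if command -v bedmap >/dev/null 2>&1 && [ -f genes.bed ]; then
  bedmap --echo --echo-map-id-uniq --delim '\t' --multidelim ', ' \
    positions.bed genes.bed > answer.bed
fi
```

Two things to check before trusting the result: BED starts are 0-based, so if the spreadsheet coordinates are 1-based, subtract 1 from the start column in the awk step; and the chromosome names must match the annotation file (for example "1" versus "chr1").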

score 1 · Accepted Answer · 2019-08-21

You can do this in R with the biomaRt package. See my code below:

library("biomaRt")
listMarts()

#use ensembl 
ensembl=useMart("ensembl")

#BioMart databases can contain several datasets, for Ensembl every species is a different dataset. 

listDatasets(ensembl)

#use homo sapiens
ensembl = useDataset("hsapiens_gene_ensembl",mart=ensembl)

Say we want to find the genes on chromosome 10 between positions 86000000 and 87863438:

 getBM(c('ensembl_gene_id','hgnc_symbol','description'), filters=c('chromosome_name','start','end'), values=list(10,86000000 , 87863438), mart=ensembl)[1:5,]

  ensembl_gene_id hgnc_symbol
1 ENSG00000182771       GRID1
2 ENSG00000200487   RNA5SP322
3 ENSG00000199104      MIR346
4 ENSG00000287475
5 ENSG00000286359
                                                                            description
1 glutamate ionotropic receptor delta type subunit 1 [Source:HGNC Symbol;Acc:HGNC:4575]
2                  RNA, 5S ribosomal pseudogene 322 [Source:HGNC Symbol;Acc:HGNC:43222]
3                                      microRNA 346 [Source:HGNC Symbol;Acc:HGNC:31780]
4                                                                      novel transcript
5                                                                      novel transcript

For a list of chromosomal positions like yours, process each row in a for loop, passing chrom[i], start[i] and end[i] as the filter values to getBM.