Get the genes between certain chromosomal positions
7.0 years ago
akij ▴ 180

I know this is a very common question, but my input list is huge, so it would be nice if I could submit the whole list at once and get the output for all of it.

I have a huge list of chromosomal positions in an Excel file that looks like the one below. I want to find the genes between each start and end position on each chromosome.

chr   start       end         prob_no
1     196792690   196829513   36823
1     248572774   248633821   61047
2     41011303    41024125    12822
2     52516787    52554404    376176
3     189645889   189653199   7310
3     193160173   193165114   4941

How can I get the gene lists? It would also be nice if I could get the output like this:

chr   start       end         prob_no   genes
1     196792690   196829513   36823     gene1, gene2.....

Hi akij,

If an answer was helpful, you should upvote it; if an answer resolved your question, you should mark it as accepted.

Cheers,
Wouter


OP has not been seen for >1 year. I'll toggle all upvoted answers as accepted.

6.9 years ago

BEDOPS bedmap does exactly what you need, outputting results in the format you want:

$ bedmap --echo --echo-map-id-uniq positions.bed genes.bed > answer.bed

If you have gene annotations in GFF or GTF format, you can convert them via BEDOPS gff2bed or gtf2bed, e.g.:

$ bedmap --echo --echo-map-id-uniq positions.bed <(gff2bed < annotations.gff) > answer.bed

Or:

$ bedmap --echo --echo-map-id-uniq positions.bed <(gtf2bed < annotations.gtf) > answer.bed

The default criterion for mapping is one or more bases of overlap between the reference and annotation intervals. Add --fraction-map 1 if you want to report only the gene annotations that fall entirely within the reference interval.
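To make the two mapping criteria concrete, here is a minimal pure-awk sketch of the same kind of lookup. It is not how bedmap is implemented, only an illustration. The function name genes_per_interval, its "any"/"within" modes, and the four-column tab-separated layout of both files are assumptions made for this example, and coordinates are assumed to be BED-style (0-based, half-open). It joins gene names with ", " as in the question; bedmap itself separates fields with | and IDs with ; by default, which --delim and --multidelim should let you change.

```shell
# Hypothetical stand-in for the bedmap call above; not part of BEDOPS.
# Usage: genes_per_interval POSITIONS GENES [MODE]
#   POSITIONS: tab-separated chr, start, end, prob_no
#   GENES:     tab-separated chr, start, end, gene name
#   MODE:      "any" (default; at least one shared base) or
#              "within" (gene lies entirely inside the interval)
# Assumes half-open coordinates and a non-empty GENES file.
genes_per_interval() {
    awk -F '\t' -v OFS='\t' -v mode="${3:-any}" '
        # First file on the command line (genes): store every record.
        NR == FNR {
            n++
            gchr[n] = $1; gstart[n] = $2 + 0; gend[n] = $3 + 0; gname[n] = $4
            next
        }
        # Second file (positions): collect the names of qualifying genes.
        {
            list = ""
            split("", seen)    # reset the per-interval de-duplication table
            for (i = 1; i <= n; i++) {
                # Compare chromosome names as strings, not numbers.
                if ((gchr[i] "") != ($1 "")) continue
                if (mode == "within") {
                    hit = (gstart[i] >= $2 + 0 && gend[i] <= $3 + 0)
                } else {
                    hit = (gstart[i] < $3 + 0 && gend[i] > $2 + 0)
                }
                if (hit && !(gname[i] in seen)) {
                    seen[gname[i]] = 1
                    list = (list == "" ? "" : list ", ") gname[i]
                }
            }
            print $1, $2, $3, $4, list
        }
    ' "$2" "$1"
}
```

Calling genes_per_interval positions.bed genes.bed prints each interval followed by its gene list; adding within as a third argument mimics the stricter --fraction-map 1 behaviour. This sketch scans every gene for every interval, so for a genuinely huge list bedmap on sorted input will be far faster.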

4.7 years ago
arshiaurora ▴ 10

You can do this in R with the biomaRt package. See my code below:

library("biomaRt")
listMarts()

# Use the Ensembl mart.
ensembl <- useMart("ensembl")

# BioMart databases can contain several datasets; for Ensembl, every species is a separate dataset.
listDatasets(ensembl)

# Select the Homo sapiens gene dataset.
ensembl <- useDataset("hsapiens_gene_ensembl", mart = ensembl)

# Say we want to find the genes on chromosome 10 between 86000000 and 87863438
# (showing only the first five rows of the result):
getBM(attributes = c('ensembl_gene_id', 'hgnc_symbol', 'description'),
      filters = c('chromosome_name', 'start', 'end'),
      values = list(10, 86000000, 87863438),
      mart = ensembl)[1:5, ]

  ensembl_gene_id hgnc_symbol
1 ENSG00000182771       GRID1
2 ENSG00000200487   RNA5SP322
3 ENSG00000199104      MIR346
4 ENSG00000287475
5 ENSG00000286359
                                                                            description
1 glutamate ionotropic receptor delta type subunit 1 [Source:HGNC Symbol;Acc:HGNC:4575]
2                  RNA, 5S ribosomal pseudogene 322 [Source:HGNC Symbol;Acc:HGNC:43222]
3                                      microRNA 346 [Source:HGNC Symbol;Acc:HGNC:31780]
4                                                                      novel transcript
5                                                                      novel transcript

For a list of chromosome positions like yours, process each row in a for loop, passing chrom[i], start[i] and end[i] as the values in the getBM call.


Really nice solution - one quick suggestion: I recommend one of the apply functions instead of a loop.

7.0 years ago
GenoMax 141k

Take a look at bedtools intersect (see its documentation), which you can use with a GTF file of gene annotations for your genome.
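bedtools expects interval files such as BED rather than a spreadsheet, so the table first has to be exported from Excel as delimited text and stripped of its header. A minimal sketch of that step follows. The helper name table_to_bed, the optional chromosome prefix argument, and the file names in the usage note are assumptions for illustration. Coordinates are passed through unchanged, so if your positions are 1-based you may need to subtract 1 from the start to get proper BED.

```shell
# Hypothetical helper (not part of bedtools or BEDOPS): read a
# whitespace-delimited table with one header line (chr start end prob_no)
# on stdin and write sorted, tab-separated BED-like lines to stdout.
# $1 is an optional chromosome prefix, e.g. "chr", for annotation files
# that use UCSC-style chromosome names.
table_to_bed() {
    awk -v prefix="${1:-}" 'BEGIN { OFS = "\t" }
        { sub(/\r$/, "") }    # drop Windows line endings left by Excel
        NR > 1 && NF >= 4 { print prefix $1, $2, $3, $4 }' |
    # Chromosome in byte order, then start and end numerically.
    LC_ALL=C sort -k1,1 -k2,2n -k3,3n
}
```

With the table saved as positions.txt, table_to_bed < positions.txt > positions.bed produces the input file, and bedtools intersect -a positions.bed -b genes.gtf -wa -wb then reports each interval next to every annotation record it overlaps. The chromosome names must match between the two files (for example 1 versus chr1), which is what the optional prefix is for.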

7.0 years ago

"I have a huge list of chromosomal positions in an Excel file"

"Huge" and "Excel" are antinomic.

Have a look at the UCSC Table Browser to get the intersection with a gene track: https://genome.ucsc.edu/goldenPath/help/hgTablesHelp.html#SimpleIntersection
