How Find Genes On Specific Positions
5
3
Entering edit mode
12.3 years ago
Assa Yeroslaviz ★ 1.8k

Hello,

I have a table for chromosomal positioning fro the human genome of the following structure:

chromosome   start    end   
chrY      3447096   6114264
chrY      6114264   6733959
chrY      6733959   7142013
chrY      7142013   9175071
...

I would like to find for each of these regions the genes which overlaps these region, independent of completely or only a partial overlap.

Can someone please recommend a way to it. Is it possible to do such analysis using R or a direct query of the UCSC genome broswer. The problem with UCSC is that there are no genome coordinates table to download, only CDS or transcripts. I tried to run biomart, but I keep getting an error message.

I will appreciate any help and ideas

Thanks A.

human gene overlap • 11k views
ADD COMMENT
4
Entering edit mode

"The problem with UCSC is that there are no genome coordinates table to download, only CDS or transcripts." huh ? http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/knownGene.txt.gz

ADD REPLY
1
Entering edit mode

Which BioMart are you using, what is your exact query and what error message do you get?

ADD REPLY
0
Entering edit mode

FYI, I just tried BioMart with the four region you mention as input and it works perfectly fine ....

ADD REPLY
0
Entering edit mode

FYI, I just tried BioMart with the four regions you mentioned as input and it works perfectly fine for me ....

ADD REPLY
0
Entering edit mode

I was using the biomart web browser and got just the following message: " Error has occurred" with no other specifications.

I will try to run it with the R package on the command line and see what will happens.

ADD REPLY
4
Entering edit mode
12.3 years ago

I would point you to the UCSC Table Browser. Most people get a bit confused about the interface here, but what you want to do is specify a group, track, and table. The default is UCSC genes, and the table is "knownGene", which will give you what you want. You might also consider using RefSeq, for more conservative and curated results.

From your table of coordinates, you can select a region such as "chrY:3447096-6114264". At this point you have limited your region of interest to the above coordinates.

Now, click "summary/statistics" to get a count of items in that region, or choose "get output" to give you the data describing the features in that region of the table you have selected. You can also send your output to a BED file, or to Galaxy.

You can also supply an entire BED file containing your regions, and then use the "intersection" tool to get information for all of your regions at once.

ADD COMMENT
3
Entering edit mode
12.1 years ago
Francy ▴ 30

I have the same problem and I am trying to solve this all in R by doing this:

  1. Use BiomaRt to get positions of all genes:

genes<-getBM(c("hgnc_symbol","ensembl_gene_id","chromosome_name","start_position","end_position"), mart=mart)

  1. Use genomicRanges to find the overlap between my dataset called "probes" and the output of BiomaRt. I still have not figured out why my overlap is not working though, but my problem is that the interval matching returns more fields than one per probe.

I'll reproduce the code I am trying here if anybody has suggestions on how to improve it. Hope it helps.

probes<- data.frame(gene=c(8503721, 352341, 251113), chrom=c(2,2,11),probe_start=c(213865547, 1636127, 131062588), probe_end=c(213865606, 1636176, 131062647))

library(GenomicRanges)

probes_int <- GRanges(seqnames = Rle(probes$chrom),ranges = IRanges(start = probes$probe_start, end = probes$probe_end), names = probes$gene)

genes_int <- GRanges(seqnames = Rle(genes$chromosome_name), ranges = IRanges(start = genes$start_position, end = genes$end_position), names = genes$hgnc_symbol)

overlaps<- findOverlaps(probes_int, genes_int, type="within")
ADD COMMENT
0
Entering edit mode

how to instal GenomicRanges packages? I did not find it in R studio

ADD REPLY
0
Entering edit mode

Just do

BiocManager::install("GenomicRanges")

from within Rstudio.

ADD REPLY
2
Entering edit mode
12.3 years ago

Multiple potential solutions.

  • The GenomicRanges and GenomicFeatures Bioconductor packages and rtracklayer to load your data in BED format.
  • The Galaxy server.
  • Directly at UCSC via a custom track and then get overlapping features via table browser.
  • After downloading your genes of interest from UCSC or biomart, you could use BEDtools to do the overlap. If you are python person, think about bx-python or pybedtools.
ADD COMMENT
0
Entering edit mode

Can you give me a hint as toh ow to use the GenomicRanges for that?

I tried it and built two GRanges objects, but got nothing further out of it. I would like to have the names of the genes in these regions, but somehow can't get it into the commands/workflow

ADD REPLY
2
Entering edit mode
12.3 years ago
Rm 8.3k

you can use lh3's perl script batchUCSC.pl to get the information you wanted.

cat input.bed| ./batchUCSC.pl -d hg19 -p refGeneBrief > geneinfo.txt
ADD COMMENT
1
Entering edit mode
12.1 years ago
Bert Overduin ★ 3.7k

If there is no definite need to do this with R, you can also use the Ensembl Perl Core API for this. On the Ensembl website you can find more information and the installation instructions.

An example script:

#!/usr/bin/perl

use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';

## Load the databases into the registry
$registry->load_registry_from_db( -host => 'ensembldb.ensembl.org', -user => 'anonymous' );

## Get the slice adaptor for human
my $slice_adaptor = $registry->get_adaptor( 'Human', 'Core', 'Slice' );

## Get the slice for your region of interest
my $slice = $slice_adaptor->fetch_by_region( 'chromosome', 'Y', 3447096, 6114264 );

## Get all genes overlapping the slice
my $genes = $slice->get_all_Genes;

## Print info about the genes
while( my $gene = shift @$genes){
  print
    $gene->stable_id, "\t",
    $gene->external_name, "\t",
    $gene->seq_region_name, "\t",
    $gene->seq_region_start, "\t",
    $gene->seq_region_end, "\t",
    $gene->seq_region_strand, "\n";
}

farm2-head1 /~/scripts/ perl genes_from_region.pl 
ENSG00000176679 TGIF2LY Y   3447082 3448082 1
ENSG00000232927 USP12PY Y   3550846 3551909 1
ENSG00000225653 RNF19BPY    Y   3646038 3647587 -1
ENSG00000218410 AC012078.2.1    Y   3719265 3720910 -1    
etc. etc.
ADD COMMENT

Login before adding your answer.

Traffic: 2946 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6