Question: How to scrape data from UCSC genome browser?
1
4.9 years ago by
ajstern10
United States
ajstern10 wrote:

I want to compare the quality of different human genome assemblies by looking at their inclusion of the RefSeq genes.

On the UCSC browser I can look up the locations of RefSeq genes by their accession numbers in any assembly, for example: https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg16&position=chr22%3A17007506-17034714&hgsid=381962759_r6crkXh3VMlFtCa2rnXaO5BTAjcH.

However, as a newbie at programming in general, I'm unsure how to scrape whether each gene is included in a given assembly or not. Does anyone have tips?

ucsc refseq browser assembly genome • 2.0k views
modified 4.9 years ago by Bert Overduin3.6k • written 4.9 years ago by ajstern10
4
4.9 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum120k wrote:

Download the raw data:

http://hgdownload.cse.ucsc.edu/goldenPath/hg16/database/refGene.txt.gz

http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/refGene.txt.gz
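
Once the dumps are saved locally, checking which RefSeq accessions each assembly contains takes only a few lines. Here is a minimal sketch in Python. It assumes the files are tab-separated with the accession in the second column (after a leading bin column) or, in dumps without a bin column, in the first. The function names and the accession pattern are my own choices for illustration, not anything UCSC provides.

```python
import gzip
import re

# Assumed shape of a RefSeq accession, e.g. NM_000014 or NR_024540.2
ACCESSION = re.compile(r"^[NX][MR]_\d+(\.\d+)?$")


def accessions_from_lines(lines):
    """Collect the RefSeq accessions found in tab-separated refGene rows."""
    found = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Assumption: the accession is column 2 (after 'bin'), or column 1
        # if the dump has no 'bin' column, so look at the first two fields.
        for field in fields[:2]:
            if ACCESSION.match(field):
                found.add(field)
                break
    return found


def accessions_from_file(path):
    """Read a gzipped refGene dump and return its set of accessions."""
    with gzip.open(path, "rt") as handle:
        return accessions_from_lines(handle)


def missing_per_assembly(assemblies):
    """For each assembly, list accessions seen in any other assembly but not in it."""
    union = set().union(*assemblies.values())
    return {name: union - accs for name, accs in assemblies.items()}
```

With the two files above saved under hypothetical local names, something like `missing_per_assembly({"hg16": accessions_from_file("hg16_refGene.txt.gz"), "hg19": accessions_from_file("hg19_refGene.txt.gz")})` would report, per assembly, the accessions it lacks relative to the other.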

3
4.9 years ago by
Devon Ryan90k
Freiburg, Germany
Devon Ryan90k wrote:

Instead of pissing off UCSC by throwing hundreds of thousands of queries their way, why not just download the various annotation tables (via the Table Browser or the FTP site) and simply process those? That would be vastly simpler than scraping a bunch of web pages.

2
4.9 years ago by
UCSC
Maximilian Haeussler1.3k wrote:

Don't scrape it. You can query their MySQL server directly, passing the assembly's database name (hg19 here, as an example) as the last argument:

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -NB -e 'select * from refGene' hg19
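
To run that kind of query for several assemblies from a script, one option is to call the mysql client through Python's subprocess module and parse its tab-separated batch output. This is only a rough sketch. It assumes the mysql command-line client is installed and that each assembly is available as a database on the public server, with the database name given as the final argument. The helper names are hypothetical, and the query selects only the name column to keep the result small.

```python
import subprocess

HOST = "genome-mysql.cse.ucsc.edu"  # host taken from the command above


def build_command(database, query):
    """Return the argv list for a batch-mode query against one assembly database."""
    return [
        "mysql",
        "--user=genome",
        f"--host={HOST}",
        "-NB",       # -N: no column names, -B: tab-separated batch output
        "-e", query,
        database,    # assembly database, e.g. "hg19" (example value)
    ]


def parse_rows(output):
    """Split -NB output into rows, each a list of column strings."""
    return [line.split("\t") for line in output.splitlines() if line]


def refseq_names(database):
    """Fetch the distinct refGene accessions for one assembly (needs network access)."""
    result = subprocess.run(
        build_command(database, "select distinct name from refGene"),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the client reports an error
    )
    return {row[0] for row in parse_rows(result.stdout)}
```

For example, `refseq_names("hg16") - refseq_names("hg19")` would give the accessions present in the hg16 table but absent from the hg19 one, using one query per assembly.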

I agree. I use MySQL for the same job: it's very easy to use, and you can also download RefSeq to your own computer. It requires almost no computing power and works fine on a laptop too.

written 4.9 years ago by madkitty580
0
4.9 years ago by
Bert Overduin3.6k
Edinburgh Genomics, The University of Edinburgh
Bert Overduin3.6k wrote:

As a former Ensembl team member, I just want to emphasise that scraping websites is absolutely NOT DONE!!! I know of people who were scraping the Ensembl Genome Browser website and were given IP bans because of this (which were lifted again after Ensembl spoke with them and told them how to get the desired data without slowing down / bringing down their production webservers). So, please be aware of this! As already indicated in the other responses, there are many other ways to get the UCSC data (mysql, downloads, Table browser).
