Getting Bed File Output From Ucsc Via Mysql
Entering edit mode
10.2 years ago
Ryan D ★ 3.4k

Trying to get regions for sequencing, we want to pull all ESTs/exons in a given region for SureSelect capture.

Right now, when I try to pull by mySQL, it gives me this information in a rather unweildly format. I just want to get the BED file output from tables like "all_est" which I could then merge in Galaxy. In other words, I want to get one line per "block" instead of the standard output which comma-delimits all starts and ends. Looking for something like below. Any ideas?

   track name="tb_all_est" description="table browser query on all_est" visibility=3 url= 
chr10    123227833    123228018    AI379298_exon_0_0_chr10_123227834_f    0    +
chr10    123227833    123228125    AI042628_exon_0_0_chr10_123227834_f    0    +
chr10    123227833    123228288    AI076491_exon_0_0_chr10_123227834_f    0    +
chr10    123227833    123228353    AW514184_exon_0_0_chr10_123227834_f    0    +
chr10    123227833    123228483    CB305736_exon_0_0_chr10_123227834_f    0    +
chr10    123227833    123228310    AI559529_exon_0_0_chr10_123227834_f    0    +
chr10    123228311    123228368    AI559529_exon_1_0_chr10_123228312_f    0    +
ucsc est mysql bed • 3.2k views
Entering edit mode
10.2 years ago

Use the mysql -B switch to get batch (tab-delimited) output and the -N option to suppress column headers:

mysql -N -B -h -A -u genome -D hg19 -e \ 
'select tname, tStart, tEnd from all_est limit 10'

chr1    14405   14592
chr1    14509   14778
chr1    14654   15884
chr1    15799   17055
chr1    16441   16763
chr1    16441   18056
chr1    16618   17298
chr1    16669   17296
chr1    16696   24803
chr1    16745   16934

Just put whatever columns you require for your BED flavour in the query and redirect (appending) to a file which already contains the BED header.

EDIT: As Casey has pointed out, if you really must use UCSC and MySQL, you will have a problem with those pesky composite values. MySQL doesn't have a built-in string-splitting function and even if it did, the resulting query would be horrible.

If you still want to stick with the UCSC and MySQL combo, you could pipe the query result through your preferred scripting language. The script would need to split the composite field(s) in each line and immediately emit several lines containing the mixture of original and decomposed values.

Entering edit mode
10.1 years ago

The tool that you need is called bedToExons and is part of the UCSC source tools. It will extract the blocks from the ,-split values.


The ,-sep-format is not very convenient for what you want to do. But it's a very compact format.

Entering edit mode
10.2 years ago

If I understand this question correctly it is not so much about the format of the SQL results, as about the manner in which exons of a transcript are represented at UCSC. For mRNAs and ESTs, UCSC represents all exons of a transcript as a "block" on one line (the PSL format), with tStarts and blockSizes as two lists of comma separated values.

While Keith's query will give the chr, start and end of the entire mRNA/EST span, it won't give the individual blocks as one line for multi-exonic entities, e.g.:

mysql -B -h -A -u genome -D hg19 -e 'select tname, tStart, tEnd, tStarts, blockSizes from all_est limit 10'

tName   tStart  tEnd    tStarts blockSizes
chr1    14405   14592   14405,  187,
chr1    14509   14778   14509,14599,    89,179,
chr1    14654   15884   14654,14784,14969,15795,15846,  129,45,69,50,38,
chr1    15799   17055   15799,15820,15827,15839,15845,16868,    12,7,12,5,60,187,
chr1    16441   16763   16441,  322,
chr1    16441   18056   16441,17656,17914,  188,86,142,
chr1    16618   17298   16618,16856,16909,16921,17232,  150,52,11,134,66,
chr1    16669   17296   16669,16731,16856,16909,16921,16966,17094,17236,    59,37,52,11,44,89,4,60,
chr1    16696   24803   16696,17617,17710,17914,18267,24737,    42,92,32,147,99,66,
chr1    16745   16934   16745,16857,16911,16925,    20,53,13,9,

To get each exon out as one line requires splitting the tStarts and blockSizes lists and adding blockSizes to tStarts to get tEnds. This is one downside of the way genes are modeled in UCSC, and where Ensembl is better since it stores each exon as a separate entity with its own id.

One way to get each exon in a given region as separate lines in BED-like tab-delimited format is to use BioMart's MartService with this script and the following XML:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Query>
<Query  virtualSchemaName = "default" formatter = "TSV" header = "0" uniqueRows = "0" count = "" datasetConfigVersion = "0.6" >

    <Dataset name = "hsapiens_gene_ensembl" interface = "default" >
        <Filter name = "chromosome_name" value = "1"/>
        <Filter name = "end" value = "30000"/>
        <Filter name = "start" value = "10000"/>
        <Attribute name = "chromosome_name" />
        <Attribute name = "exon_chrom_start" />
        <Attribute name = "exon_chrom_end" />
        <Attribute name = "ensembl_exon_id" />

Of course you'll need to modify your species & chromosomal regions and append the "chr" to the chromosome name to make it proper BED format.

Entering edit mode

And I thought I had got away with quietly "forgetting" the composite values!


Login before adding your answer.

Traffic: 1745 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6