Question: Extracting Whole-Genome Information
gravatar for Tom A
6.4 years ago by
Tom A100
United States
Tom A100 wrote:


This might be an obvious question but I cant seem to find it. I need to extract a list of every gene's specific defined sequence region of the entire human genome, in a large .fasta format.

As an example, the -500 to +500 region (relative to the TSS) of every gene in hg19.

I thought there was a way to do this with the UCSC Genome Browser's "Tables" function, but I can't seem to figure it out.

Any help is appreciated. Thanks in advance.

fasta genome-browser • 2.2k views
ADD COMMENTlink modified 6.4 years ago by Matt Shirley9.4k • written 6.4 years ago by Tom A100
gravatar for Alex Reynolds
6.4 years ago by
Alex Reynolds30k
Seattle, WA USA
Alex Reynolds30k wrote:

If you want to automate this, you might try something like the following untested script.

Let's assume you have a sort-bed-sorted BED file called geneTSSs.bed that contains the coordinates of TSSs.

Let's also say you have a tabix-indexed bgzip file called hg19.gz and the associated tabix index file hg19.gz.tbi. The bgzip file is a compressed BED file containing four columns: chromosome, start, stop and an ID field containing the sequence data for that genome build, for that specified genomic range.

First, use bedops to get the BED regions that lie +/-500 bases around each TSS.

Second, pass each of those regions to awk to make a search variable called region, which is passed to tabix to search on hg19.gz. Each search result is appended to the file answer.fa:

$ bedops --range 500 --everything geneTSSs.bed \
    | awk '{ \
        region=$1":"$2"-"$3; \
        system("echo \>" region " >> answer.fa"); \
        system("tabix hg19.gz " region " | cut -f4 >> answer.fa"); \ 
    }' -

There's a clean-up step I haven't shown. Depending on the k-mer size of the sequences in hg19.gz, you may have some post-processing to do on each sequence in answer.fa, trimming away sequence from the beginning and end.

For instance, let's say that you have 1000-mers in hg19.gz. A sample region is chr2:1230-2650. The tabix search returns sequence data from chr2:1000-3000. So on the cleanup pass on answer.fa, we need to cut the first 230 bases and the last 350 bases. I haven't shown code for that, but that should be easy to whip up in Perl or Python.

Ref: The BEDOPS suite offers bedops and sort-bed. The samtools suite offers bgzip and tabix.

ADD COMMENTlink modified 6.4 years ago • written 6.4 years ago by Alex Reynolds30k

Solid suggestion. I can definitely write a script to do it, I was just wondering if there was an interface that existed that could do it. I guess it was more of a theoretical question. Thanks though!

ADD REPLYlink written 6.4 years ago by Tom A100
gravatar for Matt Shirley
6.4 years ago by
Matt Shirley9.4k
Cambridge, MA
Matt Shirley9.4k wrote:

You can certainly accomplish this using the Table browser at UCSC. See your example query result here. Simply change output format to "sequence" and then select the flank and gene features you are interested in. You'll get your FASTA sequence as a result, not the feature coordinates.

ADD COMMENTlink written 6.4 years ago by Matt Shirley9.4k

Thank you! I knew there was a way to do this on UCSC.

ADD REPLYlink written 6.4 years ago by Tom A100
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1421 users visited in the last hour