Question

Extracting Whole-Genome Information

0

Entering edit mode

10.1 years ago

Tom A ▴ 110

Hello,

This might be an obvious question but I cant seem to find it. I need to extract a list of every gene's specific defined sequence region of the entire human genome, in a large .fasta format.

As an example, the -500 to +500 region (relative to the TSS) of every gene in hg19.

I thought there was a way to do this with the UCSC Genome Browser's "Tables" function, but I can't seem to figure it out.

Any help is appreciated. Thanks in advance.

genome-browser fasta • 3.0k views

ADD COMMENT • link updated 10.1 years ago by Matt Shirley 10k • written 10.1 years ago by Tom A ▴ 110

score 2 · Answer 1 · 2014-03-21

If you want to automate this, you might try something like the following untested script.

Let's assume you have a sort-bed-sorted BED file called geneTSSs.bed that contains the coordinates of TSSs.

Let's also say you have a tabix-indexed bgzip file called hg19.gz and the associated tabix index file hg19.gz.tbi. The bgzip file is a compressed BED file containing four columns: chromosome, start, stop and an ID field containing the sequence data for that genome build, for that specified genomic range.

First, use bedops to get the BED regions that lie +/-500 bases around each TSS.

Second, pass each of those regions to awk to make a search variable called region, which is passed to tabix to search on hg19.gz. Each search result is appended to the file answer.fa:

$ bedops --range 500 --everything geneTSSs.bed \
    | awk '{ \
        region=$1":"$2"-"$3; \
        system("echo \>" region " >> answer.fa"); \
        system("tabix hg19.gz " region " | cut -f4 >> answer.fa"); \ 
    }' -

There's a clean-up step I haven't shown. Depending on the k-mer size of the sequences in hg19.gz, you may have some post-processing to do on each sequence in answer.fa, trimming away sequence from the beginning and end.

For instance, let's say that you have 1000-mers in hg19.gz. A sample region is chr2:1230-2650. The tabix search returns sequence data from chr2:1000-3000. So on the cleanup pass on answer.fa, we need to cut the first 230 bases and the last 350 bases. I haven't shown code for that, but that should be easy to whip up in Perl or Python.

Ref: The BEDOPS suite offers bedops and sort-bed. The samtools suite offers bgzip and tabix.

score 2 · Answer 2 · 2014-03-21

2

Entering edit mode

10.1 years ago

Matt Shirley 10k

You can certainly accomplish this using the Table browser at UCSC. See your example query result here. Simply change output format to "sequence" and then select the flank and gene features you are interested in. You'll get your FASTA sequence as a result, not the feature coordinates.