Extracting Whole-Genome Information
2
0
Entering edit mode
7.4 years ago
Tom A ▴ 110

Hello,

This might be an obvious question but I cant seem to find it. I need to extract a list of every gene's specific defined sequence region of the entire human genome, in a large .fasta format.

As an example, the -500 to +500 region (relative to the TSS) of every gene in hg19.

I thought there was a way to do this with the UCSC Genome Browser's "Tables" function, but I can't seem to figure it out.

Any help is appreciated. Thanks in advance.

genome-browser fasta • 2.4k views
ADD COMMENT
2
Entering edit mode
7.4 years ago

If you want to automate this, you might try something like the following untested script.

Let's assume you have a sort-bed-sorted BED file called geneTSSs.bed that contains the coordinates of TSSs.

Let's also say you have a tabix-indexed bgzip file called hg19.gz and the associated tabix index file hg19.gz.tbi. The bgzip file is a compressed BED file containing four columns: chromosome, start, stop and an ID field containing the sequence data for that genome build, for that specified genomic range.

First, use bedops to get the BED regions that lie +/-500 bases around each TSS.

Second, pass each of those regions to awk to make a search variable called region, which is passed to tabix to search on hg19.gz. Each search result is appended to the file answer.fa:

$ bedops --range 500 --everything geneTSSs.bed \
    | awk '{ \
        region=$1":"$2"-"$3; \
        system("echo \>" region " >> answer.fa"); \
        system("tabix hg19.gz " region " | cut -f4 >> answer.fa"); \ 
    }' -

There's a clean-up step I haven't shown. Depending on the k-mer size of the sequences in hg19.gz, you may have some post-processing to do on each sequence in answer.fa, trimming away sequence from the beginning and end.

For instance, let's say that you have 1000-mers in hg19.gz. A sample region is chr2:1230-2650. The tabix search returns sequence data from chr2:1000-3000. So on the cleanup pass on answer.fa, we need to cut the first 230 bases and the last 350 bases. I haven't shown code for that, but that should be easy to whip up in Perl or Python.

Ref: The BEDOPS suite offers bedops and sort-bed. The samtools suite offers bgzip and tabix.

ADD COMMENT
0
Entering edit mode

Solid suggestion. I can definitely write a script to do it, I was just wondering if there was an interface that existed that could do it. I guess it was more of a theoretical question. Thanks though!

ADD REPLY
2
Entering edit mode
7.4 years ago

You can certainly accomplish this using the Table browser at UCSC. See your example query result here. Simply change output format to "sequence" and then select the flank and gene features you are interested in. You'll get your FASTA sequence as a result, not the feature coordinates.

ADD COMMENT
0
Entering edit mode

Thank you! I knew there was a way to do this on UCSC.

ADD REPLY

Login before adding your answer.

Traffic: 1521 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6