Dump Upstream Sequence
14.3 years ago
Ying W ★ 4.3k

I am looking for transcription factor motifs. I have a list of RefSeq IDs for genes that I am interested in. How would I export a multi-FASTA of the sequences from 1000 bp upstream of each TSS to the TSS?

I was thinking of loading hg18 (or similar) into a database and cutting the regions out of it, but then I would need a mapping from RefSeq ID to genomic coordinates, and I am unsure where to find this information. A web-based tool that does the same thing would also be fine.

fasta motif galaxy • 7.3k views
14.3 years ago
Neilfws 49k

You can use either Galaxy or BioMart for this purpose. Both resources require some exploration to use effectively, but they are reasonably intuitive.

Here are the steps for using BioMart:

  1. Go to BioMart and select "Martview" from the menu (top of page)
  2. Choose "Ensembl Genes 59" for the database and "Homo sapiens genes GRCh37" for dataset
  3. In left menu, choose "Filters" and expand the "Gene" section
  4. Check "ID list limit", choose "RefSeq DNA ID" and paste or upload your ID list
  5. In left menu, choose "Attributes", check "Sequences" and expand "SEQUENCES"
  6. Select what you want to retrieve (e.g. "Flank (Transcript)") and enter the upstream flank size (e.g. 1000)
  7. Choose any other attributes that you would like to be returned
  8. When ready, click "Results" (menu bar above the left menu)
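
The same selections can also be written as a BioMart XML query. The following is only a sketch: the internal names used here (`hsapiens_gene_ensembl`, `refseq_mrna`, `upstream_flank`, `transcript_flank`, `ensembl_transcript_id`) are assumptions about how the mart is configured and should be checked against the XML that MartView itself generates for your query. The function only builds the query text and does not contact any server.

```python
from xml.sax.saxutils import quoteattr


def build_biomart_query(refseq_ids, upstream_bp=1000,
                        dataset="hsapiens_gene_ensembl",   # assumed dataset name
                        id_filter="refseq_mrna",           # assumed filter name
                        seq_attribute="transcript_flank"):  # assumed attribute name
    """Build a BioMart-style XML query asking for upstream flank sequences in FASTA."""
    ids = ",".join(refseq_ids)
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<!DOCTYPE Query>',
        '<Query virtualSchemaName="default" formatter="FASTA" header="0" uniqueRows="1">',
        f'  <Dataset name={quoteattr(dataset)} interface="default">',
        # Restrict the query to the supplied ID list.
        f'    <Filter name={quoteattr(id_filter)} value={quoteattr(ids)}/>',
        # Size of the upstream flank in bp (assumed filter name).
        f'    <Filter name="upstream_flank" value="{int(upstream_bp)}"/>',
        # Sequence type to return, plus one ID for the FASTA header.
        f'    <Attribute name={quoteattr(seq_attribute)}/>',
        '    <Attribute name="ensembl_transcript_id"/>',
        '  </Dataset>',
        '</Query>',
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder accessions; substitute your own RefSeq ID list.
    print(build_biomart_query(["NM_000546", "NM_007294"]))
```

Comparing the printed text with the query that MartView shows for the same selections is a quick way to confirm the names before relying on it.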

That's a much more detailed answer than mine :)

14.3 years ago

You can retrieve upstream sequences for your RefSeq IDs with a BioMart search.

You may use the following parameters:

Dataset: Homo sapiens genes (GRCh37)
Filters: RefSeq IDs
Attributes: Sequences

Within Attributes you can set the upstream flank to 1000 bp and pick the sequence type you need (for example 5' UTR, Flank (Gene), etc.).

14.3 years ago
Ian 6.1k

I definitely endorse Galaxy for its flexibility in handling genome-coordinate-based data. If you want the coordinates of a particular RefSeq transcript (NM_xxxxxx), you can also extract them from the UCSC Table Browser.

  • http://genome.ucsc.edu/
  • select 'Table Browser' from the left-hand side panel
  • select mammal/human/hg18 from the top row of options
  • group: 'genes and gene prediction tracks'; track: 'RefSeq genes'
  • get output

You can load the resulting file into Galaxy and retrieve the lines of information you want by comparing your RefSeq IDs to the second column of the table browser data.

Just remember that txStart = TSS if the gene is on the + strand. txEnd = TSS if the gene is on the - strand.
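
As a rough sketch of that strand rule in Python: the code below assumes the Table Browser output is tab-separated with the columns bin, name, chrom, strand, txStart, txEnd in that order, and that the coordinates are 0-based, half-open. The function names are made up for illustration. Minus-strand intervals are not clipped to the chromosome length here, since that needs a chromosome sizes file.

```python
def upstream_interval(chrom, strand, tx_start, tx_end, name, flank=1000):
    """Return a BED-style interval covering `flank` bp upstream of the TSS."""
    if strand == "+":
        # TSS is txStart; upstream lies at smaller coordinates (clip at 0).
        start, end = max(0, tx_start - flank), tx_start
    elif strand == "-":
        # TSS is txEnd; upstream lies at larger coordinates.
        start, end = tx_end, tx_end + flank
    else:
        raise ValueError(f"unexpected strand {strand!r} for {name}")
    return (chrom, start, end, name, 0, strand)


def refgene_to_upstream_bed(lines, wanted_ids, flank=1000):
    """Yield upstream intervals for rows whose second column is in wanted_ids."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header row
        fields = line.rstrip("\n").split("\t")
        name, chrom, strand = fields[1], fields[2], fields[3]
        tx_start, tx_end = int(fields[4]), int(fields[5])
        if name in wanted_ids:
            yield upstream_interval(chrom, strand, tx_start, tx_end, name, flank)


if __name__ == "__main__":
    # Invented rows, only to show the shape of the input.
    demo = [
        "#bin\tname\tchrom\tstrand\ttxStart\ttxEnd",
        "1\tNM_000001\tchr1\t+\t5000\t9000",
        "2\tNM_000002\tchr2\t-\t5000\t9000",
    ]
    for rec in refgene_to_upstream_bed(demo, {"NM_000001", "NM_000002"}):
        print("\t".join(map(str, rec)))
```

The six output columns follow the BED layout (chrom, start, end, name, score, strand), so the result can be loaded into Galaxy or passed to other interval tools.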

13.2 years ago

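Once you have proper coordinates for your genes, you can use the flankBed and fastaFromBed utilities in BEDTools.

For example, assuming you have a BED file of your genes (with a strand column), you'd use the following. The -s in flankBed makes the upstream coordinates depend on each gene's strand, and the -s in fastaFromBed reverse-complements the sequence for minus-strand genes. flankBed also needs a chromosome sizes file via -g (genome.chrom.sizes is a placeholder name here).

flankBed -i genes.bed -g genome.chrom.sizes -s -l 1000 -r 0 | \
         fastaFromBed -fi genome.fa -bed stdin -s \
         > genes.1kb-upstream.fa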
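
For readers who want to see what the strand-aware flank and extraction steps amount to, here is a minimal pure-Python sketch. It is not how BEDTools is implemented. It assumes the genome fits in memory as a dict of chromosome name to sequence and that gene records are BED-style (0-based, half-open) tuples; the function names and the FASTA header layout are invented for illustration.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse-complement a DNA string (A, C, G, T, N in either case)."""
    return seq.translate(_COMPLEMENT)[::-1]


def upstream_fasta(genes, genome, flank=1000):
    """Return FASTA text with `flank` bp upstream of each gene.

    genes  : iterable of (chrom, start, end, name, strand) tuples
    genome : dict mapping chromosome name to its sequence
    """
    records = []
    for chrom, start, end, name, strand in genes:
        chrom_len = len(genome[chrom])
        if strand == "-":
            # Upstream of a minus-strand gene is to the right of its end.
            lo, hi = end, min(chrom_len, end + flank)
        else:
            # Upstream of a plus-strand gene is to the left of its start.
            lo, hi = max(0, start - flank), start
        seq = genome[chrom][lo:hi]
        if strand == "-":
            # Report the sequence as read along the gene's own strand.
            seq = reverse_complement(seq)
        records.append(f">{name}::{chrom}:{lo}-{hi}({strand})\n{seq}")
    return "\n".join(records) + "\n"


if __name__ == "__main__":
    toy_genome = {"chr1": "AAAACCCCGGGGTTTT"}  # invented sequence
    toy_genes = [("chr1", 4, 8, "g1", "+"), ("chr1", 4, 8, "g2", "-")]
    print(upstream_fasta(toy_genes, toy_genome, flank=5), end="")
```

For real genomes the BEDTools pipeline is the practical choice; the sketch only makes the coordinate and strand handling explicit.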
14.3 years ago
Treylathe ▴ 950

Galaxy would do this well. They have a bunch of tutorial screencasts and an introductory tutorial to get you started: http://www.usegalaxy.org http://www.openhelix.com/galaxy

13.2 years ago
Vitis ★ 2.6k

I have a sort of 'in-house' solution to this problem, because our organism is not well covered by the online databases and I wanted to do the same thing. I stored all annotation information in a MySQL database and used a Perl script to query it for the upstream coordinates. A second Perl script takes these coordinates as input and pulls the actual sequences for motif analysis from the reference of a BAM alignment, using the reference accessor functions of Bio::DB::Sam. This solution is very flexible for non-model systems.
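
The coordinate-lookup half of such a setup might look like the sketch below. It is not the poster's code: it swaps in Python and an in-memory SQLite database for Perl and MySQL so that it runs without a server, and the `genes` table with its column names is a hypothetical schema with invented rows. Coordinates are assumed to be 0-based, half-open.

```python
import sqlite3

# Strand-aware upstream window, computed in SQL (hypothetical schema).
UPSTREAM_SQL = """
SELECT gene_id, chrom, strand,
       CASE WHEN strand = '+' THEN MAX(tx_start - :flank, 0) ELSE tx_end END,
       CASE WHEN strand = '+' THEN tx_start ELSE tx_end + :flank END
FROM genes
WHERE gene_id = :gene_id
"""


def make_demo_db():
    """Create an in-memory annotation table with a few invented genes."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE genes (gene_id TEXT PRIMARY KEY, chrom TEXT, "
        "strand TEXT, tx_start INTEGER, tx_end INTEGER)"
    )
    conn.executemany(
        "INSERT INTO genes VALUES (?, ?, ?, ?, ?)",
        [
            ("geneA", "scaffold_1", "+", 2500, 4000),
            ("geneB", "scaffold_1", "-", 6000, 8000),
            ("geneC", "scaffold_2", "+", 300, 900),
        ],
    )
    return conn


def upstream_coords(conn, gene_ids, flank=1000):
    """Return (gene_id, chrom, strand, start, end) for each known gene ID."""
    rows = []
    for gene_id in gene_ids:
        params = {"flank": flank, "gene_id": gene_id}
        rows.extend(conn.execute(UPSTREAM_SQL, params).fetchall())
    return rows


if __name__ == "__main__":
    for row in upstream_coords(make_demo_db(), ["geneA", "geneB", "geneC"]):
        print(row)
```

The returned coordinates would then be handed to whatever fetches the sequence, which in the setup described above is the second script built on Bio::DB::Sam.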
