I am looking for transcription factor motifs. I have a list of RefSeq IDs of genes that I am interested in. How would I export a multi-FASTA of all sequences from the TSS to 1000 bp upstream of the TSS?
I was thinking of loading hg18 or something similar into a database and cutting the regions out of it, but then I would need a mapping from RefSeq ID to genomic coordinates, and I am unsure where to find this information. I wouldn't mind a web-based tool that does the same thing.
I definitely endorse Galaxy for its flexibility in handling genome-coordinate-based data. If you want the coordinates of particular RefSeq transcripts (NM_xxxxxx), you can also extract them from the UCSC Table Browser:
select 'Table Browser' from the left-hand side panel
select mammal/human/hg18 from the top row of options
group: 'genes and gene prediction tracks'; track: 'RefSeq genes'
get output
You can load the resulting file into Galaxy and retrieve the lines of information you want by comparing your RefSeq IDs to the second column of the table browser data.
Just remember that txStart = TSS if the gene is on the + strand. txEnd = TSS if the gene is on the - strand.
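As a rough sketch of the strand rule above, here is how the upstream window could be computed in Python. It assumes the Table Browser output is tab-separated with the columns in the order bin, name, chrom, strand, txStart, txEnd (so the RefSeq ID is the second column) and that coordinates are 0-based and half-open, as in UCSC tables. The function names and the example rows are made up for illustration.

```python
def upstream_window(strand, tx_start, tx_end, size=1000):
    """Return (start, end) of the `size` bp immediately upstream of the TSS.

    Assumes UCSC-style 0-based, half-open coordinates.
    """
    if strand == "+":
        # TSS is txStart; upstream lies at smaller coordinates.
        return max(0, tx_start - size), tx_start
    if strand == "-":
        # TSS is txEnd; upstream lies at larger coordinates.
        return tx_end, tx_end + size
    raise ValueError("unexpected strand: %r" % strand)


def promoters_from_refgene(lines, wanted_ids, size=1000):
    """Yield BED-like tuples (chrom, start, end, name, score, strand)
    for rows whose RefSeq ID (second column) is in wanted_ids."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip the header and blank lines
        fields = line.rstrip("\n").split("\t")
        name, chrom, strand = fields[1], fields[2], fields[3]
        if name not in wanted_ids:
            continue
        start, end = upstream_window(strand, int(fields[4]), int(fields[5]), size)
        yield (chrom, start, end, name, 0, strand)


# Hypothetical rows, not real annotations.
example_rows = [
    "#bin\tname\tchrom\tstrand\ttxStart\ttxEnd",
    "585\tNM_000001\tchr1\t+\t5000\t9000",
    "586\tNM_000002\tchr2\t-\t20000\t26000",
    "587\tNM_000003\tchr3\t+\t400\t900",
]
promoters = list(promoters_from_refgene(example_rows, {"NM_000001", "NM_000002"}))
for record in promoters:
    print("\t".join(str(x) for x in record))
```

Note that the minus-strand window is not clipped to the chromosome length here, since this sketch has no chromosome sizes to check against.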
Once you have proper coordinates for your genes, you can use the flankBed and fastaFromBed utilities in BEDTools.
For example, assuming you have a BED file of your genes, you'd run flankBed with -s so that the upstream coordinates are computed relative to each gene's strand, and then pass the resulting intervals to fastaFromBed to get the sequences.
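To make the strand-aware flank-then-extract idea concrete, here is a small pure-Python imitation on a toy in-memory genome. This is not BEDTools itself, just a sketch of the logic: take the flank on the upstream side according to strand, clip it to the chromosome, and reverse-complement minus-strand sequences. The function names, the FASTA header format and the toy data are all assumptions for illustration.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse-complement a DNA string (A/C/G/T/N, either case)."""
    return seq.translate(COMPLEMENT)[::-1]


def flank_upstream(start, end, strand, size, chrom_length):
    """Strand-aware upstream flank of a gene interval (0-based, half-open),
    clipped to the chromosome boundaries."""
    if strand == "-":
        lo, hi = end, end + size      # upstream of a minus-strand gene
    else:
        lo, hi = start - size, start  # upstream of a plus-strand gene
    return max(0, lo), min(chrom_length, hi)


def promoters_to_fasta(genes, genome, size=1000):
    """Build a multi-FASTA string of upstream sequences.

    genes  : iterable of (chrom, start, end, name, strand)
    genome : dict mapping chromosome name to its sequence (toy-sized only)
    """
    records = []
    for chrom, start, end, name, strand in genes:
        lo, hi = flank_upstream(start, end, strand, size, len(genome[chrom]))
        if hi <= lo:
            continue  # gene sits at the chromosome edge; nothing upstream
        seq = genome[chrom][lo:hi]
        if strand == "-":
            seq = reverse_complement(seq)
        records.append(">%s::%s:%d-%d(%s)\n%s" % (name, chrom, lo, hi, strand, seq))
    return "\n".join(records) + "\n" if records else ""


# Toy data, purely hypothetical.
toy_genome = {"chrT": "AACCGGTTACGTAGCT"}
toy_genes = [
    ("chrT", 8, 12, "geneP", "+"),
    ("chrT", 2, 10, "geneM", "-"),
]
fasta = promoters_to_fasta(toy_genes, toy_genome, size=4)
print(fasta, end="")
```

For a real genome you would not hold the sequences in a dict like this; an indexed FASTA reader (or BEDTools itself) is the practical route.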
I have a sort of 'in-house' solution to this problem, because our organism is not well documented in the online databases and I wanted to do the same thing you describe. I stored all the annotation information in a MySQL database and used a Perl script to query it for the upstream coordinates. A second Perl script then took these coordinates as input and used the reference accessor functions of Bio::DB::Sam to pull the corresponding reference sequences from a BAM alignment for motif analysis. This solution is very flexible for non-model systems.
updated 5.2 years ago by Ram • written 13.2 years ago by Vitis
That's a much more detailed answer than mine :)