Hi All,
I have a set of genes (Entrez gene IDs) for which I need promoter sequences. My genes are from a non-model organism (Sus scrofa, pig) and are based on an earlier genome build (Sscrofa9.2). I have seen Yuri's post and the answers on retrieving promoter sequences, and I like the idea of retrieving sequences from Ensembl BioMart.
But my concerns are:
1. What about genes that cannot be mapped to Ensembl?
2. If a gene (Entrez gene ID) maps to more than one Ensembl gene ID, which one should I take?
What I plan to do is:
1. Retrieve the GenBank and FASTA files from the NCBI genome FTP site.
2. Parse the GenBank files for gene coordinates.
3. Extract the 1,000 bp region upstream of each gene's start position from the chromosome FASTA file.
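To make step 3 concrete, here is a rough sketch in plain Python of what I have in mind. It assumes the gene coordinates are 1-based and inclusive, as in a GenBank feature table, and that the strand is known for each gene (GenBank writes minus-strand features as complement(start..end)). The function names and the +1/-1 strand convention are my own placeholders, not from any library, and reading the sequence from the FASTA file is left out.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T, N only)."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


def upstream_region(chrom_seq, gene_start, gene_end, strand, length=1000):
    """Return up to `length` bases upstream of a gene, read 5'->3' on the gene's strand.

    chrom_seq  : chromosome sequence as a plain string (forward strand)
    gene_start : 1-based, inclusive, leftmost coordinate on the forward strand
    gene_end   : 1-based, inclusive, rightmost coordinate on the forward strand
    strand     : +1 for the forward strand, -1 for the reverse strand (assumed convention)
    """
    if strand == 1:
        # Upstream lies to the left of the gene; clip at the chromosome start.
        lo = max(0, gene_start - 1 - length)
        hi = gene_start - 1
        return chrom_seq[lo:hi]
    if strand == -1:
        # Upstream lies to the right of the gene on the forward strand;
        # clip at the chromosome end, then reverse-complement.
        lo = gene_end
        hi = min(len(chrom_seq), gene_end + length)
        return reverse_complement(chrom_seq[lo:hi])
    raise ValueError("strand must be +1 or -1")
```

For a minus-strand gene the transcription start is at the rightmost coordinate, so the "upstream" region sits after gene_end on the forward strand and has to be reverse-complemented. Genes near a chromosome end simply return a shorter sequence.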
But before that, does anybody have an easier suggestion? If I parse the GenBank and FASTA files, is it possible to get the sense/antisense strand information for each gene? And what length of upstream sequence should I take?
Thank you in advance.