Hi all,
I've collected sequences from 5 species for 10 different genes. My method was to find the gene RefSeq numbers from my reference genome (Drosophila melanogaster) and type them into the search bar in the genome browser for the other species (other Drosophila species).
This has returned more than one sequence for each gene per species (e.g. if I'm looking for the gene HDAC4 in Drosophila simulans, it returns 3-4 sequences instead of the expected 1) which makes me wonder, which sequence should I pick? Is there an optimal method for doing this, or do you have any advice?
I'd really appreciate any help on this one!
Best wishes,
David
Hi GenoMax,
Thanks for the reply!
HDAC4 was a random gene off the top of my head, apologies for the confusion.
I'm looking for the gene, and when I search for it using the melanogaster RefSeq number in the UCSC genome browser, I get multiple genes on different chromosomes in, for instance, simulans.
I tried NCBI and the gene database but it seemed to only export the coding regions, and I'm interested in the full sequence. I also want to take 1000 bases upstream and downstream of the gene, which I don't think you can do in the gene database?
Anyway, this is a bit far from my original question, which is: if you have multiple sequences for one gene, how do you decide which to choose? Is there an example of someone doing this? I haven't been able to find anything.
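For the 1000 bases upstream and downstream mentioned above, one detail that is easy to get wrong is that "upstream" depends on the strand. Here is a minimal sketch of the coordinate arithmetic, assuming 0-based half-open coordinates (as in BED files) and a known chromosome or contig length. The function name `flank_intervals` and its parameters are made up for illustration and are not part of any browser or database tool.

```python
def flank_intervals(start, end, strand, chrom_len, flank=1000):
    """Return (upstream, downstream) flanks of a gene.

    Assumes 0-based, half-open coordinates (BED style). Each flank is a
    (start, end) tuple clamped to the range [0, chrom_len].
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    if not 0 <= start < end <= chrom_len:
        raise ValueError("gene coordinates fall outside the chromosome")

    left = (max(0, start - flank), start)          # lower coordinates
    right = (end, min(chrom_len, end + flank))     # higher coordinates

    # On the minus strand the gene reads right to left, so the region
    # upstream of the gene lies at the higher coordinates.
    return (left, right) if strand == "+" else (right, left)


upstream, downstream = flank_intervals(5000, 8000, "-", 100000)
print(upstream, downstream)
```

On a short contig the flanks are simply truncated at the contig ends, so it is worth checking their lengths before comparing regions across species.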
Unfortunately NCBI and Ensembl only carry the melanogaster genome. So you are going to be limited to UCSC or flybase.org (no longer free I think) for this.
I think your best bet is to grab the genes you need from melanogaster and then identify homologous regions from genome files you can download from UCSC: https://hgdownload.soe.ucsc.edu/downloads.html
Yeah, this is what I've already done. My question was: if you have multiple sequences for one gene in the same species and in different locations (i.e. different chromosomes), how do you decide which to choose?
You will have to do a careful analysis of them by doing sequence alignments to make sure they are real orthologs/paralogs. Other Drosophila genomes are probably nowhere near as complete as melanogaster, and that can prove a challenge.
What is the ultimate goal here? Has this analysis not been done by other fly people over the years?
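As a rough illustration of comparing candidates after alignment, here is a minimal sketch that scores one pairwise alignment by identity and by how much of the melanogaster sequence it covers. It assumes the two sequences have already been aligned by some external tool and are passed in as equal-length strings with "-" for gaps. The function name and the toy sequences are hypothetical, and high scores alone do not prove orthology.

```python
def identity_and_coverage(aligned_ref, aligned_cand):
    """Score a pairwise alignment given as two gapped, equal-length strings.

    identity: fraction of gap-free columns where the bases match
    coverage: fraction of reference bases aligned to a candidate base
    """
    if len(aligned_ref) != len(aligned_cand):
        raise ValueError("aligned sequences must have the same length")

    ref_len = sum(1 for base in aligned_ref if base != "-")
    aligned_cols = 0
    matches = 0
    for r, c in zip(aligned_ref.upper(), aligned_cand.upper()):
        if r == "-" or c == "-":
            continue  # skip columns with a gap in either sequence
        aligned_cols += 1
        if r == c:
            matches += 1

    identity = matches / aligned_cols if aligned_cols else 0.0
    coverage = aligned_cols / ref_len if ref_len else 0.0
    return identity, coverage


# Toy example: reference vs. one candidate hit
print(identity_and_coverage("ACGT-ACGT", "ACGTTAC--"))
```

A candidate with high identity but low coverage may be a fragment or a partial duplicate, which is the kind of hit that needs a closer look by eye.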
Ok, that makes sense. This is what I thought, but I wasn't sure if there was a more automated method. Yes you're right, but they seem to align fairly well.
We're looking at conservation in promoter and intronic regions of a particular set of genes regulated by a particular DNA binding protein. The data are supplementary to some expression observations we've made - I hope this makes sense.
If they align well then that should make your job a bit easier. Go for hits on the longest contigs, since those are more likely to be of good quality than hits on small contigs.
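The "longest contig" rule of thumb is easy to automate once the hits are in a table. Below is a minimal sketch, assuming each hit records the contig name, the contig length and an alignment identity. The `Hit` fields, the contig names and the numbers are made up for illustration, and using identity as a tie-breaker is an added assumption rather than part of the advice above.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    contig: str        # contig or chromosome the hit landed on
    contig_len: int    # total length of that contig in bases
    identity: float    # alignment identity to the melanogaster gene (0-1)


def pick_hit(hits):
    """Prefer the hit on the longest contig; break ties by identity."""
    if not hits:
        raise ValueError("no hits to choose from")
    return max(hits, key=lambda h: (h.contig_len, h.identity))


# Hypothetical hits for one gene in one species
hits = [
    Hit("chr2L", 23_000_000, 0.91),
    Hit("scaffold_17", 4_200, 0.93),
    Hit("chrX", 23_000_000, 0.95),
]
print(pick_hit(hits).contig)
```

This only ranks the candidates, so it still makes sense to inspect the alignments of the top hit before trusting it.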
Ok will do, thanks for the help!