I'm trying to find a certain gene in a non-model organism. When I BLAST the gene sequence from the closest related organism I can find on NCBI (same phylum but different class), the best hit has only a 30% match. I've tried this on two different transcriptomes and got similar results (I know the transcriptomes are not comprehensive; they have issues and gaps).
I've also compiled a protein MSA of this gene from organisms ranging from vertebrates to C. elegans and typically see >80% identity across the orthologs. I used this MSA with HMMER to scan the transcriptomes and got the same poor matches as the BLAST hits (when it found hits at all).
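For anyone wanting to put the ">80%" and "30%" figures on the same footing, here is a minimal sketch of a pairwise percent-identity calculation over two rows of an alignment. The function name `percent_identity` is made up for illustration, and the choice to skip any column containing a gap is an assumption; other tools use different denominators, so the numbers may not match BLAST's exactly.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity between two aligned sequences of equal length.

    Assumption: columns where either sequence has a gap ('-' or '.')
    are skipped, so the denominator is the number of gap-free columns.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "-." or y in "-.":
            continue  # ignore gapped columns
        compared += 1
        if x == y:
            matches += 1
    # Return 0.0 when no gap-free columns exist, to avoid dividing by zero
    return 100.0 * matches / compared if compared else 0.0


if __name__ == "__main__":
    # Toy aligned rows, not real data
    print(percent_identity("MKTLV", "MRTLA"))
```

Running this on each ortholog pair in the MSA, and then on the best transcriptome hit aligned against the same MSA, would show whether the hit is an outlier along the whole length or only outside a conserved domain.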
In desperation, I've also tried aligning my RNA-seq reads to the consensus sequence with bowtie2 to see if maybe the sequence is out there but missing from the transcriptomes. I set bowtie2's penalty for Ns to zero since the consensus sequence had lots of Ns. This also yielded very few aligned reads.
Despite all this, I believe the gene exists given its biological importance, but I am at a loss as to how to find the sequence. I'm still learning a lot in this field and would like to hear from more experienced hands: what other steps could I take to find this gene sequence in my organism?
From your own RNA-seq data? Aligning short reads to a transcriptome to find a gene is probably going to work about as well as you discovered. It sounds like you are picking up some domain-level homology with the approaches you have tried.
Have you thought of doing a transcriptome assembly with your own data, then translating the assembly to do protein-based searches? You may have a bit better luck that way.
Thanks for the thought. Yes, one of the two transcriptomes I've been searching is my own, assembled from my reads. I've been using tblastn on this assembly. Is there some benefit to translating my transcriptome up front with TransDecoder.LongOrfs or something similar, rather than letting tblastn do the translation?
Searches in protein space are preferred, so there may be some benefit to translating your assembly and trying a direct blastp search. Keep the search space small to begin with so you have a better chance of finding the protein. If this gene is essential, it should be represented in your RNA-seq data unless you did not sequence deeply enough and/or the gene is low-copy or expressed at a low level.
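To make "translating your assembly" concrete, here is a rough sketch of a six-frame translation that pulls out stop-to-stop peptides from a single contig. It is not TransDecoder and makes several assumptions: the standard genetic code, no requirement for a start codon (handy for fragmentary contigs), and an arbitrary `min_len` cutoff. The function names are illustrative only.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T, N only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate frame 1 of a sequence; unknown codons become 'X'."""
    seq = seq.upper().replace("U", "T")
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_orfs(seq, min_len=30):
    """Yield (frame, peptide) for stop-to-stop segments of >= min_len residues.

    Assumption: no start codon is required, so partial ORFs at contig
    ends are kept.
    """
    seq = seq.upper().replace("U", "T")
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            protein = translate(s[offset:])
            for pep in protein.split("*"):
                if len(pep) >= min_len:
                    yield f"{strand}{offset + 1}", pep


if __name__ == "__main__":
    # Toy contig, not real data
    for frame, pep in six_frame_orfs("ATGAAATTTGGGTAA", min_len=4):
        print(frame, pep)
```

The resulting peptides could be written to a FASTA file and turned into a protein database (for example with makeblastdb and -dbtype prot) for blastp, or searched directly with your HMMER profile. For a whole assembly, a dedicated tool such as TransDecoder is likely the better choice; this only shows what the translation step involves.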