Find a gene of interest in a species genome
2
1
Entering edit mode
2.0 years ago
▴ 20

Find a gene of interest in a species. I have a gene of a model animal for example (a certain gene in zebrafish), and I want to get this coding region (cds) of all fish for molecular evolution analysis. Due to the presence of introns, blasting the genome results in multiple hits (blast hits per exon). I need to manually adjust each hits and get the cds area I want, I want to ask if there is an easy way to achieve what I want. Through a cds region, the corresponding cds region is obtained in the genome of the species of interest.enter image description here

blast • 1.6k views
ADD COMMENT
3
Entering edit mode
2.0 years ago
Michael 54k

Easy is mostly relative to your skills, it is not trivial to do right though. Rule of thumb: the more conserved the protein sequence, the easier it becomes. For high sequence identities e.g. over 50% any approach is likely to work fine.

Easy: find orthologous sequences in fish species by using web-blast and extract the sequences of blast hits (given defined cut-offs) directly from GenBank. These will often be from predicted protein-coding sequences, which is fine if the annotation is at least half-reliable. This is all possible via the web-interface.

Easy: Extract orthologues via Ensembl's compara interface (gene tree) for the gene of interest. That will give you some orthologues, but only those in Ensembl

Intermediate: get the CDS instead of protein sequences or do the same on the commandline

Harder: if you want to include badly or unannoated genomes, you can use exonerate in est2genome or protein2genome mode. That may work well for well-conserved proteins. You can use your blast results to restrict contigs to search because exonerate alignment will be much slower than a blast run.

The following command will get you the CDS sequence:

   exonerate --model protein2genome -q query.faa -t target.fna --refine full --ryo ">%qi vs. %ti %td  %m aln.length: %qal score: %s %%-ident: %pi %%-sim: %ps strand: %tS  CDS sequence \n%tcs"

Hardest: Annotate the unannotated genomes from scratch using MAKER2, BRAKER, etc. but I do not recommend it. The result for any single protein may not be better than the previous alternative.

ADD COMMENT
0
Entering edit mode

I may need to learn the exonerate method, thank you very much for your suggestion, I have benefited a lot.This is very important to my subject :) I am going to learn it in the next few days, can I contact you on this forum in the future?

ADD REPLY
0
Entering edit mode

can I contact you on this forum in the future?

Always :)

I have used the exonerate method a lot, I can post some examples for extraction of protein sequences.

ADD REPLY
0
Entering edit mode

hey,Michael

I'm getting close to it with:

exonerate --model protein2genome query_aa.fa reference.fa

But I can't analyse the output file.can you give me some advices or operation example?

For a simple purpose, use a cds sequence of a protein gene I own, and the genome of the species of interest (unannotated) to find the protein gene in the species, and analyze its copy number and amino acid mutation.

I know the idea is simple and the operation may be complicated, as a beginner I have to learn, if you can help me I would appreciate it.

ADD REPLY
0
Entering edit mode

I am using the following command:

exonerate --model protein2genome -q query.faa -t target.fna --refine full --ryo ">%qi vs. %ti %td  %m aln.length: %qal score: %s %%-ident: %pi %%-sim: %ps strand: %tS  CDS sequence \n%tcs"

To speed up you could leave the --refine parameter out, also, only put the sequences for which you got tblastn hits into target.fna

ADD REPLY
0
Entering edit mode

When using your script, I got this error

1 ERROR was encountered in argument processing

[ 1 ] : Too many unflagged arguments

can you tell me where is the error?

ADD REPLY
0
Entering edit mode

Possibly it is a version problem? I am using exonerate version 2.4.0 on linux. I will try with a later version to reproduce.

ADD REPLY
0
Entering edit mode

it is an illegal indent because I typed an extra space. now,it running.thanks a lot.Looking forward to the outcome of the program

ADD REPLY
0
Entering edit mode

I just noticed that the output will be a bit bloated because it will also output the full alignment in text format. I like to have this for manual inspection of alignments and selecting the best one. But if you only want a FASTA style output, you might want to add --showalignment FALSE --showvulgar FALSE to the parameters.

ADD REPLY
0
Entering edit mode
2.0 years ago

short answer: no .

going from a blast hit(s) to CDS is possible but certainly not straightforward .

Best (and quickest) way would be to get our hands on the predicted CDSs from that genome and blast against those in stead of the genome.

An alternative could be to not use blast but rather an aligner that is splice aware, something like est2genome, GenomeThreader, SNAP, ... (== old school transcript aligners) and provide the query CDS as in put to those. They will align it such that the result will resemble a gene-structure/CDS.

Unless this is a homework/Class assignment then you will have to manually annotate them yourself (if that is the goal of the exercise) ;)

ADD COMMENT
0
Entering edit mode

Predicted protein CDSs based on best BLASTN and TBLASTX hits.Is this idea feasible?This is my initial idea, but it takes a lot of time to merge hits manually.

What software should I use if I follow blast based results? You said est2genome, GenomeThreader, SNAP software, do I need transcriptome sequences? I have no transcriptome data, just the sequence of a gene of interest. I am only interested in a few proteins, and only need to annotate these protein genes, can these software do it?

Can you describe the use of the software in more detail? I'm a beginner in bioinformatics, I would appreciate if you could go into a little more detail.

ADD REPLY

Login before adding your answer.

Traffic: 2589 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6