Question

Augustus result interpretation

10.4 years ago

bioinformaticssrm2011 ▴ 90

Hi,

I am new to Augustus and am using it to analyze a fungal genome.

I used this command:

augustus --species=human --UTR=on sequence.fasta > sequence_augustus.gff

I got this result:

# ----- prediction on sequence number 1 (length = 11239, name = contig00001) -----
# Constraints/Hints:
# Predicted genes for sequence number 1 on both strands
# start gene g1
contig00001    AUGUSTUS    gene    1476    4367    1    +    .    g1
contig00001    AUGUSTUS    transcript    1476    4367    .    +    .    g1.t1
contig00001    AUGUSTUS    tss    1476    1476    .    +    .    transcript_id "g1.t1"; gene_id "g1";
contig00001    AUGUSTUS    exon    1476    1559    .    +    .    transcript_id "g1.t1"; gene_id "g1";
contig00001    AUGUSTUS    exon    2030    4367    .    +    .    transcript_id "g1.t1"; gene_id "g1";
contig00001    AUGUSTUS    start_codon    2378    2380    .    +    0    transcript_id "g1.t1"; gene_id "g1";
contig00001    AUGUSTUS    CDS    2378    3223    .    +    0    transcript_id "g1.t1"; gene_id "g1";
contig00001    AUGUSTUS    stop_codon    3221    3223    .    +    0    transcript_id "g1.t1"; gene_id "g1";
contig00001    AUGUSTUS    tts    4367    4367    .    +    .    transcript_id "g1.t1"; gene_id "g1";
# protein sequence = [MLDLEIAKNCADGELDGKMVEEHPGVDNKDGSHTDSKGGNKAKGEADWAGQAELPSHVTQTPIETELPLTIAPAIDAH
# TATEGVVSAAVVAANTRATASIGTSNALGNLSKLPEISRSLIYHFVTAETDFPCVNSECRPPRLLSTISKTIKKEVELYYRCNHRTLMVELNPFEFHH
# LEDSFPAWSLEEYRAYMAPYMKPVTSRVHDLSIVDEVIIHLADDGPDCLTLLYTLALDHTDPTIPGQSDHEPFRLTLEYASAEGDETRDMHNHEYCTR
# FRTFFEP]
# end gene g1

My question: what does the above result indicate, and what is the protein sequence? If this protein corresponds to the first nucleotide sequence (contig 1) in my complete FASTA file (which contains many contigs), should I use this protein sequence to run a Batch CD-Search for annotation? Or are there other tools for annotating a fungal genome?

Any suggestions are welcome.

Regards!

Shashank

gene-sequence genome-sequencing • 8.4k views

updated 3.2 years ago by Ram 45k • written 10.4 years ago by bioinformaticssrm2011 ▴ 90

Ram · Answer 1 · 2015-03-04

The result is in GFF2 format, with additional comment lines that contain the predicted protein sequence. The GFF lines describe the predicted gene models. The amino acid sequence is the translation of the predicted coding sequence of the first predicted gene on contig 1, from the start codon (included as 'M') up to, but not including, the stop codon. In your example the CDS spans 2378 to 3223, which is 846 nt or 282 codons including the stop, matching the 281 amino acids shown. You can run BLASTP against NR for a quick search.
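As a rough sketch of pulling those protein sequences out of the output for a BLASTP or CD-Search run, something like the following Python could work. It assumes the comment layout shown in the question (`# start gene`, then `# protein sequence = [...]` wrapped over several `#` lines); `augustus_proteins` and `to_fasta` are names made up for this example and are not part of Augustus.

```python
import re


def augustus_proteins(gff_text):
    """Yield (gene_id, protein) pairs from Augustus output text.

    Assumes the comment layout shown in the question; other Augustus
    versions or options may format these lines differently.
    """
    gene_id = None
    chunks = None  # None means we are not inside a protein block
    for line in gff_text.splitlines():
        line = line.strip()
        m = re.match(r"#\s*start gene (\S+)", line)
        if m:
            gene_id = m.group(1)
            continue
        if line.startswith("# protein sequence = ["):
            chunks = []
            line = line.split("[", 1)[1]
        elif chunks is not None:
            # Continuation line: drop the leading "# " comment marker.
            line = line.lstrip("# ")
        else:
            continue  # ordinary GFF line or unrelated comment
        if line.endswith("]"):
            chunks.append(line[:-1])
            yield gene_id, "".join(chunks)
            chunks = None
        else:
            chunks.append(line)


def to_fasta(records, width=60):
    """Format (gene_id, protein) pairs as FASTA text."""
    out = []
    for gene_id, protein in records:
        out.append(f">{gene_id}")
        out.extend(protein[i:i + width] for i in range(0, len(protein), width))
    return "".join(line + "\n" for line in out)
```

With the file from the question, this could be called as `to_fasta(augustus_proteins(open("sequence_augustus.gff").read()))`. If a gene has several transcripts, the records would share one gene ID, so the headers would need a transcript suffix.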

However, your output is most likely bogus and cannot be used, because you used human training data (at least that is what the --species=human switch indicates), but your genome is fungal. This doesn't work. For predicting eukaryotic genes, you need appropriate training data from your organism or a closely related one, e.g. RNA-seq data, full-length cDNA, protein sequences from related organisms, etc.

If you BLAST the predicted protein sequence from your example, you get only very weak hits, none of them significant. Of course, this one might be an exception. If you don't believe me about the importance of training data, you could check all predictions this way; a very large proportion of the predicted proteins might have no significant hits, which would indicate that the prediction is not good.
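To put a number on that check, one option is to BLAST all predicted proteins and count how many have at least one significant hit. Here is a minimal sketch, assuming tabular BLAST+ output (`-outfmt 6` with the default twelve columns, where the E-value is column 11) and an arbitrary cutoff of 1e-5; `fraction_with_hits` is a hypothetical helper, not an existing tool.

```python
def fraction_with_hits(query_ids, blast_tab_text, max_evalue=1e-5):
    """Return the fraction of queries with at least one hit at or below max_evalue.

    query_ids: IDs of all predicted proteins that were searched.
    blast_tab_text: BLAST tabular output, assumed to be default -outfmt 6.
    max_evalue: significance cutoff; 1e-5 is an assumption, adjust as needed.
    """
    queries = set(query_ids)
    if not queries:
        return 0.0
    with_hit = set()
    for line in blast_tab_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        qseqid, evalue = fields[0], float(fields[10])
        if qseqid in queries and evalue <= max_evalue:
            with_hit.add(qseqid)
    return len(with_hit) / len(queries)
```

A low fraction does not prove the gene models are wrong, but it is a cheap warning sign before investing time in annotation.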

If you want state-of-the-art gene prediction, you should look at pipelines like MAKER, which combine several gene finders, such as Augustus and SNAP, with evidence integration, proper repeat masking, and re-training.