Homology based gene prediction
Entering edit mode
4.5 years ago
bio_d ▴ 20

I am trying gene identification and annotation of a draft genome (de novo assembly) of a non-model species. I have downloaded protein (fasta) files of three closely related species and employed tblastn with the proteins as query and repeat masked draft (de novo) assembly as target and obtained output with outfmt 6.

This is a pipeline I have come across:

fastaindex genome.fa genome.index
fastafetch genome.fa genome.index seq > seq.fa
fastasubseq seq.fa start length > genomic.fa

exonerate -m p2g --showtargetgff -q $sample -t genomic.fa > $sample.exonerate.gff
egrep -w exon $sample.exonerate.gff > $sample.exon.gff

fastaseqfromGFF.pl genomic.fa $sample.exon.gff > $sampl.pred.nuc.fa

fastatranslate -F 1 $sample_pred.nuc.fa > $sample_pred.aa.fa
t_coffee $sample.pred.aa.fa $sample > $sample.tcoffee

Since the draft genome is fragmented there were multiple hits for any given protein. I have filtered the output of the tblastn step to have hits that satisfy the following pattern which guarantees that query hits have monotonically increasing coordinates and no overlap of query sequences (protein)

    proteinSequenceID_n_th q_start_1 q_end_1 scaffold_i  s_start_i s_end_i .........
    proteinSequenceID_n_th q_start_2 q_end_2 scaffold_j  s_start_j s_end_j .........
    proteinSequenceID_n_th q_start_3 q_end_3 scaffold_k  s_start_k s_end_k .........  
    proteinSequenceID_m_th q_start_1 q_end_1 scaffold_t  s_start_t s_end_t .........

where q_end_1 <= q_start_2 and q_end_2 <= q_start_3. Also after initial filtering scaffold lengths are kept positive by switching s_start with s_end in case s_start > s_end (necessary for employing fastasubseq which only accepts positive length).

Now after I employed the entire pipeline in a loop for all the protein sequences (15k to 20k for each of the three species) I am left with numerous gff files, fa files (predicted amino acids) and tcoffee alignment files.

#!/usr/bin/env bash

noOverlapTblastnOutput=$1    #file with start and end of proteins and start-end of scaffolds(having hits) with no overlap and scaffold start < end

GENOME=$2                    #Reference genome fasta file 

GENOMEINDEX=$3               #Reference genome indexed file 

PROTEINFILE=$4               #Multiple Proteins in fasta format with each sequence in single line only

IFS=$'\n'; for line in `cat $noOverlapTblastnOutput`; do qid=`echo $line | awk '{print $1}'`; sid=`echo $line | awk '{print $4}'`; sstart=`echo $line | awk '{print $5}'`; send=`echo $line | awk '{print $6}'`; Length=$((send - sstart)); fastafetch $GENOME $GENOMEINDEX $sid > scaffold.fa; fastasubseq scaffold.fa $sstart $Length > subsequence.fa; grep -A 1 ">"$qid $PROTEINFILE > queryFile.fa; exonerate -m p2g --showalignment FALSE --showtargetgff TRUE -q queryFile.fa -t subsequence.fa >> $qid".exonerate.gff"; done; unset IFS

IFS=$'\n'; for line in `cat $noOverlapTblastnOutput`; do qid=`echo $line | awk '{print $1}'`; sid=`echo $line | awk '{print $4}'`; sstart=`echo $line | awk '{print $5}'`; send=`echo $line | awk '{print $6}'`; Length=$((send - sstart)); fastafetch $GENOME $GENOMEINDEX $sid > scaffold.fa; fastasubseq scaffold.fa $sstart $Length > subsequence.fa; grep -A 1 ">"$qid $PROTEINFILE > queryFile.fa; egrep -w exon $qid".exonerate.gff" > $qid".exon.gff"; fastaseqfromGFF.pl subsequence.fa $qid".exon.gff" > $qid".pred.nuc.fa"; fastatranslate -F 1 $qid".pred.nuc.fa" > $qid".pred.nuc.aa.fa"; t_coffee $qid".pred.nuc.aa.fa" queryFile.fa > $qid".tcoffee"; done; unset IFS

Now I am lost in this sea of files as I need to find a non-redundant set of genes. Will anyone be kind enough to tell if the above pipeline is correct and if so how do I find the non-redundant set of genes in the following step.

Thanks in advance

gene annotation homology-based gene prediction • 1.7k views
Entering edit mode

Wouldn’t it be simpler to use an existing annotation pipeline? Even with it being non-model, most annotation tools should do a semi decent job I would expect.

Entering edit mode

Pardon my ignorance but I failed to find any pipeline explicitly for homology-based gene model prediction. The existing pipelines are mostly for de novo gene prediction, e.g. MAKER. The steps I am following roughly follows the steps given in most papers on draft genome/gene prediction (which do not of course mention every details, for brevity, I presume). Could you please suggest a pipeline and/or a paper that provides a detailed account of homology based gene- prediction? Thank you very much for your help.

Entering edit mode

Sorry for the late reply, this must have gotten buried in my notifications.

I can only give you limited examples, as I'm only familiar with microbial pipelines, but prokka, for example, can be given a list of 'trusted proteins' which to annotate from via homology in the first instance. It then iteratively uses Prodigal for de novo prediction of CDSs, and rounds of HMM/blast refinement to attribute annotations to them all.

Its a sort of hybrid approach. I'm still not totally clear why you only want to use homology bases methods though?

Entering edit mode
4.3 years ago

Hi bio_d,

did you check GeMoMa?

Latest paper: https://rdcu.be/QbKc

Homepage: http://www.jstacs.de/index.php/GeMoMa

GeMoMa is a homology-based gene prediction program. GeMoMa has several desirable features:

  • uses amino acid sequence conservation
  • uses intron position conservation
  • allows to use RNA-seq data
  • allows to use several reference organisms (in your case 3)
  • delivers a final prediction

Of course, there are also other homology-based gene prediction programs, as for instance, genBlastG or exonerate. However, GeMoMa performs favorably in a benchmark: https://nar.oxfordjournals.org/content/44/9/e89

Btw, I'm the main developer of GeMoMa.

Entering edit mode

Thank you, Jens. I will try your homology-based gene prediction program.


Login before adding your answer.

Traffic: 1473 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6