Too many sequences of EVM output
2
0
Entering edit mode
5.9 years ago
Ginsea Chen ▴ 130

Dear all.

I used EVM to intergate all evidences (de novo, RNA-seq and protein evidences) into a single gff3 file (evm.out.gff3), and then found that too many sequences were contained in gff3 file. So my question is how to filter these sequences and then produce available gene models like lots of genome sequencing project.

Thanks all!

genome EVM • 2.7k views
1
Entering edit mode
3.7 years ago

I know this is older but now I had the same problem, this is how I solved some of it.

After this command:

$EVM_HOME/EvmUtils/recombine_EVM_partial_outputs.pl --partitions partitions_list.out --output_file_name evm.out  You'll have one folder for each scaffold/pseudomolecule. In there is a file called evm.out which looks perhaps like this: # EVM prediction: Mode:STANDARD S-ratio: 3.86 1711-2833 orient(+) score(2726.00) noncoding_equivalent(706.17) raw_noncoding(959.00) offset(252.83) 1711 1811 initial+ 1 2 {Augustus_C1-g1.t1;Augustus},{GeneMark.hmm_9038_t;GeneMark.hmm} 1812 1936 INTRON {Augustus_C1-g1.t1;Augustus} 1937 2288 internal+ 3 3 {Augustus_C1-g1.t1;Augustus},{Augustus_C1-g1.t2;Augustus} 2289 2365 INTRON {Augustus_C1-g1.t1;Augustus},{Augustus_C1-g1.t2;Augustus} 2366 2833 terminal+ 1 3 {Augustus_C1-g1.t1;Augustus},{Augustus_C1-g1.t2;Augustus} # EVM prediction: Mode:STANDARD S-ratio: 6.24 2875-4096 orient(+) score(5809.00) noncoding_equivalent(930.83) raw_noncoding(1741.00) offset(810.17) 2875 3002 initial+ 1 2 {Augustus_C1-g2.t1;Augustus} 3003 3079 INTRON {Augustus_C1-g2.t1;Augustus} 3080 3267 internal+ 3 1 {Augustus_C1-g2.t1;Augustus} 3268 3880 INTRON {Augustus_C1-g2.t1;Augustus} 3881 3920 internal+ 2 2 {Augustus_C1-g2.t1;Augustus} 3921 4089 INTRON {Augustus_C1-g2.t1;Augustus},{ev_type:Cufflinks/ID=MSTRG.2.1/Target=cuff-MSTRG.2.1;Cufflinks},{ev_type:Cufflinks/ID=MSTRG.2.5/Target=cuff-MSTRG.2.5;Cufflinks},{ev_type:Cufflinks/ID=MSTRG.2.8/Target=cuff-MSTRG.2.8;Cufflinks} 4090 4096 terminal+ 3 3 {Augustus_C1-g2.t1;Augustus}  We have two genes here, only one of them is partially supported by RNASeq. You 'could' filter the evm.out by the score, but I've noticed that genes with many exons and no RNASeq support have large scores, too. I wrote a small Python script which iterates over all evm.out files and removes predictions that have NO RNAseq in them, in my case, a line containing 'Cufflinks' because that's what I called my RNASeq evidence in the weights.txt and StringTie output gff3: I'm running this script in the root directory of my EvidenceModeler output (the folder that has one folder for each scaffold/pseudomolecule). This makes a ton of new files ending in nonRNARemoved.out. Then I checked whether I was happy with those files (for example, are they all empty?), then made backups of my original evm.out files: find . -maxdepth 2 -name evm.out -exec mv {} {}.bak \;  renamed my filtered files to evm.out: for l in$(find . -maxdepth 2 -name evm_nonRNARemoved.out); do mv $l${l%%_nonRNARemoved.out}.out; done


and ran the gff3 conversion step from the EvidenceModeler manual:

\$EVM_HOME/EvmUtils/convert_EVM_outputs_to_GFF3.pl  --partitions partitions_list.out --output evm.out  --genome genome.fasta


and concatenated the gff3 files:

find . -regex ".*evm.out.gff3" -exec cat {} \; > EVM.all_only_with_RNASeq_evidence.gff3


The output gff3 is weird and still contains the 'filtered' genes - in the second column, filtered genes have a '.', unfiltered genes have an 'EVM'.

In my case, this almost halved the gene predictions. This is kiiiiind of like filtering for AED < 1 in MAKER, but only kind of.

You could make this fancier - in my case, only one of the introns has to be supported by RNASeq, and even in that case it doesn't matter how well supported the intron is. You could be more stringent and ask to have all introns supported by RNASeq which will filter stuff even more. I leave that as an exercise to the reader :)

Other things to try:

• actually filter by score. In that case, I'd start by plotting the scores in a histogram. What are noncoding_equivalent etc.? I can't find those values described anywhere. You wouldn't need to change much in the above script, don't need to check for Cufflinks but parse out the score from the comment line

• filter the final gff3 by other ways, for example, get rid of genes without start or stop codons. You'd have to run gffread to get the CDS/AA I guess?

0
Entering edit mode
4.7 years ago

hello ,my friends,can i talk with you about the evm?

0
Entering edit mode