Question

convert BUSCO gff files to SNAP HMMs

1

Entering edit mode

8.2 years ago

kaylamh6 ▴ 10

I'm would like to use the assemblage pipeline to annotate my genome, but I don't have access to the program CEGMA. I do, however, have access to BUSCO. In the pipeline you use the gff file produced by CEGMA to generate a SNAP HMM using the following code:

cegma2zff output.cegma.gff ../$genome
fathom genome.ann genome.dna -categorize 1000
fathom -export 1000 -plus uni.ann uni.dna
forge export.ann export.dna
hmm-assembler.pl $genome . > ../$genome.cegmasnap.hmm

I'm wondering whether a similar process can be undertaken using the gff files that BUSCO outputs. BUSCO generates a number of gff files, which I think correspond to the different contigs from your assembly within which BUSCO genes have been identified. I can't find a ton of documentation related to either the CEGMA or BUSCO gff files, and I know there can be differences depending on the gff version and program that produces the files. Has anyone tried using the BUSCO output to do something similar to this? Or is anyone knowledgable about the similarities/differences in the gff files generated by the two programs?

Thanks!

BUSCO CEGMA assemblage annotation • 4.2k views

ADD COMMENT • link updated 8.0 years ago by jean.elbers ★ 1.7k • written 8.2 years ago by kaylamh6 ▴ 10

score 2 · Answer 1 · 2017-06-21

Here is what I did for running BUSCO v3

python ~/bin/busco/scripts/run_BUSCO.py -i genome.fasta \
-o busco_test -l /eukaryota_odb9/ --mode genome --long

So here is what I did in order to convert gff files produced by augustus into a format that could be read by SNAP. Note that I used bedtools merge because fathom complained that some of the exons were overlapping. I am sure someone could write the code more efficiently and elegantly, but it suffices for the ~250 files I am working with.

cd /directory_with_augustus_gffs/
ls *.gff | perl -pe "s/.gff//g" > samples 
while read i;do
    file=$i
    grep "exon" $i.gff | sort -k 1,1 -k4,4n | ~/bin/bedtools-2.26.0/bin/bedtools merge |\
    awk -v OFS='\t' -v b="$file" '
    {if ($7 == "-" && NR == 1)
        {a="Terminal";print $1,$2,a,$4,$5,$6,$7,$8,b;}}
    {if ($7 == "-" && NR>1)
        {a="Internal";print $1,$2,a,$4,$5,$6,$7,$8,b;}}
    END {if ($7 == "-")
        {a="First";print $1,$2,a,$4,$5,$6,$7,$8,b;}}
    {if ($7 == "+" && NR==1)
        {a="First";print $1,$2,a,$4,$5,$6,$7,$8,b;}}
    {if ($7 == "+" && NR>1)
        {a="Internal";print $1,$2,a,$4,$5,$6,$7,$8,b;}}
    END {if ($7 == "+")
        {a="Terminal";print $1,$2,a,$4,$5,$6,$7,$8,b;}}' >> busco_augustus_all.gff
done < samples
#
~/bin/maker/bin/cegma2zff busco_augustus_all.gff genome.fasta
~/bin/maker/exe/snap/fathom genome.ann genome.dna -categorize 1000
~/bin/maker/exe/snap/fathom -export 1000 -plus uni.ann uni.dna
~/bin/maker/exe/snap/forge export.ann export.dna
~/bin/maker/exe/snap/hmm-assembler.pl drom . > ../drom.buscosnap.hmm

Here's an explanation of the code

cd /directory_with_augustus_gffs/
ls *.gff | perl -pe "s/.gff//g" > samples # get all files with the extension .gff, remove the extension with perl and then put them in a file called samples
while read i;do # for each line in samples, read in "i" or the sample name and do the following
    file=$i # create the variable file and assign "i" or the current sample name to it
    grep "exon" $i.gff | \ # get only the rows with exon in them
    sort -k 1,1 -k4,4n | \ # sort the file by chromosome then start position just in case
    ~/bin/bedtools-2.26.0/bin/bedtools merge |\ # merge overlapping exons if any
    awk -v OFS='\t' -v b="$file" ' # initiate awk, make output field separator tabs, and assign the variable $file (the current sample) to the variable b
    {if ($7 == "-" && NR == 1) 
        {a="Terminal";print $1,$2,a,$4,$5,$6,$7,$8,b;}} # if the sequence is on the minus strand and it is the first line, then rename "exon" to "Terminal"
    {if ($7 == "-" && NR>1)
        {a="Internal";print $1,$2,a,$4,$5,$6,$7,$8,b;}} # if the sequence is on the minus strand and it is not the first line, then rename "exon" to "Internal"
    END {if ($7 == "-")
        {a="First";print $1,$2,a,$4,$5,$6,$7,$8,b;}} # if the sequence is on the minus strand and it is the last line, then rename "exon" to "First"
    {if ($7 == "+" && NR==1)
        {a="First";print $1,$2,a,$4,$5,$6,$7,$8,b;}} # if the sequence is on the plus strand and it is the first line, then rename "exon" to "First"
    {if ($7 == "+" && NR>1)
        {a="Internal";print $1,$2,a,$4,$5,$6,$7,$8,b;}} 
    END {if ($7 == "+")
        {a="Terminal";print $1,$2,a,$4,$5,$6,$7,$8,b;}}' \ # if the sequence is on the minus strand and it is the last line, then rename "exon" to "Terminal"
>> busco_augustus_all.gff # append the output for each iteration to the file busco_augustus_all.gff
done < samples
#
~/bin/maker/bin/cegma2zff busco_augustus_all.gff genome.fasta
~/bin/maker/exe/snap/fathom genome.ann genome.dna -categorize 1000
~/bin/maker/exe/snap/fathom -export 1000 -plus uni.ann uni.dna
~/bin/maker/exe/snap/forge export.ann export.dna
~/bin/maker/exe/snap/hmm-assembler.pl drom . > ../drom.buscosnap.hmm

You should also see the following maker-devel post: https://groups.google.com/forum/#!topic/maker-devel/vp8R06VVQGQ

score 1 · Answer 2 · 2017-05-08

While I haven't directly used assemblage, I've done some testing with MAKER, where I tended to get better results if I used protein / EST sequences (or an RNA-Seq de novo assembly) rather than starting with de novo gene predictions. Since it looks like assemblage is a modification of the MAKER workflow, it might be worth noting that the relevant control parameter for making initial gene predictions with protein sequences is protein2genome=1 and EST sequences est2genome=1. On the link you provided, it looked like this was the recommended setting.

So, I can't really comment about CEGMA / BUSCO, but you may be able to use a reference set (or a reference set in a related organism) set to get a potentially better set of annotations. If there really is no appropriate reference set, you may need to change the MAKER configuration file.

Also, at least in my experience, I found it a little earlier to get MAKER working on my Mac over my Linux VM, but I had to run it with the -nolock option on a single core.