convert BUSCO gff files to SNAP HMMs
2
1
Entering edit mode
7.0 years ago
kaylamh6 ▴ 10

I'm would like to use the assemblage pipeline to annotate my genome, but I don't have access to the program CEGMA. I do, however, have access to BUSCO. In the pipeline you use the gff file produced by CEGMA to generate a SNAP HMM using the following code:

cegma2zff output.cegma.gff ../$genome
fathom genome.ann genome.dna -categorize 1000
fathom -export 1000 -plus uni.ann uni.dna
forge export.ann export.dna
hmm-assembler.pl $genome . > ../$genome.cegmasnap.hmm

I'm wondering whether a similar process can be undertaken using the gff files that BUSCO outputs. BUSCO generates a number of gff files, which I think correspond to the different contigs from your assembly within which BUSCO genes have been identified. I can't find a ton of documentation related to either the CEGMA or BUSCO gff files, and I know there can be differences depending on the gff version and program that produces the files. Has anyone tried using the BUSCO output to do something similar to this? Or is anyone knowledgable about the similarities/differences in the gff files generated by the two programs?

Thanks!

BUSCO CEGMA assemblage annotation • 3.6k views
ADD COMMENT
2
Entering edit mode
6.8 years ago
jean.elbers ★ 1.7k

Here is what I did for running BUSCO v3

python ~/bin/busco/scripts/run_BUSCO.py -i genome.fasta \
-o busco_test -l /eukaryota_odb9/ --mode genome --long

So here is what I did in order to convert gff files produced by augustus into a format that could be read by SNAP. Note that I used bedtools merge because fathom complained that some of the exons were overlapping. I am sure someone could write the code more efficiently and elegantly, but it suffices for the ~250 files I am working with.

cd /directory_with_augustus_gffs/
ls *.gff | perl -pe "s/.gff//g" > samples 
while read i;do
    file=$i
    grep "exon" $i.gff | sort -k 1,1 -k4,4n | ~/bin/bedtools-2.26.0/bin/bedtools merge |\
    awk -v OFS='\t' -v b="$file" '
    {if ($7 == "-" && NR == 1)
        {a="Terminal";print $1,$2,a,$4,$5,$6,$7,$8,b;}}
    {if ($7 == "-" && NR>1)
        {a="Internal";print $1,$2,a,$4,$5,$6,$7,$8,b;}}
    END {if ($7 == "-")
        {a="First";print $1,$2,a,$4,$5,$6,$7,$8,b;}}
    {if ($7 == "+" && NR==1)
        {a="First";print $1,$2,a,$4,$5,$6,$7,$8,b;}}
    {if ($7 == "+" && NR>1)
        {a="Internal";print $1,$2,a,$4,$5,$6,$7,$8,b;}}
    END {if ($7 == "+")
        {a="Terminal";print $1,$2,a,$4,$5,$6,$7,$8,b;}}' >> busco_augustus_all.gff
done < samples
#
~/bin/maker/bin/cegma2zff busco_augustus_all.gff genome.fasta
~/bin/maker/exe/snap/fathom genome.ann genome.dna -categorize 1000
~/bin/maker/exe/snap/fathom -export 1000 -plus uni.ann uni.dna
~/bin/maker/exe/snap/forge export.ann export.dna
~/bin/maker/exe/snap/hmm-assembler.pl drom . > ../drom.buscosnap.hmm

Here's an explanation of the code

cd /directory_with_augustus_gffs/
ls *.gff | perl -pe "s/.gff//g" > samples # get all files with the extension .gff, remove the extension with perl and then put them in a file called samples
while read i;do # for each line in samples, read in "i" or the sample name and do the following
    file=$i # create the variable file and assign "i" or the current sample name to it
    grep "exon" $i.gff | \ # get only the rows with exon in them
    sort -k 1,1 -k4,4n | \ # sort the file by chromosome then start position just in case
    ~/bin/bedtools-2.26.0/bin/bedtools merge |\ # merge overlapping exons if any
    awk -v OFS='\t' -v b="$file" ' # initiate awk, make output field separator tabs, and assign the variable $file (the current sample) to the variable b
    {if ($7 == "-" && NR == 1) 
        {a="Terminal";print $1,$2,a,$4,$5,$6,$7,$8,b;}} # if the sequence is on the minus strand and it is the first line, then rename "exon" to "Terminal"
    {if ($7 == "-" && NR>1)
        {a="Internal";print $1,$2,a,$4,$5,$6,$7,$8,b;}} # if the sequence is on the minus strand and it is not the first line, then rename "exon" to "Internal"
    END {if ($7 == "-")
        {a="First";print $1,$2,a,$4,$5,$6,$7,$8,b;}} # if the sequence is on the minus strand and it is the last line, then rename "exon" to "First"
    {if ($7 == "+" && NR==1)
        {a="First";print $1,$2,a,$4,$5,$6,$7,$8,b;}} # if the sequence is on the plus strand and it is the first line, then rename "exon" to "First"
    {if ($7 == "+" && NR>1)
        {a="Internal";print $1,$2,a,$4,$5,$6,$7,$8,b;}} 
    END {if ($7 == "+")
        {a="Terminal";print $1,$2,a,$4,$5,$6,$7,$8,b;}}' \ # if the sequence is on the minus strand and it is the last line, then rename "exon" to "Terminal"
>> busco_augustus_all.gff # append the output for each iteration to the file busco_augustus_all.gff
done < samples
#
~/bin/maker/bin/cegma2zff busco_augustus_all.gff genome.fasta
~/bin/maker/exe/snap/fathom genome.ann genome.dna -categorize 1000
~/bin/maker/exe/snap/fathom -export 1000 -plus uni.ann uni.dna
~/bin/maker/exe/snap/forge export.ann export.dna
~/bin/maker/exe/snap/hmm-assembler.pl drom . > ../drom.buscosnap.hmm

You should also see the following maker-devel post: https://groups.google.com/forum/#!topic/maker-devel/vp8R06VVQGQ

ADD COMMENT
0
Entering edit mode

This code cannot act as you say. Because gff3 output from BUSCO does not contain 'exon'. And 'bedtools merge' command wipe out several gff3 information like belows.

sc0000161 6923 7190

sc0000161 7623 8049

So, output using your code is empty. Some fix will be needed.

ADD REPLY
0
Entering edit mode

I have the same problem.

I would like convert my busco file to CEGMA file or to zff.

ADD REPLY
0
Entering edit mode

You are right that there is problem in the code. In the Augustus GFF3 files that I tested the code on, there was a field containing exon. I need to figure out what version of Augustus and BUSCO I was using because I can't seem to find exon in the newer GFF3 files from BUSCO v.3.0.1 and Augustus 3.3.2. I fixed the bedtools merge and awk commands. I must have done testing without bedtools merge. Note that I haven't tested the newer code with the downstream tools cegma2zff and fathom forge and hmm-assembler.pl

cd /directory_with_augustus_gffs/
ls *.gff | perl -pe "s/.gff//g" > samples 
while read i;do
    file=$i
    grep "exon" $i.gff | bedtools sort |bedtools merge -s -c 7 -o distinct |\
    awk -v OFS='\t' -v b="$file" '
    {if ($4 == "-" && NR == 1)
        {a="Terminal";print $1,"AUGUSTUS",a,$2,$3,".",$4,".",b;}}
    {if ($4 == "-" && NR>1)
        {a="Internal";print $1,"AUGUSTUS",a,$2,$3,".",$4,".",b;}}
    END {if ($4 == "-")
        {a="First";print $1,"AUGUSTUS",a,$2,$3,".",$4,".",b;}}
    {if ($4 == "+" && NR==1)
        {a="First";print $1,"AUGUSTUS",a,$2,$3,".",$4,".",b;}}
    {if ($4 == "+" && NR>1)
        {a="Internal";print $1,"AUGUSTUS",a,$2,$3,".",$4,".",b;}}
    END {if ($4 == "+")
        {a="Terminal";print $1,"AUGUSTUS",a,$2,$3,".",$4,".",b;}}' >> busco_augustus_all.gff
done < samples
#
uniq -f 3 -u busco_augustus_all.gff > busco_augustus_all2.gff
uniq -f 3 -D busco_augustus_all.gff |sed -n 'n;p' > busco_augustus_all3.gff
cat busco_augustus_all2.gff busco_augustus_all3.gff |bedtools sort > busco_augustus_all4.gff

~/bin/maker/bin/cegma2zff busco_augustus_all4.gff genome.fasta
~/bin/maker/exe/snap/fathom genome.ann genome.dna -categorize 1000
~/bin/maker/exe/snap/fathom -export 1000 -plus uni.ann uni.dna
~/bin/maker/exe/snap/forge export.ann export.dna
~/bin/maker/exe/snap/hmm-assembler.pl drom . > ../drom.buscosnap.hmm

edit: I was using BUSCO v.3.0.1 and Augustus 3.2.3 to generate GFF3 files with exon in the column 3, type

edit2: I've tested the above with cegma2zff, but it doesn't seem to work with fathom (at least with SNAP version 2017-03-01).

ADD REPLY
1
Entering edit mode
7.0 years ago

While I haven't directly used assemblage, I've done some testing with MAKER, where I tended to get better results if I used protein / EST sequences (or an RNA-Seq de novo assembly) rather than starting with de novo gene predictions. Since it looks like assemblage is a modification of the MAKER workflow, it might be worth noting that the relevant control parameter for making initial gene predictions with protein sequences is protein2genome=1 and EST sequences est2genome=1. On the link you provided, it looked like this was the recommended setting.

So, I can't really comment about CEGMA / BUSCO, but you may be able to use a reference set (or a reference set in a related organism) set to get a potentially better set of annotations. If there really is no appropriate reference set, you may need to change the MAKER configuration file.

Also, at least in my experience, I found it a little earlier to get MAKER working on my Mac over my Linux VM, but I had to run it with the -nolock option on a single core.

ADD COMMENT

Login before adding your answer.

Traffic: 1578 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6