Question: convert BUSCO gff files to SNAP HMMs
1
gravatar for kaylamh6
22 months ago by
kaylamh610
United States
kaylamh610 wrote:

I'm would like to use the assemblage pipeline to annotate my genome, but I don't have access to the program CEGMA. I do, however, have access to BUSCO. In the pipeline you use the gff file produced by CEGMA to generate a SNAP HMM using the following code:

cegma2zff output.cegma.gff ../$genome
fathom genome.ann genome.dna -categorize 1000
fathom -export 1000 -plus uni.ann uni.dna
forge export.ann export.dna
hmm-assembler.pl $genome . > ../$genome.cegmasnap.hmm

I'm wondering whether a similar process can be undertaken using the gff files that BUSCO outputs. BUSCO generates a number of gff files, which I think correspond to the different contigs from your assembly within which BUSCO genes have been identified. I can't find a ton of documentation related to either the CEGMA or BUSCO gff files, and I know there can be differences depending on the gff version and program that produces the files. Has anyone tried using the BUSCO output to do something similar to this? Or is anyone knowledgable about the similarities/differences in the gff files generated by the two programs?

Thanks!

assemblage busco annotation cegma • 1.3k views
ADD COMMENTlink modified 21 months ago by jean.elbers720 • written 22 months ago by kaylamh610
2
gravatar for jean.elbers
21 months ago by
jean.elbers720
jean.elbers720 wrote:

Here is what I did for running BUSCO v3

python ~/bin/busco/scripts/run_BUSCO.py -i genome.fasta \
-o busco_test -l /eukaryota_odb9/ --mode genome --long

So here is what I did in order to convert gff files produced by augustus into a format that could be read by SNAP. Note that I used bedtools merge because fathom complained that some of the exons were overlapping. I am sure someone could write the code more efficiently and elegantly, but it suffices for the ~250 files I am working with.

cd /directory_with_augustus_gffs/
ls *.gff | perl -pe "s/.gff//g" > samples 
while read i;do
    file=$i
    grep "exon" $i.gff | sort -k 1,1 -k4,4n | ~/bin/bedtools-2.26.0/bin/bedtools merge |\
    awk -v OFS='\t' -v b="$file" '
    {if ($7 == "-" && NR == 1)
        {a="Terminal";print $1,$2,a,$4,$5,$6,$7,$8,b;}}
    {if ($7 == "-" && NR>1)
        {a="Internal";print $1,$2,a,$4,$5,$6,$7,$8,b;}}
    END {if ($7 == "-")
        {a="First";print $1,$2,a,$4,$5,$6,$7,$8,b;}}
    {if ($7 == "+" && NR==1)
        {a="First";print $1,$2,a,$4,$5,$6,$7,$8,b;}}
    {if ($7 == "+" && NR>1)
        {a="Internal";print $1,$2,a,$4,$5,$6,$7,$8,b;}}
    END {if ($7 == "+")
        {a="Terminal";print $1,$2,a,$4,$5,$6,$7,$8,b;}}' >> busco_augustus_all.gff
done < samples
#
~/bin/maker/bin/cegma2zff busco_augustus_all.gff genome.fasta
~/bin/maker/exe/snap/fathom genome.ann genome.dna -categorize 1000
~/bin/maker/exe/snap/fathom -export 1000 -plus uni.ann uni.dna
~/bin/maker/exe/snap/forge export.ann export.dna
~/bin/maker/exe/snap/hmm-assembler.pl drom . > ../drom.buscosnap.hmm

Here's an explanation of the code

cd /directory_with_augustus_gffs/
ls *.gff | perl -pe "s/.gff//g" > samples # get all files with the extension .gff, remove the extension with perl and then put them in a file called samples
while read i;do # for each line in samples, read in "i" or the sample name and do the following
    file=$i # create the variable file and assign "i" or the current sample name to it
    grep "exon" $i.gff | \ # get only the rows with exon in them
    sort -k 1,1 -k4,4n | \ # sort the file by chromosome then start position just in case
    ~/bin/bedtools-2.26.0/bin/bedtools merge |\ # merge overlapping exons if any
    awk -v OFS='\t' -v b="$file" ' # initiate awk, make output field separator tabs, and assign the variable $file (the current sample) to the variable b
    {if ($7 == "-" && NR == 1) 
        {a="Terminal";print $1,$2,a,$4,$5,$6,$7,$8,b;}} # if the sequence is on the minus strand and it is the first line, then rename "exon" to "Terminal"
    {if ($7 == "-" && NR>1)
        {a="Internal";print $1,$2,a,$4,$5,$6,$7,$8,b;}} # if the sequence is on the minus strand and it is not the first line, then rename "exon" to "Internal"
    END {if ($7 == "-")
        {a="First";print $1,$2,a,$4,$5,$6,$7,$8,b;}} # if the sequence is on the minus strand and it is the last line, then rename "exon" to "First"
    {if ($7 == "+" && NR==1)
        {a="First";print $1,$2,a,$4,$5,$6,$7,$8,b;}} # if the sequence is on the plus strand and it is the first line, then rename "exon" to "First"
    {if ($7 == "+" && NR>1)
        {a="Internal";print $1,$2,a,$4,$5,$6,$7,$8,b;}} 
    END {if ($7 == "+")
        {a="Terminal";print $1,$2,a,$4,$5,$6,$7,$8,b;}}' \ # if the sequence is on the minus strand and it is the last line, then rename "exon" to "Terminal"
>> busco_augustus_all.gff # append the output for each iteration to the file busco_augustus_all.gff
done < samples
#
~/bin/maker/bin/cegma2zff busco_augustus_all.gff genome.fasta
~/bin/maker/exe/snap/fathom genome.ann genome.dna -categorize 1000
~/bin/maker/exe/snap/fathom -export 1000 -plus uni.ann uni.dna
~/bin/maker/exe/snap/forge export.ann export.dna
~/bin/maker/exe/snap/hmm-assembler.pl drom . > ../drom.buscosnap.hmm

You should also see the following maker-devel post: https://groups.google.com/forum/#!topic/maker-devel/vp8R06VVQGQ

ADD COMMENTlink modified 21 months ago • written 21 months ago by jean.elbers720

This code cannot act as you say. Because gff3 output from BUSCO does not contain 'exon'. And 'bedtools merge' command wipe out several gff3 information like belows.

sc0000161 6923 7190

sc0000161 7623 8049

So, output using your code is empty. Some fix will be needed.

ADD REPLYlink modified 8 weeks ago • written 8 weeks ago by macarima0

I have the same problem.

I would like convert my busco file to CEGMA file or to zff.

ADD REPLYlink written 14 days ago by victorcana19910

You are right that there is problem in the code. In the Augustus GFF3 files that I tested the code on, there was a field containing exon. I need to figure out what version of Augustus and BUSCO I was using because I can't seem to find exon in the newer GFF3 files from BUSCO v.3.0.1 and Augustus 3.3.2. I fixed the bedtools merge and awk commands. I must have done testing without bedtools merge. Note that I haven't tested the newer code with the downstream tools cegma2zff and fathom forge and hmm-assembler.pl

cd /directory_with_augustus_gffs/
ls *.gff | perl -pe "s/.gff//g" > samples 
while read i;do
    file=$i
    grep "exon" $i.gff | bedtools sort |bedtools merge -s -c 7 -o distinct |\
    awk -v OFS='\t' -v b="$file" '
    {if ($4 == "-" && NR == 1)
        {a="Terminal";print $1,"AUGUSTUS",a,$2,$3,".",$4,".",b;}}
    {if ($4 == "-" && NR>1)
        {a="Internal";print $1,"AUGUSTUS",a,$2,$3,".",$4,".",b;}}
    END {if ($4 == "-")
        {a="First";print $1,"AUGUSTUS",a,$2,$3,".",$4,".",b;}}
    {if ($4 == "+" && NR==1)
        {a="First";print $1,"AUGUSTUS",a,$2,$3,".",$4,".",b;}}
    {if ($4 == "+" && NR>1)
        {a="Internal";print $1,"AUGUSTUS",a,$2,$3,".",$4,".",b;}}
    END {if ($4 == "+")
        {a="Terminal";print $1,"AUGUSTUS",a,$2,$3,".",$4,".",b;}}' >> busco_augustus_all.gff
done < samples
#
uniq -f 3 -u busco_augustus_all.gff > busco_augustus_all2.gff
uniq -f 3 -D busco_augustus_all.gff |sed -n 'n;p' > busco_augustus_all3.gff
cat busco_augustus_all2.gff busco_augustus_all3.gff |bedtools sort > busco_augustus_all4.gff

~/bin/maker/bin/cegma2zff busco_augustus_all4.gff genome.fasta
~/bin/maker/exe/snap/fathom genome.ann genome.dna -categorize 1000
~/bin/maker/exe/snap/fathom -export 1000 -plus uni.ann uni.dna
~/bin/maker/exe/snap/forge export.ann export.dna
~/bin/maker/exe/snap/hmm-assembler.pl drom . > ../drom.buscosnap.hmm

edit: I was using BUSCO v.3.0.1 and Augustus 3.2.3 to generate GFF3 files with exon in the column 3, type

edit2: I've tested the above with cegma2zff, but it doesn't seem to work with fathom (at least with SNAP version 2017-03-01).

ADD REPLYlink modified 13 days ago • written 13 days ago by jean.elbers720
1
gravatar for Charles Warden
22 months ago by
Charles Warden6.4k
Duarte, CA
Charles Warden6.4k wrote:

While I haven't directly used assemblage, I've done some testing with MAKER, where I tended to get better results if I used protein / EST sequences (or an RNA-Seq de novo assembly) rather than starting with de novo gene predictions. Since it looks like assemblage is a modification of the MAKER workflow, it might be worth noting that the relevant control parameter for making initial gene predictions with protein sequences is protein2genome=1 and EST sequences est2genome=1. On the link you provided, it looked like this was the recommended setting.

So, I can't really comment about CEGMA / BUSCO, but you may be able to use a reference set (or a reference set in a related organism) set to get a potentially better set of annotations. If there really is no appropriate reference set, you may need to change the MAKER configuration file.

Also, at least in my experience, I found it a little earlier to get MAKER working on my Mac over my Linux VM, but I had to run it with the -nolock option on a single core.

ADD COMMENTlink written 22 months ago by Charles Warden6.4k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1274 users visited in the last hour