Hello everyone!
I am a beginner with bioinformatics but at the company I work at we have a genome assembly of one of our crops. I wanted to annotate the genome and to do so I used a piece of python code in ubuntu. I used the Augustus Arabidopsis database and obtained a .gff file which I can visualize in IGV agains the reference genome.
Now I want to do a functional annotation with Blast. I think I should extract the protein sequences from my .gff file and that does not seem to work.
I hope to get some advice on how to do the functional annotation as well as if I am on the right track at all by using the Augustus for annotation.
The best shot I had at trying to make it work was the code below, however, after a couple of seconds of running I got ""Error during BLAST: Error message from NCBI: Message ID#29 Error: Query string not found in the CGI context "message id#29 error query string not found in the cgi context"
Thanks in advance!!
Screenshot from the .gff file when I make it into a .gff.txt file
This is the code I used at the id#29 error:
#!/usr/bin/env python
from Bio import SeqIO
from Bio.Blast import NCBIWWW, NCBIXML
def extract_protein_sequences(gff_file, genome_file):
    protein_sequences = {}
    with open(gff_file) as gff:
        for line in gff:
            if '\tCDS\t' in line:
                parts = line.split('\t')
                start, end, strand = int(parts[3]), int(parts[4]), parts[6]
                seq_id = parts[0]
                seq_type = parts[2]
                attributes = parts[8].split(';')
                gene_id = None
                for attr in attributes:
                    if attr.startswith('gene_id='):
                        gene_id = attr.split('=')[1]
                        break
                if gene_id is None:
                    continue
                protein_id = f"{seq_id}_{gene_id}_protein"
                if seq_type == 'CDS':
                    sequence = genome_file[seq_id][start-1:end]
                    if strand == '-':
                        sequence = sequence.reverse_complement()
                    protein_sequences[protein_id] = sequence.translate()
    return protein_sequences
def run_blast(protein_sequences):
    try:
        result_handle = NCBIWWW.qblast("blastp", "nr", "\n".join(str(seq) for seq in protein_sequences.values()))
        blast_records = NCBIXML.parse(result_handle)
        return blast_records
    except ValueError as ve:
        print("Error during BLAST:", ve)
        return []
def main(gff_file_path, genome_file_path, output_fasta_file_path):
    genome_file = SeqIO.to_dict(SeqIO.parse(genome_file_path, "fasta"))
    protein_sequences = extract_protein_sequences(gff_file_path, genome_file)
    blast_records = run_blast(protein_sequences)
    with open(output_fasta_file_path, "w") as output_file:
        output_file.write("Protein_ID\tDescription\tE-value\n")
        for record in blast_records:
            if record.alignments:
                alignment = record.alignments[0]
                title = alignment.title.split(" ", 1)
                description = title[1] if len(title) > 1 else ""
                e_value = alignment.hsps[0].expect
                output_file.write(f"{record.query}\t{description}\t{e_value}\n")
if __name__ == "__main__":
    gff_file = "/home/share/Genome_Annotation.gff"
    genome_file = "/home/share/Genome.fasta"
    output_fasta_file = "/home/share/Predicted_proteins.fasta"
    main(gff_file, genome_file, output_fasta_file)
Thanks a lot for helping biofalconch ! Do you think this will also make a correct header for the fasta file in order to be able to get nice visualisation in IGV?
I ran the code and I got this output, it is the same for every single line.
Any ideas on how to fix this?
If I recall correctly the fasta will have the
trasnscript_idas headerI think it might have to do how the gene and transcript entries don't have transcript_id or gene_id on the attribute field, just the names. You could try this to add them:
Just note that since I don't have the actual gff I have no idea if this will work :)
Thanks a lot for the help again!
I was trying something else in a different file with this code:
This gave me 3 output files: .aa, .codingseq and .mrna. The .aa file contained all the proteinsequences with headers like >g1.t1 up untill the last one which is something like >g11763.t1
I tried running this in blast but the file was too large, so next step I will try is running a local blast.I did try it with the first 2 sequences tho. This gave me homologs but when turning it into a .BED file, the .BED file did not localize with the .gff file and referencegenomen.fasta in IGV. So I still think that there is something wrong with the header or the information that it gives to the .BED file.