I have annotated several genomes using the Maker2 pipeline with the goal of estimating dN/dS ratios for many genes. I have the gff files, and I would like to extract just the coding sequences into a fasta file. Previously I have been using the
fasta_merge script that comes with maker, but I just noticed that the nucleotide sequences that it outputs includes the 5' and 3' UTRs and they are not always in the correct reading frame, which leads to some alignment issues in downstream steps.
Is there a script similar to
fasta_merge but which will output CDS sequences in-frame (cutting off incomplete codons) and without UTRs?
Thank you so much!
agat_sp_extract_sequences.pllooks like exactly what I need!
Just in case anybody else comes across this post with the same problem: this worked great but the GFF output from Maker2 needed to be filtered first. Parsing the gff output from maker2 was extremely slow (took >53 hours), but then it took only 3 minutes on a filtered gff. I needed to remove unnecessary gff entries that maker2 had included (match, expressed_sequence_match, protein_match, contig).
And also the genome sequence needed to be folded to avoid errors:
Then I could run:
This code has been helpful but i have a question. After running these lines I have the
output_cds.fasta, but few of the sequences begin with a start codon. Is this normal?
hmmm, no and yes ...
in the perfect (theoretical) case: no all CDS should start with a start codon.
In real life case: possible, Maker is known to predict genes without start codon (due to reasons that will take us too far) , so yes in that case it is possible to end up with CDS without start.
But this is solely determined by the info from the GFF file, that is nothing that AGAT will do/change. So the only way to (re)solve this is to amend the info in the GFF file so that the predictions start at a start codon position