Question

Extract CDS from maker gff

0

3.1 years ago

Sarah ▴ 60

Hello,

I have annotated several genomes using the Maker2 pipeline with the goal of estimating dN/dS ratios for many genes. I have the GFF files, and I would like to extract just the coding sequences into a FASTA file. Until now I have been using the fasta_merge script that comes with Maker, but I just noticed that the nucleotide sequences it outputs include the 5' and 3' UTRs and are not always in the correct reading frame, which causes alignment problems in downstream steps. Is there a script similar to fasta_merge that outputs CDS sequences in frame (trimming incomplete codons) and without UTRs?

Thank you!

CDS maker2 annotation gff • 2.3k views

ADD COMMENT • link updated 16 months ago by lieven.sterck 15k • written 3.1 years ago by Sarah ▴ 60

1

3.1 years ago

prasundutta87 ▴ 670

This tool will be helpful: https://gffutils.readthedocs.io/en/v0.9.1/GTF_extract.html

You can try the "--feature=FEATURE_TYPE" parameter to restrict the output to CDS features.

ADD COMMENT • link 3.1 years ago by prasundutta87 ▴ 670

0

Thank you!

ADD REPLY • link 3.1 years ago by Sarah ▴ 60

score 3 · Accepted Answer · 2021-06-28

3

3.1 years ago

lieven.sterck 15k

Have a look at the AGAT package. It certainly includes a script for extracting features from a GFF file.

more info here and specifically this one

ADD COMMENT • link 3.1 years ago by lieven.sterck 15k

1

Thank you so much! agat_sp_extract_sequences.pl looks like exactly what I need!

ADD REPLY • link 3.1 years ago by Sarah ▴ 60

1

Just in case anybody else comes across this post with the same problem: this worked great, but the GFF output from Maker2 needed to be filtered first. Parsing the unfiltered Maker2 GFF was extremely slow (it took more than 53 hours), whereas the filtered GFF took only 3 minutes. I needed to remove the entries that Maker2 includes but that are not needed for sequence extraction (match, match_part, expressed_sequence_match, protein_match, contig).

For example:

grep -Pv "\tmatch_part\t" Maker2_output.gff | grep -Pv "\tprotein_match\t" | grep -Pv "\texpressed_sequence_match\t" | grep -Pv "\tmatch\t" | grep -Pv "\tcontig\t" > streamlined_for_AGAT.gff
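For anyone who prefers a script to the grep chain, here is a minimal Python sketch of the same filtering idea. It is not part of Maker or AGAT; the function name `filter_gff` and the constant `UNWANTED_TYPES` are made up for illustration. One difference from the grep version: it only looks at the feature type in column 3, whereas grep matches the pattern in any tab-delimited column.

```python
# Feature types dropped by the grep pipeline above (GFF column 3).
UNWANTED_TYPES = {
    "match",
    "match_part",
    "protein_match",
    "expressed_sequence_match",
    "contig",
}


def filter_gff(lines, unwanted=UNWANTED_TYPES):
    """Yield GFF lines whose feature type (column 3) is not in `unwanted`.

    Comment/directive lines and anything that is not a 9-column feature
    line (for example an embedded FASTA section) pass through unchanged.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        is_feature = not line.startswith("#") and len(fields) == 9
        if is_feature and fields[2] in unwanted:
            continue
        yield line


# Example usage (file names taken from the commands in this post):
# with open("Maker2_output.gff") as src, open("streamlined_for_AGAT.gff", "w") as dst:
#     dst.writelines(filter_gff(src))
```

Because it streams line by line, it should cope with large Maker2 GFF files without loading them into memory.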

The genome FASTA also needed to be folded (wrapped to fixed-width lines) to avoid errors:

fold genome.fasta > genome_folded.fasta
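One thing to watch with `fold`: it wraps every line longer than its default width of 80 columns, header lines included, so a long FASTA header would get split. Below is a rough Python sketch that wraps only the sequence lines and leaves headers alone. The function name `wrap_fasta` is hypothetical, and the sketch holds one record at a time in memory, which is assumed to be acceptable for the genome at hand.

```python
def wrap_fasta(lines, width=80):
    """Yield FASTA lines with sequences re-wrapped to `width` columns.

    Header lines (starting with '>') are emitted unchanged; blank lines
    are dropped.
    """
    seq_chunks = []

    def flush():
        # Emit the sequence collected so far in fixed-width pieces.
        seq = "".join(seq_chunks)
        seq_chunks.clear()
        for i in range(0, len(seq), width):
            yield seq[i:i + width] + "\n"

    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            yield from flush()
            yield line + "\n"
        elif line:
            seq_chunks.append(line)
    yield from flush()


# Example usage (file names taken from the command above):
# with open("genome.fasta") as src, open("genome_folded.fasta", "w") as dst:
#     dst.writelines(wrap_fasta(src))
```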

Then I could run:

agat_sp_extract_sequences.pl --gff streamlined_for_AGAT.gff -f genome_folded.fasta -t cds --remove_orf_offset -o output_cds.fasta

ADD REPLY • link 3.1 years ago by Sarah ▴ 60

0

This code has been helpful, but I have a question. After running these lines I have output_cds.fasta, but only a few of the sequences begin with a start codon. Is this normal?

ADD REPLY • link 16 months ago by mrmrwinter ▴ 30

0

Hmm, no and yes...

In the ideal (theoretical) case: no, every CDS should start with a start codon.

In real life: yes, it is possible. Maker is known to predict genes without a start codon (for reasons that would take us too far here), so you can end up with CDS lacking a start.

But this is determined solely by the information in the GFF file; AGAT does not add or change anything here. So the only way to resolve it is to amend the GFF so that the predictions begin at a start codon position.

ADD REPLY • link 16 months ago by lieven.sterck 15k
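To see how widespread the missing-start issue is in a given output file, a quick tally can help. The following is a minimal Python sketch, not anything AGAT provides; `read_fasta` and `summarize_cds` are hypothetical helper names, and it assumes the standard ATG start codon (alternative start codons would be counted as missing).

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def summarize_cds(records, start_codon="ATG"):
    """Count records that begin with `start_codon` and whose length is a multiple of 3."""
    total = with_start = multiple_of_three = 0
    for _, seq in records:
        total += 1
        with_start += seq.upper().startswith(start_codon)
        multiple_of_three += len(seq) % 3 == 0
    return {
        "total": total,
        "with_start": with_start,
        "multiple_of_three": multiple_of_three,
    }


# Example usage (file name taken from the AGAT command earlier in the thread):
# with open("output_cds.fasta") as handle:
#     print(summarize_cds(read_fasta(handle)))
```

The headers of the records without a start codon can then be traced back to the corresponding gene models in the GFF, which is where any correction would have to happen.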