Counting CDS's in Maker output
1
0
Entering edit mode
4.1 years ago
mrmrwinter ▴ 30

Hi

I have just ran Maker to annotate my genome.

where/how can i find how many CDS's it detected?

I have tried running into JBrowse with maker2jbrowse, but i cannot open the output in browser.

Thanks

annotation maker cds est gene prediction • 2.0k views
ADD COMMENT
1
Entering edit mode
4.1 years ago
Juke34 8.5k

Use GAAS and AGAT

conda install gaas agat

Then run from the folder you had run MAKER:

gaas_maker_merge_outputs_from_datastore.pl

in the output folder you will have statistics.

ADD COMMENT
0
Entering edit mode

Thanks Juke, this is exactly what i needed

ADD REPLY
0
Entering edit mode

Hi, ive ran what you siggested but do not understand the output. Which of these statistics is CDS's? Thanks

Number of matchs 259209

Number of protein_matchs 136451

Number of introns 928478

Number of match_parts 1324138

Number of single exon match 259209

Number of single exon protein_match 136451

mean introns per match 3.6

mean introns per protein_match 6.8

mean match_parts per match 5.1

mean match_parts per protein_match 9.7

Total match length 207225956

Total protein_match length 235781736

Total intron length 222493540

Total match_part length 221019217

mean match length 799

mean protein_match length 1727

mean intron length 239

mean match_part length 166

Longest matchs 30330

Longest protein_matchs 38788

Longest introns 19827

Longest match_parts 5271

Shortest matchs 1

Shortest protein_matchs 48

hortest introns -4598

Shortest match_parts 1

Apparently we have isoforms : Number of level1 features: 395660 / Number of level2 features: 1324138 We will proceed to the statistics analysis using only the mRNA with the longest cds Number of matchs 91910

Number of protein_matchs 107719

Number of match_parts 199629

Number of single exon match 91910

Number of single exon protein_match 107719

mean match_parts per match 2.2

mean match_parts per protein_match 1.9

Total match length 185553786

Total protein_match length 227698299

Total match_part length 34729218

mean match length 2018

mean protein_match length 2113

mean match_part length 173

Longest matchs 30330

Longest protein_matchs 38788

Longest match_parts 4467

Shortest matchs 42

Shortest protein_matchs 81

Shortest match_parts 9

ADD REPLY
1
Entering edit mode

You should get a folder called maker_output_processed_genomeName_evidence/ and inside a file Called maker_annotation_stat.txt. Is is that file you are looking at? Did you run anything else then gaas_maker_merge_outputs_from_datastore.pl?

ADD REPLY
0
Entering edit mode

Yes i got this file and folder

the wall of text above is the contents of maker-annotation_stats.txt

No, i didnt do anything thing else other than gaas_maker_merge_outputs_from_datastore.pl

ADD REPLY
1
Entering edit mode

Interesting, I tried the script only on MAKER3 output , maybe you used another version of MAKER. I haven't try on OSX neither, did you run it on OSX?

Anyway... All data generated by maker are gathered in the file maker_mix.gff by gaas_maker_merge_outputs_from_datastore.pl.
Similarly you can generate this file using gff3_merge from MAKER : gff3_merge -s -d genome.maker.output/genome_master_datastore_index.log > maker_mix.gff

If you want to know the number of CDS you can just do awk '{if ($3 == "CDS") a++}END{print a}' maker_mix.gff

If you want detailed statistics you must first filter maker_mix.gff file to keep only the gene models ( remove repeats, protein alignment, etc. ):
awk '{if ($2 == "maker") print $0 }' maker_mix.gff > maker_annotation.gff

and then run agat_sp_statistics.pl

ADD REPLY
0
Entering edit mode

Im running MAKER 2.31. I wasnt aware there was a MAKER3. I am installing it through bioconda though, which may be causing issues.

And no, im running it on a Centos HPC.

awk '{if ($3 == "CDS") a++}END{print a}' maker_mix.gff didnt work, and a quick grep count for "CDS" came back with 0, but the file looks like a standard gff. Some entries in column 3 are "expressed_sequence_match" or "protein match", and using the awk on those gives back a result.

awk '{if ($2 == "maker") print $0 }' maker_mix.gff > maker_annotation.gff and agat_sp_statistics.pl returned an empty file. grep -c gave no instances of maker in maker_mix.gff either.

Looks like we're getting differently formatted outputs. Ill see about installing manually rather than conda.

ADD REPLY
1
Entering edit mode

It sounds you don’t have any annotation (gene models) in this file. Could you try to generate it with gff3_merge too? If still no CDS in it it means you didn’t predict anything. To perform an evidence-based prediction you must activate est2genome or/and protein2genome parameter. For an abinitio one you must provide an Hmm profile to Augustus/snap/... parameter file and deactivate est2genome and protein2genome.

ADD REPLY
0
Entering edit mode

This is exactly what i've done....

I was so sure i'd set them i didn't think to check. Rerunning now.

Thanks for all the help

ADD REPLY
1
Entering edit mode

I have checked on output of MAKER2 (MAKER 2.31.8) it works perfectly fine for me. I don't understand... In the maker_mix.gff file generated by gaas_maker_merge_outputs_from_datastore.pl do you see any feature with maker as source (2nd line of a gff):

awk '{if($2 ~ /[a-zA-Z]+/ && $2=="maker") print $0}' maker_mix.gff.

While I never seen that it sounds you have lines like that:

chr1 maker match 993996 994793 . - . ID=a chr1 maker match_part 993996 994793 . - . ID=b

At least from MAKER 2.31.8, lines with maker as source (2nd column) are supposed to be gene models and have gene,mRNA,exon,CDS... as feature type (3rd column). Maybe you use an older version (<2.31.8)

ADD REPLY
0
Entering edit mode

I didnt set est2genome and protein2genome on (see my other comment)

grep finds no instances of maker in the maker_mix.gff file

ADD REPLY

Login before adding your answer.

Traffic: 1964 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6