Question: Counting CDS's in Maker output
0
gravatar for mrmrwinter
12 weeks ago by
mrmrwinter0
University of Hull
mrmrwinter0 wrote:

Hi

I have just ran Maker to annotate my genome.

where/how can i find how many CDS's it detected?

I have tried running into JBrowse with maker2jbrowse, but i cannot open the output in browser.

Thanks

ADD COMMENTlink modified 12 weeks ago by Juke344.1k • written 12 weeks ago by mrmrwinter0
1
gravatar for Juke34
12 weeks ago by
Juke344.1k
Sweden
Juke344.1k wrote:

Use GAAS and AGAT

conda install gaas agat

Then run from the folder you had run MAKER:

gaas_maker_merge_outputs_from_datastore.pl

in the output folder you will have statistics.

ADD COMMENTlink written 12 weeks ago by Juke344.1k

Thanks Juke, this is exactly what i needed

ADD REPLYlink written 12 weeks ago by mrmrwinter0

Hi, ive ran what you siggested but do not understand the output. Which of these statistics is CDS's? Thanks

Number of matchs 259209

Number of protein_matchs 136451

Number of introns 928478

Number of match_parts 1324138

Number of single exon match 259209

Number of single exon protein_match 136451

mean introns per match 3.6

mean introns per protein_match 6.8

mean match_parts per match 5.1

mean match_parts per protein_match 9.7

Total match length 207225956

Total protein_match length 235781736

Total intron length 222493540

Total match_part length 221019217

mean match length 799

mean protein_match length 1727

mean intron length 239

mean match_part length 166

Longest matchs 30330

Longest protein_matchs 38788

Longest introns 19827

Longest match_parts 5271

Shortest matchs 1

Shortest protein_matchs 48

hortest introns -4598

Shortest match_parts 1

Apparently we have isoforms : Number of level1 features: 395660 / Number of level2 features: 1324138 We will proceed to the statistics analysis using only the mRNA with the longest cds Number of matchs 91910

Number of protein_matchs 107719

Number of match_parts 199629

Number of single exon match 91910

Number of single exon protein_match 107719

mean match_parts per match 2.2

mean match_parts per protein_match 1.9

Total match length 185553786

Total protein_match length 227698299

Total match_part length 34729218

mean match length 2018

mean protein_match length 2113

mean match_part length 173

Longest matchs 30330

Longest protein_matchs 38788

Longest match_parts 4467

Shortest matchs 42

Shortest protein_matchs 81

Shortest match_parts 9

ADD REPLYlink modified 12 weeks ago • written 12 weeks ago by mrmrwinter0
1

You should get a folder called maker_output_processed_genomeName_evidence/ and inside a file Called maker_annotation_stat.txt. Is is that file you are looking at? Did you run anything else then gaas_maker_merge_outputs_from_datastore.pl?

ADD REPLYlink written 12 weeks ago by Juke344.1k

Yes i got this file and folder

the wall of text above is the contents of maker-annotation_stats.txt

No, i didnt do anything thing else other than gaas_maker_merge_outputs_from_datastore.pl

ADD REPLYlink modified 12 weeks ago • written 12 weeks ago by mrmrwinter0
1

Interesting, I tried the script only on MAKER3 output , maybe you used another version of MAKER. I haven't try on OSX neither, did you run it on OSX?

Anyway... All data generated by maker are gathered in the file maker_mix.gff by gaas_maker_merge_outputs_from_datastore.pl.
Similarly you can generate this file using gff3_merge from MAKER : gff3_merge -s -d genome.maker.output/genome_master_datastore_index.log > maker_mix.gff

If you want to know the number of CDS you can just do awk '{if ($3 == "CDS") a++}END{print a}' maker_mix.gff

If you want detailed statistics you must first filter maker_mix.gff file to keep only the gene models ( remove repeats, protein alignment, etc. ):
awk '{if ($2 == "maker") print $0 }' maker_mix.gff > maker_annotation.gff

and then run agat_sp_statistics.pl

ADD REPLYlink modified 12 weeks ago • written 12 weeks ago by Juke344.1k

Im running MAKER 2.31. I wasnt aware there was a MAKER3. I am installing it through bioconda though, which may be causing issues.

And no, im running it on a Centos HPC.

awk '{if ($3 == "CDS") a++}END{print a}' maker_mix.gff didnt work, and a quick grep count for "CDS" came back with 0, but the file looks like a standard gff. Some entries in column 3 are "expressed_sequence_match" or "protein match", and using the awk on those gives back a result.

awk '{if ($2 == "maker") print $0 }' maker_mix.gff > maker_annotation.gff and agat_sp_statistics.pl returned an empty file. grep -c gave no instances of maker in maker_mix.gff either.

Looks like we're getting differently formatted outputs. Ill see about installing manually rather than conda.

ADD REPLYlink modified 12 weeks ago • written 12 weeks ago by mrmrwinter0
1

It sounds you don’t have any annotation (gene models) in this file. Could you try to generate it with gff3_merge too? If still no CDS in it it means you didn’t predict anything. To perform an evidence-based prediction you must activate est2genome or/and protein2genome parameter. For an abinitio one you must provide an Hmm profile to Augustus/snap/... parameter file and deactivate est2genome and protein2genome.

ADD REPLYlink written 12 weeks ago by Juke344.1k

This is exactly what i've done....

I was so sure i'd set them i didn't think to check. Rerunning now.

Thanks for all the help

ADD REPLYlink written 12 weeks ago by mrmrwinter0
1

I have checked on output of MAKER2 (MAKER 2.31.8) it works perfectly fine for me. I don't understand... In the maker_mix.gff file generated by gaas_maker_merge_outputs_from_datastore.pl do you see any feature with maker as source (2nd line of a gff):

awk '{if($2 ~ /[a-zA-Z]+/ && $2=="maker") print $0}' maker_mix.gff.

While I never seen that it sounds you have lines like that:

chr1 maker match 993996 994793 . - . ID=a chr1 maker match_part 993996 994793 . - . ID=b

At least from MAKER 2.31.8, lines with maker as source (2nd column) are supposed to be gene models and have gene,mRNA,exon,CDS... as feature type (3rd column). Maybe you use an older version (<2.31.8)

ADD REPLYlink written 12 weeks ago by Juke344.1k

I didnt set est2genome and protein2genome on (see my other comment)

grep finds no instances of maker in the maker_mix.gff file

ADD REPLYlink written 12 weeks ago by mrmrwinter0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1212 users visited in the last hour