Question: Roary producing erroneous plots?
Lesley Sitter wrote, 22 months ago:

Hi there, been using Roary for 3 years now... awesome tool!

I'm just curious why the plots it produces don't show single values. For example, the "Number of unique genes" plot should just show one fixed value, the number of unique genes, am I right? So why is it a collection of box-and-whisker plots, one per genome, sometimes even with outlier points? How can there be a range of values per genome unless Roary calculates it for different nucleotide identity values or something? It's not as if it also produces a range of gene_presence_absence.csv files.

I'm just curious if anyone knows what these plot ranges are based on.

EDIT: I looked all over, but there is no explanation anywhere of where the ranges come from. So I ended up writing an extensive R script that just calculates the values from the gene_presence_absence.csv file, but I would still be interested in knowing the reason for this weird output if anyone knows.

Tags: roary, R

I've used Roary a fair bit. None of the plots seemed unusual to me, but I'm struggling to picture them now. Can you show an example of the plot you mean specifically?

written 22 months ago by Joe

For example, this is the default plot you get for new genes per genome. It's a whisker plot, meaning that each genome has a "range" of new genes, which is of course totally absurd unless there is some sort of "threshold" over which Roary analyzes these new genes (for example, a range of different identity scores).

[Image: New genes per genome]

But the troubling one for me was this one, the "unique" genes per genome plot. Unique is singular, so I'm really confused about where this range comes from.

[Image: Unique genes]

written 22 months ago by Lesley Sitter

And when I convert gene_presence_absence.csv to a binary matrix, find the rows that have a single entry (orthologous groups present in only one genome), and then count those rows per genome, my plot looks nothing like this. So even the values the plots should represent, based on the presence/absence matrix, are not in these plots :S
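For anyone wanting to reproduce that count, here is a minimal Python sketch (not the R script mentioned above, and not Roary's own code). It assumes the CSV has a fixed number of leading metadata columns followed by one column per genome, with an empty cell meaning the gene is absent in that genome. The default `n_meta_cols=14` is my assumption about Roary's layout, so check it against your own header. The function name and the toy data are made up for illustration.

```python
import csv
import io


def unique_genes_per_genome(csv_text, n_meta_cols=14):
    """Count, per genome, the gene clusters found in that genome only.

    n_meta_cols is the number of leading non-genome columns; the default
    of 14 is an assumption about Roary's layout, verify it on your file.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    genomes = header[n_meta_cols:]
    counts = {genome: 0 for genome in genomes}
    for row in reader:
        cells = row[n_meta_cols:]
        # A non-empty cell means the cluster is present in that genome.
        present = [g for g, cell in zip(genomes, cells) if cell.strip()]
        if len(present) == 1:
            counts[present[0]] += 1
    return counts


# Toy table with a single metadata column ("Gene") and three genomes.
toy = (
    "Gene,g1,g2,g3\n"
    "geneA,a1,a2,a3\n"
    "geneB,b1,,\n"
    "geneC,,c2,\n"
    "geneD,d1,,\n"
    "geneE,e1,e2,\n"
)
print(unique_genes_per_genome(toy, n_meta_cols=1))
# prints {'g1': 2, 'g2': 1, 'g3': 0}
```

For a real file you would read its contents with `open(path).read()` and pass that in. Counted this way there is exactly one value per genome, which is why the result looks nothing like a box plot.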


written 22 months ago by Lesley Sitter
Joe (United Kingdom) wrote, 22 months ago:

I spoke to Andrew, the lead dev for the tool.

They are box plots because the impact each new genome has on the size of the core/accessory genome depends on the order in which the genomes are considered. So all the genomes are randomly sampled N times, and the impact they have is shown in the plots as a box plot/distribution.
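To make the order dependence concrete, here is a rough Python sketch of that kind of resampling. It is not Roary's code: the function name, the toy gene sets, and the default number of permutations are all assumptions for illustration. It shuffles the genome order several times and, for each position in the order, records how many genes that genome adds that were not already seen.

```python
import random


def new_genes_by_position(genomes, n_permutations=10, seed=0):
    """Return one list of sampled 'new gene' counts per position in the order.

    genomes maps a genome name to the set of gene clusters it contains.
    n_permutations and seed are illustrative defaults, not Roary settings.
    """
    rng = random.Random(seed)
    names = sorted(genomes)
    samples = [[] for _ in names]
    for _ in range(n_permutations):
        order = names[:]
        rng.shuffle(order)
        seen = set()
        for position, name in enumerate(order):
            # Genes this genome contributes beyond those seen so far.
            new = genomes[name] - seen
            samples[position].append(len(new))
            seen |= genomes[name]
    return samples


# Hypothetical toy data: three genomes sharing some gene clusters.
toy_genomes = {
    "g1": {"A", "B", "D", "E"},
    "g2": {"A", "C", "E"},
    "g3": {"A"},
}
for position, counts in enumerate(new_genes_by_position(toy_genomes), start=1):
    print(position, counts)
```

Each position ends up with a list of values rather than one number, because the first genome in a shuffle can contribute 4, 3, or 1 genes depending on which one it happens to be. A box plot per position is just a summary of those lists.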
