Question: Roary producing erroneous plots?
Lesley Sitter (Netherlands) wrote, 18 months ago:

Hi there, been using Roary for 3 years now... awesome tool!

I'm just curious why the plots it produces aren't single values. For example, the "Number of unique genes" plot should just show one fixed value, the number of unique genes, am I right? So why is it a collection of box-and-whisker plots, one per genome, that sometimes even has outlier points? How can it have a range of values per genome unless it calculates them for different nucleotide identity values or something? It's not as if it also produces a range of gene_presence_absence.csv files.

I'm just curious if anyone knows what these plot ranges are based on

EDIT: I looked all over, but there is no explanation anywhere of where the ranges come from. So I ended up writing an extensive R script that calculates the values directly from the gene_presence_absence.csv file. I would still be interested in knowing the reason for this weird output, if anyone knows.

Tags: roary, R

I've used roary a fair bit. None of the plots seemed unusual to me, but I'm struggling to picture them now. Can you show an example of the plot you mean specifically?

Reply by Joe, 18 months ago

For example, this is the default plot you get for new genes per genome. It's a box-and-whisker plot, meaning that each genome has a "range" of new genes, which is of course totally absurd unless there is some sort of "threshold" over which Roary analyzes these new genes (for example, a range of different identity scores). [Image: New genes per genome]

But the troubling one for me was this one, the "unique" genes per genome plot. Unique is singular, so I'm really confused about where this range comes from. [Image: Unique genes]

Reply by Lesley Sitter, 18 months ago

And when I convert gene_presence_absence.csv to a binary matrix, find the rows (orthologous groups) that have an entry in only one genome, and then count those rows per genome, my plot looks nothing like this. So even the values the plots should represent, based on the presence/absence matrix, are not in these plots :S


Reply by Lesley Sitter, 18 months ago
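For readers who want to reproduce the count described in the reply above, here is a minimal Python sketch (the poster's own script was in R and is not shown in the thread). It counts, per genome, the gene clusters that have an entry in that genome only. The function name, the `n_meta_cols` parameter and the toy table are all hypothetical; in particular, the number of leading annotation columns in a real gene_presence_absence.csv is an assumption you should check against your own file.

```python
import csv
import io


def unique_genes_per_genome(csv_text, n_meta_cols=14):
    """Count gene clusters present in exactly one genome, per genome.

    n_meta_cols is an ASSUMPTION: the number of leading annotation
    columns before the per-genome columns start. Verify it against
    the header of your own gene_presence_absence.csv.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    genomes = header[n_meta_cols:]
    counts = {genome: 0 for genome in genomes}
    for row in reader:
        cells = row[n_meta_cols:]
        # A non-empty cell means the cluster has a member in that genome.
        present = [g for g, cell in zip(genomes, cells) if cell.strip()]
        if len(present) == 1:
            counts[present[0]] += 1
    return counts


# Toy table with a single metadata column (made-up data, not Roary output).
toy_csv = """Gene,A,B,C
g1,a1,b1,c1
g2,a2,,
g3,,b3,
g4,a4,,
g5,a5,b5,
"""

print(unique_genes_per_genome(toy_csv, n_meta_cols=1))
# {'A': 2, 'B': 1, 'C': 0}
```

This gives exactly one number per genome, which is why it cannot look like a box plot: there is no ordering or resampling involved.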
Joe (United Kingdom) wrote, 18 months ago:

I spoke to Andrew, the lead dev for the tool.

They are box plots because the impact each new genome has on the size of the core/accessory genome depends on the order in which the genomes are considered. So all the genomes are randomly sampled N times, each time in a different order, and the impact they have is shown in the plots as a box plot/distribution.
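To illustrate why the order matters, here is a small Python sketch of the idea. It is not Roary's actual code: the function name, the `n_permutations` and `seed` parameters, and the toy gene sets are all made up for illustration, and it assumes each box summarises the k-th genome added across the shuffles. The genomes are shuffled several times, and for each position in the order we record how many genes that genome adds that were not already seen.

```python
import random
import statistics


def new_genes_by_position(genomes, n_permutations=10, seed=0):
    """genomes: dict mapping genome name -> set of gene cluster ids.

    Returns one list per position in the ordering, holding the number
    of previously unseen genes contributed at that position in each
    random ordering. n_permutations and seed are hypothetical knobs.
    """
    rng = random.Random(seed)
    names = sorted(genomes)
    per_position = [[] for _ in names]
    for _ in range(n_permutations):
        order = names[:]
        rng.shuffle(order)
        seen = set()
        for position, name in enumerate(order):
            # Genes in this genome that no earlier genome contributed.
            per_position[position].append(len(genomes[name] - seen))
            seen |= genomes[name]
    return per_position


# Toy presence/absence data (made up, not Roary output).
toy_genomes = {
    "A": {"g1", "g2", "g4", "g5"},
    "B": {"g1", "g3", "g5"},
    "C": {"g1"},
}

result = new_genes_by_position(toy_genomes)
for position, counts in enumerate(result, start=1):
    # min / median / max are the raw ingredients of a box plot.
    print(position, min(counts), statistics.median(counts), max(counts))
```

In this toy data the first genome added contributes 4, 3 or 1 new genes depending on whether A, B or C happens to come first, so a single position already has a spread of values even though the presence/absence matrix itself is fixed.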
