Question

Unique genes roary

3.6 years ago

Sissi ▴ 60

Hi there,

I'm running a pangenome analysis on 20 bacterial strains using Roary version 3.13 on a server running CentOS 7. I want to get the genes that are present in only one of those 20 strains but, as I've already asked here, I cannot get them.

When running:

query_pan_genome -a difference --input_set_one 1.gff --input_set_two 2.gff 3.gff 4.gff .... -g clustered_proteins

I get a CSV file with some clusters that are supposed to be unique to strain1, but they are not! If I retrieve the sequence using the sqlite3 db suggested here and BLAST it, I find a perfect match in one of the other 19 strains (the reference one, by the way). Moreover, these genes are properly annotated in the reference (e.g. short-chain dehydrogenase), while in the CSV they are "hypothetical protein" (but that is probably a Prokka annotation failure). I also tried selecting the single-strain entries from the clustered_proteins file, as suggested here, but still got wrong ones. After reading this other issue, I tried the -s option, but I just got fewer "unique" clusters, and still wrong ones. What's the problem? Is Roary really supposed to behave like this?
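
For anyone wanting to cross-check the query_pan_genome output independently, here is a minimal sketch rather than an official Roary recipe. It assumes the run produced the tab-delimited 0/1 matrix gene_presence_absence.Rtab, with a header row of plain sample names after a first "Gene" column. The function name singletons_for is made up for illustration.

```shell
#!/usr/bin/env bash
# List clusters that are present in exactly one sample, and only in the named sample.
# Assumed input: tab-delimited 0/1 matrix, header "Gene<TAB>sample1<TAB>sample2...",
# then one row per cluster.
singletons_for() {
    local rtab="$1" sample="$2"
    awk -F'\t' -v sample="$sample" '
        NR == 1 {
            # Locate the column that belongs to the requested sample.
            for (i = 2; i <= NF; i++) if ($i == sample) col = i
            if (!col) { print "sample not found: " sample > "/dev/stderr"; exit 1 }
            next
        }
        {
            # Count how many samples carry this cluster.
            total = 0
            for (i = 2; i <= NF; i++) total += $i
            if (total == 1 && $col == 1) print $1
        }
    ' "$rtab"
}
```

Usage would be something like `singletons_for gene_presence_absence.Rtab strain1`. This only restates what the clustering already decided, so it cannot detect genes that were split into separate clusters or missed by the annotation.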

Thanks, Silvia

pangenome comparative genomics roary • 3.0k views

ADD COMMENT • link 3.6 years ago by Sissi ▴ 60

Hi Sissi,

just to understand: since you mentioned Prokka, did you use it to re-annotate the genomes? If so, that could be the problem. You basically have two different annotations of each genome: one from Prokka and one from the NCBI Prokaryotic Genome Annotation Pipeline.

ADD REPLY • link 3.6 years ago by andres.firrincieli 3.6k

Hi Andres, Thank you for your reply.

I first downloaded the FASTA files from NCBI Genome (this is my reference, as an example) and then ran Prokka:

for k in *.fasta; do prokka "$k" --outdir /home/silvia/Pseudomonas/prokka/"$k" --kingdom Bacteria --genus Pseudomonas; done

This is because I'm also using a newly sequenced strain that is not deposited yet, and thus I wanted to start from the same annotation. Then I ran Roary in two ways:

    roary -f roary-output -e --mafft -n -r *.gff
    roary -f roary-output2 -e --mafft -n -r -s *.gff

And this brought me to the problem above.

ADD REPLY • link 3.6 years ago by Sissi ▴ 60

This is because I'm also using a newly sequenced strain that is not deposited yet, and thus I wanted to start from the same annotation.

This is just my opinion. If the reference genomes have already been annotated, there is no need to run the annotation again, unless you can demonstrate that your annotation pipeline is far better than the ones used for the reference genomes. You should use the GFF files from the NCBI database; that is your reference. Keep in mind that different annotation pipelines will give you different results. Therefore, if you are mainly interested in clusters occurring in only one strain, use tblastn to double-check that your genes of interest are actually missing from the other strains.
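
The tblastn double check could look roughly like the sketch below. It is an assumption-laden example, not a validated protocol: the function names, the e-value cutoff, and the 90% identity / 80% coverage thresholds are all placeholders to adjust, and coverage is only approximated as alignment length over query length.

```shell
#!/usr/bin/env bash
# Search candidate "unique" proteins against another strain's nucleotide FASTA.
# tblastn translates the subject in all six frames, so it can find a gene
# even when the annotation of that strain missed or mislabelled it.
run_tblastn() {
    local proteins="$1" genome="$2"
    tblastn -query "$proteins" -subject "$genome" -evalue 1e-10 \
        -outfmt "6 qseqid sseqid pident length qlen evalue bitscore"
}

# Read the 7-column table from run_tblastn on stdin and print the query IDs
# that have a hit with identity >= min_ident (%) and approximate query
# coverage >= min_cov (%). Thresholds are illustrative defaults.
present_elsewhere() {
    local min_ident="${1:-90}" min_cov="${2:-80}"
    awk -F'\t' -v id="$min_ident" -v cov="$min_cov" \
        '$3 >= id && 100 * $4 / $5 >= cov { print $1 }' | sort -u
}
```

For example, `run_tblastn strain1_unique.faa strainY.fasta | present_elsewhere 90 80` would list the supposedly unique proteins that do have a close match in strainY (both file names are hypothetical).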

ADD REPLY • link 3.6 years ago by andres.firrincieli 3.6k

Following your suggestion, I tried using the NCBI GFF files together with the Prokka GFF file for the newly sequenced strain, but first got:

Input file contains duplicate gene IDs, attempting to fix by adding a unique suffix, new GFF in the fixed_input_files directory

and then stopped:

 Use of uninitialized value in require at /usr/local/lib64/perl5/Encode.pm line 70.
Saving 7 x 7 in image
geom_path: Each group consists of only one observation. Do you need to adjust
the group aesthetic?
geom_path: Each group consists of only one observation. Do you need to adjust
the group aesthetic?
Saving 7 x 7 in image
geom_path: Each group consists of only one observation. Do you need to adjust
the group aesthetic?
geom_path: Each group consists of only one observation. Do you need to adjust
the group aesthetic?
Use of uninitialized value in require at (eval 1523) line 1.
Illegal division by zero at /usr/local/share/perl5/Bio/Roary/External/GeneAlignmentFromNucleotides.pm line 46.

So I removed the single Prokka GFF, but Roary still does not accept the NCBI GFF files:

All input files have been excluded from analysis. Please check you have valid GFF files, with annotation and a FASTA sequence at the end. Better still, reannotate your FASTA file with PROKKA. at /usr/local/share/perl5/Bio/Roary/CommandLine/Roary.pm line 273.

And there are no output files.

Edit: by the way, I got the same unique-gene problem with other samples too.
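
The "duplicate gene IDs" warning above can be inspected by hand with something like the following sketch. It simply lists ID= attribute values that occur more than once across a set of GFF files; whether Roary applies exactly this test is an assumption, and the function name is invented for the example.

```shell
#!/usr/bin/env bash
# Print every ID= attribute value that appears more than once across the given GFF files.
duplicate_gff_ids() {
    awk -F'\t' '
        FNR == 1         { in_fasta = 0 }   # new file: back in the annotation section
        /^##FASTA/       { in_fasta = 1 }   # everything after this is sequence
        in_fasta || /^#/ { next }           # skip sequence, directives and comments
        NF >= 9 && match($9, /(^|;)ID=[^;]+/) {
            id = substr($9, RSTART, RLENGTH)
            sub(/^;?ID=/, "", id)
            print id
        }
    ' "$@" | sort | uniq -d
}
```

Called as `duplicate_gff_ids *.gff`, it prints nothing when every feature ID is unique across the input files.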

ADD REPLY • link 3.6 years ago by Sissi ▴ 60

I really wonder if people check the results before publishing.

Before using the NCBI GFF files, check this: https://github.com/sanger-pathogens/Roary/issues/120

ADD REPLY • link 3.6 years ago by andres.firrincieli 3.6k

OK, so:

NCBI annotation is not an option, because the newly sequenced strain does not have an NCBI annotation yet, and a combination of NCBI+Prokka of course is not working.
Prokka + roary + query_pan_genome for strainX singletons gives 38 genes that are actually present in strainY (according to blastn).
Prokka + roary with the -s option + query_pan_genome for strainX singletons gives 14 genes that are actually present in strainY (according to blastn).
Prokka run with --locustag + roary + query_pan_genome for strainX singletons gives 39 genes that are actually present in strainY (according to blastn).
Prokka run with --locustag + roary -s gives 14 genes that are actually present in strainY (according to blastn).
A pangenome analysis run on the KBase server with OrthoMCL gave 421 singletons. I've just checked some of them with blastn, and they match strainY at 100% identity.

This is amazing, really.

(Ps. I'm probably working where you got your PhD ;) )

ADD REPLY • link 3.6 years ago by Sissi ▴ 60

I'm probably working where you got your PhD

Finding my email should not be a problem then. If you contact me we can definitely solve this problem :).

NCBI+Prokka of course is not working.

This is the best option. The problem is that the GFF files from NCBI do not contain the nucleotide sequence at the end of the file; hence, you need to find a tool that converts a gbk file into a GFF format compatible with Roary.
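
As a rough illustration of what "the nucleotide sequence at the end of the file" means, the sketch below writes a GFF3 file followed by a ##FASTA directive and the genome sequence, which is the layout Prokka produces. This is only a sketch under assumptions: the function name is invented, the FASTA header IDs must match column 1 of the GFF, and appending the sequence may well not be sufficient on its own for Roary to accept an NCBI GFF (see the GitHub issue linked earlier in the thread).

```shell
#!/usr/bin/env bash
# Write a GFF3 annotation followed by a "##FASTA" directive and the genome sequence.
gff_with_fasta() {
    local gff="$1" fasta="$2"
    # Copy the annotation, stopping at any existing ##FASTA section
    # so the sequence is not included twice.
    awk '/^##FASTA/ { exit } { print }' "$gff"
    printf '##FASTA\n'
    cat "$fasta"
}
```

A call such as `gff_with_fasta genomic.gff genomic.fna > strainY_with_seq.gff` (hypothetical file names) produces the combined file on standard output.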

ADD REPLY • link 3.6 years ago by andres.firrincieli 3.6k