Tool: Compareads: Comparing Huge Metagenomic Experiments
11.2 years ago
Nico ▴ 180

Hi BioStar,

We would like to announce the open-source release of a new tool to compare huge metagenomic samples: Compareads. The goal of Compareads is to find all the similar reads between two samples and to give a similarity score based on those shared reads.

We consider that two reads (one from each sample) are similar if they share at least m non-overlapping k-mers. Compareads is designed to find those similar sequences between two samples. In a few words, given two read sets A and B, the goal of Compareads is to find the subset of reads from A which are similar to a read in B, and the subset of reads from B which are similar to a read in A.
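To make the criterion concrete, here is a naive Python sketch of the "at least m shared non-overlapping k-mers" test for a single pair of reads, using a greedy left-to-right scan. This is only an illustration of the definition: Compareads itself does not compare reads pairwise, and the k and m defaults below are placeholders, not the tool's actual parameters.

```python
def shares_m_kmers(read_a, read_b, k=3, m=2):
    """Return True if read_a shares at least m non-overlapping
    k-mers with read_b (greedy scan over read_a; any k-mer of
    read_b counts as a potential match)."""
    # Index every k-mer of read_b.
    kmers_b = {read_b[i:i + k] for i in range(len(read_b) - k + 1)}
    count, i = 0, 0
    while i <= len(read_a) - k:
        if read_a[i:i + k] in kmers_b:
            count += 1
            if count >= m:
                return True
            i += k  # jump past this k-mer so counted matches don't overlap
        else:
            i += 1
    return False
```

For example, `shares_m_kmers("AAATTTCCC", "AAACCCTTT", k=3, m=2)` is True (the reads share the non-overlapping 3-mers AAA and TTT), while `shares_m_kmers("AAATTT", "AAAGGG", k=3, m=2)` is False.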

In the publication, we show that Compareads retrieves biological information while scaling to huge datasets. Its time and memory footprint make it usable on read sets of more than 100 million Illumina reads each, running in a few hours within 4 GB of memory, and thus on today's personal computers.

Download link and PDF article: http://alcovna.genouest.org/compareads/

Looking forward to hearing your feedback,

Nicolas

metagenomics denovo next-gen • 2.6k views
11.2 years ago

One suggestion, if I may: it would help to provide a more explicit usage scenario for the tool. Comparing is a somewhat generic concept, and it is clear neither from the description above nor from the tool's own description what one would expect as the resulting information. Especially since the authors mention they know of no other tools with similar functionality, the reader is left with nothing to compare it to.

Say I have two large metagenomics samples and I run the tool. What is the output? Reads with counts in each sample? Sub-sequences found in both samples? Can I run it on randomly sheared bacterial sequences and thus perform some sort of classification with it? Can I use it to de-duplicate samples?

I would just give a simple example in the docs.

Ok, you are right, it was not really clear. I updated the post to add a little more information, and I also updated the docs with a toy example!

Thanks

11.2 years ago

It looks interesting. It addresses a real problem: how to cluster samples according to all the sequencing information, not only what is in the bio databanks? For that purpose I had personally been following the "blast all vs all" approach a couple of years ago which, as stated in the paper, is no longer a good solution in terms of computational time. I've not looked into the paper very closely yet, but I will!

I'm wondering how you deal with datasets of different sizes, though; it might be addressed in the paper, which I've only skimmed so far.


Thank you! Sample sizes do not really matter for finding which reads from A occur in B and vice versa. But they do matter when you look at the similarity score based on those shared reads.

For the moment, the similarity score is normalized by sample size, but with extremely different sizes it might not be that relevant. So we added a basic option to the software to use only the first X reads of a sample. For example, let A be a sample with 1 million reads and B one with 500 million: the software can run with only the first million reads of B and all of A.
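As an aside, the "use only the first X reads" option described above amounts to truncating the input; a minimal Python sketch of that idea (an illustrative helper, not part of Compareads, and assuming plain FASTA input) could look like this:

```python
from itertools import islice

def first_reads_fasta(path, n):
    """Yield the first n (header, sequence) records of a FASTA file,
    mimicking a 'use only the first X reads' option."""
    def records(handle):
        header, seq = None, []
        for line in handle:
            line = line.rstrip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line[1:], []
            else:
                seq.append(line)
        if header is not None:
            yield header, "".join(seq)

    with open(path) as fh:
        # islice stops reading as soon as n records have been produced,
        # so the rest of a huge file is never parsed.
        yield from islice(records(fh), n)
```

Calling `list(first_reads_fasta("B.fa", 1_000_000))` would give the first million reads of sample B without loading the remaining 499 million.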

For that purpose, we are currently studying how Compareads can subsample datasets while remaining reliable.

