Question: Assembly and read comparison using kmers
4 weeks ago by
beginner_problem10 wrote:

I am comparing assemblies against their reads, and I have noticed something quite confusing.

I extracted kmers from an assembly file and from the corresponding reads (downloaded from the NCBI Assembly and SRA databases), and when I compare them, some kmers are present in the assembly but not in the reads.

So I am wondering if this is possible, and if yes, how?

sequence assembly • 156 views
modified 4 weeks ago • written 4 weeks ago by beginner_problem10

OK, yes, you are right about that. I did not think about this case.

However, I assumed modern assemblers would only assemble highly covered regions, in which case the kmers in between (ormati in your example) should also be contained in one of the reads.

written 4 weeks ago by beginner_problem10

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized. This belongs under @Wouter's answer.

SUBMIT ANSWER is for new answers to the original question.

written 4 weeks ago by genomax73k

Well, I hope they do correspond to the same set of reads used for the assembly. I just searched through NCBI; let's take this one for example:

From the "Assembly" database I got the .fna files, and from the SRA I got the read files. Shouldn't these reads correspond to the ones used for the assembly?

written 4 weeks ago by beginner_problem10
4 weeks ago by
WouterDeCoster41k wrote:

I think that would be possible yeah. I can't tell you the odds, but possible.

As an extreme oversimplification, say that I have two reads: Bioinform and rmatics

You could assemble that to Bioinformatics

In the assembly, there is now the kmer ormati which is in neither of the reads.
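
To make the toy example concrete, here is a minimal Python sketch. The helper name `kmers` and the choice of k = 6 are only illustrative assumptions; a real analysis would use a dedicated kmer counter on DNA sequences.

```python
def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

k = 6
reads = ["Bioinform", "rmatics"]
assembly = "Bioinformatics"

# Pool the kmers of all reads, then subtract them from the assembly kmers.
read_kmers = set().union(*(kmers(r, k) for r in reads))
missing = kmers(assembly, k) - read_kmers

print(sorted(missing))  # ['format', 'nforma', 'ormati']
```

The two reads overlap by only two letters ("rm"), so every 6-mer that spans the junction and needs more than two letters from each side is absent from both reads. With k = 3 the same overlap is long enough (k - 1 = 2) and no assembly kmer is missing.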

written 4 weeks ago by WouterDeCoster41k
4 weeks ago by
Corentin430 wrote:

In addition to what WouterDeCoster mentioned, it is also possible that these kmers correspond to misassemblies. In general, I try to keep the number of kmers found in the assembly but not in the reads as low as possible.

Are you using different sets of reads for your assembly? If yes, these kmers could also correspond to a region assembled from other reads.
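
A rough sketch of how that number could be measured, and how it changes when a second read set is pooled in. Everything here is a hypothetical illustration: the function names, the tiny sequences and k = 4 are made up, and comparing canonical kmers (each kmer and its reverse complement treated as the same) is my own assumption, made because reads can come from either strand.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a DNA string made of A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]

def canonical_kmers(seq, k):
    """Set of canonical kmers: the smaller of each kmer and its reverse complement."""
    out = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):  # skip kmers containing N or other symbols
            out.add(min(kmer, revcomp(kmer)))
    return out

def missing_fraction(assembly_seqs, read_sets, k):
    """Fraction of assembly kmers not seen in any of the given read sets."""
    asm = set().union(*(canonical_kmers(s, k) for s in assembly_seqs))
    seen = set()
    for reads in read_sets:
        for read in reads:
            seen |= canonical_kmers(read, k)
    return len(asm - seen) / len(asm) if asm else 0.0

assembly = ["ACGGTCA"]
run_1 = ["ACGGT"]   # covers the left part of the contig
run_2 = ["TGACC"]   # reverse-strand read covering the right part

print(missing_fraction(assembly, [run_1], 4))         # 0.5
print(missing_fraction(assembly, [run_1, run_2], 4))  # 0.0
```

With only the first run, half of the assembly kmers look "missing"; once the second run is pooled in, none are. For real data the sequences would be read from the .fna and FASTQ files, and a dedicated kmer counter would be far more memory-efficient than Python sets.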

written 4 weeks ago by Corentin430

Well, I hope the reads used for the assembly correspond to the set. What I did is find a BioSample on NCBI, like this one, and download the linked assembly .fna file and the read files from the linked SRA entry. So these reads should have been the only ones used for the assembly, or am I wrong in this assumption?

written 4 weeks ago by beginner_problem10

Not necessarily. Depending on the genome size, its complexity, and the project's budget, more than one library may have been used for the assembly (sometimes from different technologies as well, for example Illumina + PacBio).

You should have a BioProject ID associated with your reads and assembly ("PRJEA31233" in your example), which should give you more information about the project.

You can still use only one library if it covers most of the genome (but it also depends on what you want to do and how accurate you need to be).

written 4 weeks ago by Corentin430

Thank you for your reply. But if other read sets were used for the assembly, shouldn't they also be linked to the project? I checked all the links, but I usually find only one, or maybe two, read runs linked to a given sample.

I am very picky about this because I need to be as accurate as possible for my evaluation, so I want to limit the number of assembly kmers that do not exist in the read set.

written 12 days ago by beginner_problem10

Yes, everything should be linked to the project.

Don't forget WouterDeCoster's answer: not all of these kmers are necessarily misassemblies.

written 8 days ago by Corentin430