Question: Assembly and read comparison using kmers
12 months ago by
beginner_problem10 wrote:

I am comparing an assembly against its reads, and I have noticed something that is quite confusing.

I extracted kmers from an assembly file and from the corresponding reads (downloaded from the NCBI Assembly and SRA databases). When I compare the two sets, I find kmers that are present in the assembly but not in the reads.

So I am wondering if this is possible, and if yes, how?

sequence assembly • 311 views

OK, yes, you are right about that. I did not think about this case.

However, I assumed that modern assemblers only assemble highly covered areas, in which case the kmers spanning the junction (ormati in your example) should also be contained in at least one of the reads.
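A minimal sketch of this assumption, using the toy reads from the example under discussion (Bioinform and rmatics, assembled to Bioinformatics). The extra read "nformati", the choice k = 6, and the helper names are hypothetical and only serve the illustration: once a read spans the junction, no assembly-only kmers remain.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def assembly_only(assembly, reads, k):
    """Kmers that occur in the assembly but in none of the reads."""
    read_kmers = set()
    for read in reads:
        read_kmers |= kmers(read, k)
    return kmers(assembly, k) - read_kmers


K = 6  # assumed kmer size for the toy example
assembly = "Bioinformatics"
sparse_reads = ["Bioinform", "rmatics"]
# Hypothetical additional read that spans the junction between the two reads.
deeper_reads = sparse_reads + ["nformati"]

print(assembly_only(assembly, sparse_reads, K))  # junction kmers are missing
print(assembly_only(assembly, deeper_reads, K))  # nothing is missing any more
```

Whether real data behaves this way depends on the coverage at each position, so this only shows when the assumption holds, not that it always does.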

written 12 months ago by beginner_problem10

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized. This belongs under @Wouter's answer.

SUBMIT ANSWER is for new answers to original question.

written 12 months ago by genomax89k

Well, I hope it does correspond to the same set of reads used for the assembly. I just searched through NCBI; let's take this one for example:

From the "Assembly" database I got the .fna files, and from the SRA I got the read files. Shouldn't these reads correspond to the ones used for the assembly?

written 12 months ago by beginner_problem10
12 months ago by
WouterDeCoster44k wrote:

I think that would be possible, yeah. I can't tell you the odds, but it is possible.

As an extreme oversimplification, say that I have two reads: Bioinform and rmatics.

You could assemble those into Bioinformatics.

In the assembly, there is now the kmer ormati which is in neither of the reads.
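The toy example above can be checked with a short sketch. The kmer size of 6 is taken from the length of "ormati"; the function name is made up for illustration.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


K = 6  # length of the kmer "ormati"
reads = ["Bioinform", "rmatics"]
assembly = "Bioinformatics"

# Union of the kmers found in any read.
read_kmers = set()
for read in reads:
    read_kmers |= kmers(read, K)

# Kmers present in the assembly but absent from every read.
junction_kmers = kmers(assembly, K) - read_kmers
print(sorted(junction_kmers))  # ['format', 'nforma', 'ormati']
```

All three kmers span the overlap between the two reads, which is why neither read contains them on its own.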

12 months ago by
Corentin450 wrote:

In addition to what WouterDeCoster mentioned, it is also possible that these kmers correspond to misassemblies. In general, I try to keep the number of kmers found in the assembly but not in the reads as low as possible.

Are you using different sets of reads for your assembly? If yes, these kmers could also correspond to a region assembled from other reads.
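One way to put a number on "kmers found in the assembly and not in the reads" is the fraction of distinct assembly kmers that no read contains. The sketch below is only an illustration under simplifying assumptions: sequences are plain in-memory strings, the function names and toy sequences are invented, and strand is ignored (reads are compared as given, without reverse complements).

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def missing_fraction(assembly_seqs, read_seqs, k):
    """Fraction of distinct assembly kmers that occur in no read."""
    assembly_kmers = set()
    for contig in assembly_seqs:
        assembly_kmers |= kmers(contig, k)
    read_kmers = set()
    for read in read_seqs:
        read_kmers |= kmers(read, k)
    if not assembly_kmers:
        return 0.0  # nothing to compare, e.g. all contigs shorter than k
    return len(assembly_kmers - read_kmers) / len(assembly_kmers)


# Invented toy data: one contig and two reads that overlap by three bases.
contigs = ["ACGTTGCAAG"]
reads = ["ACGTTG", "TTGCAAG"]
print(missing_fraction(contigs, reads, 5))
```

A value near zero means almost every assembly kmer is backed by at least one read. A higher value can point to misassemblies, to low coverage at junctions, or to reads from a library that is not in the comparison.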


Well, I hope the reads used for the assembly correspond to this set. What I did was find a BioSample on NCBI, like this one, and download the linked assembly .fna file and the read files from the linked SRA entry. So these reads should have been the only ones used for the assembly, or am I wrong in this assumption?

written 12 months ago by beginner_problem10

Not necessarily. Depending on the genome size, its complexity and the project's budget, more than one library can be used for the assembly (sometimes from different technologies as well, for example Illumina + PacBio).

You should have a BioProject ID associated with your reads and assembly ("PRJEA31233" in your example), which should give you more information about the project.

You can still use only one library if it covers most of the genome (but it also depends on what you want to do and how accurate you need to be).

written 12 months ago by Corentin450

Thank you for your reply. But if other read sets were used for the assembly, shouldn't they also be linked to the project? I checked all the links, but I usually find only one, or maybe two, read runs linked to a given sample.

I am very picky about this because I need to be as accurate as possible for my evaluation, so I want to limit the number of assembly kmers that do not exist in the read set.

written 11 months ago by beginner_problem10

Yes, everything should be linked to the project.

Don't forget WouterDeCoster's answer: not all of these kmers are necessarily misassemblies.

written 11 months ago by Corentin450