Question: Comparing repetitiveness between 2 assemblies of the same dataset
Adrian Pelin wrote, 4.1 years ago:

Hello everyone,

As most of you know, when assembling polymorphic genomes, sometimes assemblers generate various haplotypes of short regions. This ends up overestimating the haploid genome size and the assembly contains many small contigs corresponding to haplotypes.

I have estimated the genome size of my organism using k-mer counting, and the estimated size was 5.0-6.0 Mb. Now I have two assemblies, one that is 7.9 Mb and one that is 5.6 Mb. Although no doubt the bigger assembly has some unique sequence to it, is there any way to test that it contains more redundancy than the smaller assembly?
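The genome-size estimate mentioned above is typically derived from a k-mer histogram of the raw reads: total k-mers divided by the depth of the main (homozygous) coverage peak. A minimal sketch of that calculation, with illustrative numbers (the function name and toy histogram are mine, not from the post):

```python
# Sketch of the standard k-mer genome-size estimate from a histogram of
# (depth, number of distinct k-mers) rows. Low-depth rows, which are
# dominated by sequencing errors, are excluded before summing.

def estimate_genome_size(histo, min_depth=2):
    """Estimate genome size as total k-mers / depth of the main peak."""
    rows = [(d, n) for d, n in histo if d >= min_depth]
    total_kmers = sum(d * n for d, n in rows)
    peak_depth = max(rows, key=lambda dn: dn[1])[0]  # depth of the coverage peak
    return total_kmers / peak_depth

# Toy histogram: an error spike at depth 1 and a coverage peak at depth 20.
toy = [(1, 500000), (19, 40000), (20, 250000), (21, 40000), (40, 5000)]
```

Here the total (excluding depth-1 errors) is 6,800,000 k-mers at a peak depth of 20, giving a 340 kb estimate for the toy data.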

I thought that splitting the assembly FASTA files into small k-mers and calculating their abundance might be a sound approach, so here is what I did with the two assemblies:
> gdr18 <- read.table("gdr18_k75-85_NHC_kmer-k23.histo")
> corn <- read.table("CornmanAssembly_kmer-k23.histo")
> # column 1: copy number, column 2: distinct k-mers at that copy number
> N <- length(gdr18[,1])
> tablegdr18 <- numeric(N)
> for(i in 1:N){
+ tablegdr18[i] <- gdr18[i,1] * gdr18[i,2]
+ }
> N <- length(corn[,1])
> tablecorn <- numeric(N)
> for(i in 1:N){
+ tablecorn[i] <- corn[i,1] * corn[i,2]
+ }
> head(tablegdr18)
[1] 5267262  244536   81888   30368   15220   11034
> head(tablecorn)
[1] 5208491 1065576  650718  249380  142835   96012
> sum(tablegdr18)
[1] 5678956
> sum(tablecorn)
[1] 7739989
> sum(tablegdr18[2:length(tablegdr18)])
[1] 411694
> sum(tablecorn[2:length(tablecorn)])
[1] 2531498

So the Cornman assembly is 7.9 Mb and my assembly is 5.6 Mb. I was asked by a reviewer why my assembly predicts a genome size that is so much smaller, and I think it is because I used a very high k-mer size when assembling, to discourage haplotype building. From the data above, it looks to me like a similar number of k-mers in both assemblies are in single copy, but the Cornman assembly has about 2.5 Mb of repetitive sequence. I just ran a quick script to multiply the two columns of the .histo file.
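The per-row products printed by head() above can be summed directly to reproduce this comparison. A compact sketch (function name mine; the six values per assembly are the ones printed in the transcript, so the totals here cover only those rows, while the full files give the 5,678,956 and 7,739,989 totals reported):

```python
# Summarize a list of per-copy-number base counts, where
# products[i] = (copy number i+1) * (distinct k-mers at that copy number),
# i.e. the values stored in tablegdr18 / tablecorn above.

def summarize(products):
    total = sum(products)             # approximate assembly size in bases
    repetitive = sum(products[1:])    # rows with copy number >= 2
    return total, repetitive

# First six rows of each assembly, as printed by head() above:
head_gdr18 = [5267262, 244536, 81888, 30368, 15220, 11034]
head_corn = [5208491, 1065576, 650718, 249380, 142835, 96012]
```

Even on these truncated rows, the repetitive portion of the Cornman assembly (~2.2 Mb) dwarfs that of the gdr18 assembly (~0.4 Mb), matching the conclusion in the post.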

Any thoughts? Has this approach been used before?

Thank you,

modified 4.1 years ago by Brian Bushnell • written 4.1 years ago by Adrian Pelin

Maybe trying to find known repeats in both assemblies would help, e.g. running a tool such as CENSOR on both assemblies would tell you which assembly contains more known repeats.

modified 4.1 years ago • written 4.1 years ago by Prakki Rama

I would suggest running k-mer-based analyses.

written 4.1 years ago by lh3
Brian Bushnell (Walnut Creek, USA) wrote, 4.1 years ago:

In addition to k-mer spectrum analyses, you can shred the smaller assembly into (say) 100 bp pieces and map them to the larger one, allowing multi-mapping fragments to align to all equal-scoring locations, then calculate coverage. That will tell you approximately how much unique content the larger one has.

For example, with BBMap: bbmap.sh ref=big.fa in=small.fa fastareadlen=100 ambig=all covstats=covstats.txt

Alternately, you can use BBDuk to mask the portions of one genome that share k-mers with the other: bbduk.sh ref=small.fa in=big.fa out=masked.fa ktrim=N k=31

That will tell you how many bases in the big fasta are covered by kmers in the small fasta, as well as giving you the unique sequence (all shared sequence will be N-masked).
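The shared/unique split can then be tallied by counting N-masked bases in the output. A short sketch (the parsing here is illustrative, not part of BBDuk; the toy record is made up):

```python
# Count the fraction of N-masked bases in a FASTA, given as lines of text.
# After BBDuk masking, this fraction approximates the shared content.

def masked_fraction(fasta_lines):
    total = masked = 0
    for line in fasta_lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip headers and blank lines
        total += len(line)
        masked += line.upper().count("N")
    return masked / total

# Toy example: a 20 bp record with 8 masked bases -> 40% shared.
toy = [">contig1", "ACGTNNNNACGT", "NNNNACGT"]
```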

modified 4.1 years ago • written 4.1 years ago by Brian Bushnell

These all look like good ways to answer the question, and should confirm the analysis I already performed, since the k-mer approach seems the way to go. Thanks guys, I will report what I find.

written 4.1 years ago by Adrian Pelin

Okay, I tried both approaches, BBMap and BBDuk.

When using BBMap, I was looking at a few things in the produced covstats.txt file, particularly the 6th column, "Covered_bases". Summing Covered_bases over all contigs tells me how much of the larger assembly is covered, and by subtraction how much remains uncovered. Running ref=big.fa in=small.fa fastareadlen=100 ambig=all covstats=covstats.txt as suggested, I find a total of 6,088,353 covered bp in an assembly that is 7,860,219 bp in size, meaning 22.5% unique content.
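That sum is easy to script. A sketch, assuming a tab-separated covstats file with a '#'-prefixed header and Covered_bases in the 6th column as stated above (the column names in the toy line are illustrative):

```python
# Sum the Covered_bases column of a BBMap covstats file and report the
# uncovered, i.e. unique, fraction of the assembly.

def unique_fraction(covstats_lines, assembly_size):
    covered = 0
    for line in covstats_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip the header and blank lines
        covered += int(line.split("\t")[5])  # 6th column: Covered_bases
    return 1 - covered / assembly_size

# Single-row toy file carrying the totals reported in the post:
toy = ["#ID\tAvg_fold\tLength\tRef_GC\tCovered_percent\tCovered_bases",
       "contig1\t10.0\t7860219\t0.5\t77.5\t6088353"]
```

With the numbers above, 1 - 6088353/7860219 comes out to roughly 0.225, i.e. the 22.5% unique content reported.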

I thought of redoing the BBMap step but splitting into even smaller read fragments, so I tried fastareadlen=23; as expected, a smaller amount of unique content was reported: 13.9%.

The reason I redid the BBMap step is that I wanted to compare it to the BBDuk results. With BBDuk, for k=31 and k=23, I found the percentage of N to be 94.00% and 96.14% respectively, meaning 4-6% unique content.

Results of BBDuk at k=23 should be similar to BBMap with fastareadlen=23; in fact, BBDuk should report more unique content because its k-mers allow 0 mismatches, while BBMap allows some mismatches. Is that correct?

I am not sure if I should be worried that BBMap reported 22.5% unique content in the bigger assembly, because that assembly was built from 454 reads and is plagued by small contigs. If I only keep contigs above 1 kb, the big assembly shrinks to 5,166,989 bp with 7.2% unique content.
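The length filtering in that last step can be done with a few lines of script (a sketch; the function and toy records are mine, and a tool like BBTools' reformat.sh can do the same job):

```python
# Keep only FASTA records at least min_len bases long.

def filter_contigs(fasta_lines, min_len=1000):
    """Yield (header, sequence) for records of length >= min_len."""
    header, seq = None, []
    for line in list(fasta_lines) + [">EOF"]:  # sentinel flushes the last record
        line = line.strip()
        if line.startswith(">"):
            if header is not None and sum(map(len, seq)) >= min_len:
                yield header, "".join(seq)
            header, seq = line, []
        elif line:
            seq.append(line)

# Toy file: one 1200 bp contig (kept) and one 300 bp contig (dropped).
toy = [">a", "A" * 1200, ">b", "C" * 300]
```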

What are your thoughts?

modified 4.1 years ago • written 4.1 years ago by Adrian Pelin

By default, BBDuk actually allows a single mismatch in the middle base of the kmer unless you set "maskmiddle=false".  Therefore, even if you have a few scattered mismatches, you will still get 100% coverage as long as the mismatches are not within 1/2 of a kmer length of each other.  Sorry for not mentioning that; just set "maskmiddle=f" to use kmers allowing zero mismatches.  Alternately, you can also increase the number of mismatches allowed with "hdist=1".

Even apart from that, BBDuk will generally report higher coverage because it uses all kmers (including overlapping ones) from the reference, while BBMap - if you set to fastareadlen=23 - will NOT use overlapping sequences.  So, "BBMap fastareadlen=23" would be more equivalent to "BBDuk k=23 rskip=23", which only uses every 23rd reference 23-mer.  To replicate the kmer-masking using mapping, you would need to shred the reference into 23bp reads at high depth; the command used only shreds it to 1x depth.
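The density difference described here is easy to quantify: a sequence of length L has L - k + 1 overlapping k-mers, while rskip=k (or shredding into k bp reads at 1x depth) samples only every k-th one. A quick illustrative count:

```python
# Overlapping k-mers vs every-k-th k-mer for a 1000 bp reference at k=23.
k, L = 23, 1000
overlapping = L - k + 1                  # all reference 23-mers: 978
every_kth = len(range(0, L - k + 1, k))  # rskip=23 analogue: 43
```

So the default BBDuk reference uses roughly 23 times as many k-mers as the shredded-to-1x mapping approach, which is why it reports higher coverage.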

BBMap does ultimately allow more mismatches than BBDuk, but with "ambig=all", if there are multiple mapping locations and, say, 2 have 0 mismatches while 3 have 1 mismatch, only the 2 with 0 mismatches will be reported.

modified 4.1 years ago • written 4.1 years ago by Brian Bushnell

Okay, that definitely explains the results! Thanks. In your opinion, what do these results indicate? Is there significant unique content in the larger assembly?

written 4.1 years ago by Adrian Pelin

It sounds like the larger assembly mostly contains additional repetitive content from the smaller assembly with minor differences.  Sometimes it's also useful to see what percent of reads will map to the genome allowing few or no mismatches; with two assemblies, if a substantially larger percent of the reads map to one of them, that one typically has more real genomic content, and I would generally consider it a better assembly.  If one assembly is bigger but a similar fraction of reads maps to each, then the smaller one is probably better.

At a minimum, though, the larger assembly also seems to have perhaps ~6% completely novel sequence. You may want to try BLASTing the unique sequence in the larger assembly against nt/nr to get an idea of its taxonomy; if it seems related to what you expect, then I would keep it. If not, then it is possible the extra content in the larger assembly is contamination.

written 4.1 years ago by Brian Bushnell
Powered by Biostar version 2.3.0