Short-long read hybrid de novo assembly - determine k-mer size
Entering edit mode
3.6 years ago
Kenny ▴ 30

Dear all,

I am trying to do a short-long read hybrid de novo assembly on the MiSeq 2x75 paired end data and TruSeq long reads data (1500-16608bp) using three different assemblers, Velvet, ABySS, and SOAPdenovo.

I have already generated:

1) Velvet contigs using MiSeq only

2) ABySS contigs using MiSeq only

3) SOAP contigs using MiSeq only

And then, I combined the contigs with the long reads. For example,

cat Velvet_67/contigs.fa LongRead.fasta > Velvet_Hybrid.fasta

Now, I want to run assemblies using the hybrid fasta and three assemblers mentioned above.

My question is, how do I select the k-mer size?

For example, my Velvet_Hybrid.fasta has min read length of 133 and max read length of 16608.


Assembly • 1.2k views
Entering edit mode
3.6 years ago

VelvetOptimiser should assist you in this regard. It allows you to check a range of k-mers and then uses a selected metric to pick the best one. It executes velvetg on each k-mer in your range.

You should be able to execute it with somethng like:

perl /home/k/kb295/programs/VelvetOptimiser-2.2.5/ \
    --hashs 29 --hashe 79 --step 2 -a --optFuncKmer "n50" \
    --optFuncCov "Lbp" --threads 12 --dir_final Assembly/ \
    --velvethfiles "-shortPaired -separate -fastq.gz Sample1Mate1.fastq.gz Sample1Mate2.fastq.gz -shortPaired2 -separate -fastq.gz Sample2Mate1.fastq.gz Sample2Mate2.fastq.gz -shortPaired3 -separate -fastq.gz Sample3Mate1.fastq.gz Sample3Mate2.fastq.gz -shortPaired4 -separate -fastq.gz Sample4Mate1.fastq.gz Sample4Mate2.fastq.gz -strand_specific" \
    --velvetgoptions "-cov_cutoff auto -exp_cov auto -read_trkg yes"

Here, I run with 4 mate-pair samples and the optimisation function (that it uses to determine optimum k-mer) is the n50. IT checks each k-mer from 29 to 79, jumping by 2 each time.

Entering edit mode

Which read categories should I pick? since I only have a single file.

Read categories are:
short (default)
short2 (same as short, but for a separate insert-size library)
shortPaired2 (see above)
long (for Sanger, 454 or even reference sequences)
Entering edit mode

I would choose long for your data. Here's what the author and developer wrote:

The first distinction is made between long and short reads. There is no strict rule to decide what is long and short, but long read alignments are stored on more detailed data structures, which take up more memory, but allow the system to completely reconstruct their path through the assembly. Typically, reads which are longer than 200 bp (e.g. 454 or capillary) would be marked as long, but if memory is insufficient, they can also be considered short.


Entering edit mode

Thank you Kevin! Appreciated :)


Entering edit mode

I am currently running velvet on my velvet_hybrid.fasta file. My command is:

for k in {93..131..2}; do velveth ./Velvet_VelvetContigs/k$k $k -long -fasta Velvet_Hybrid.fasta; velvetg ./Velvet_VelvetContigs/k$k -ins_length auto -exp_cov auto -cov_cutoff auto -read_trkg yes; done

However, I get some of the unusual messages, for example:

[0.000000] Reading FastA file Velvet_Hybrid.fasta; [16.050157] 199270 sequences found [16.050174] Done [16.097246] Reading read set file ./Velvet_VelvetContigs/k93/Sequences; [16.750961] 199270 sequences found [18.714973] Done [18.714992] 199270 sequences in total. [18.715532] Writing into roadmap file ./Velvet_VelvetContigs/k93/Roadmaps... [25.688339] Inputting sequences... [25.692383] Inputting sequence 0 / 199270 Killed [0.000001] Reading roadmap file ./Velvet_VelvetContigs/k93/Roadmaps [0.850929] 199270 roadmaps read

[0.000000] Reading roadmap file ./Velvet_VelvetContigs/k95/Roadmaps [0.951589] 199270 roadmaps read [0.953281] Creating insertion markers [1.024402] Ordering insertion markers [1.074139] Counting preNodes [1.155569] 1735089 preNodes counted, creating them now [5.339031] Irregular sequence file: are you sure your Sequence and Roadmap file come from the same source? [0.000000] Reading FastA file Velvet_Hybrid.fasta; [13.756826] 199270 sequences found


Login before adding your answer.

Traffic: 1574 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6