VelvetOptimiser: optimum cov_cutoff, PE insert stats, 2 Insert Size Libraries, N's without scaffolding, Using 2 Insert Size Libraries
6.7 years ago

I am trying to use VelvetOptimiser to do a de novo assembly of a plant genome. I have two libraries: a 100 bp PE library with ~400 bp insert size and a 150 bp PE mate-pair library with 12-18 kb insert size. I have several questions:

A) I tried an initial assembly with only one library (the 100 bp PE with 440 bp insert size) like this: -s 35 -e 47 -f '-fastq -shortPaired pe_merged -fastq -short s_se' -a -o '-unused_reads yes' -t 1

The results were good (Final graph has 731536 nodes and n50 of 5769, max 55415, total 149725954, using 51729078/69351035 reads). VelvetOptimiser identified k = 41 and set the expected coverage to 10.
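As a side note, Velvet's expected coverage is k-mer coverage, not base coverage; the two are related by Ck = C * (L - k + 1) / L. A quick sanity check of what exp_cov 10 at k = 41 implies for 100 bp reads (the numbers come from this post; the formula is the standard Velvet one):

```shell
# Convert Velvet's k-mer coverage (Ck) back to base coverage (C):
# C = Ck * L / (L - k + 1), with read length L and k-mer size k.
awk 'BEGIN { Ck = 10; L = 100; k = 41; printf "%.1f\n", Ck * L / (L - k + 1) }'
# prints 16.7 (so exp_cov 10 at k=41 is roughly 17x base coverage)
```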

First question: VelvetOptimiser took 6 iterations to declare the optimum coverage cutoff of 1.89, yet it ran the final velvetg with -cov_cutoff 1.44! Can anyone explain this? Should I re-run velvetg with the optimum coverage cutoff?

Second question: The Paired Library insert stats gave me 2 lines:

Paired-end library 1 has length: 438, sample standard deviation: 19
Paired-end library 1 has length: 441, sample standard deviation: 36

Why is this coming out as two different libraries?

Third question: The final output has runs of N's. According to QUAST, another assembly-assessment tool, #N's per 100 kbp = 10271. I did not use the -scaffolding option, and I do not see VelvetOptimiser passing it to velvetg. Can anyone explain why velvet is generating these N's?

B) Then I tried to combine the two libraries like this: -s 31 -e 47 -f '-fastq -shortPaired pe_merged -fastq -shortPaired2 mp_merged -fastq -short s_se_combined' -a -o '-unused_reads yes' -t 1

The results came out VERY bad. The expected coverage even went down to 9. Here are the results of the final iteration:

Velvet hash value: 47
Roadmap file size: 7263105695
Total number of contigs: 387128
n50: 811
length of longest contig: 17283
Total bases in contigs: 158410389
Number of contigs > 1k: 36064
Total bases in contigs > 1k: 68344726
Paired Library insert stats:
Paired-end library 1 has length: 437, sample standard deviation: 21
Paired-end library 2 has length: 211, sample standard deviation: 105
Paired-end library 1 has length: 440, sample standard deviation: 57
Paired-end library 2 has length: 220, sample standard deviation: 138
Paired-end library 1 has length: 440, sample standard deviation: 62
Paired-end library 2 has length: 221, sample standard deviation: 149

So my fourth question is: how should I do the genome assembly using two insert-size libraries? Is it better to finish the assembly from the first stage and then use another program (SSPACE) to integrate the mate-pair sequences?

Thank you

denovoassembly VelvetOptimiser
6.7 years ago
SES 8.5k

First, I don't think velvet/VelvetOptimiser is appropriate for assembling a (typical) plant genome, especially with the type of data you have (but see below; I've updated my answer). AllPaths would probably be more appropriate, but I will try to answer each of your questions.

1) VelvetOptimiser finds the best coverage cutoff and uses that with velvetg, so running velvetg manually likely won't change the result.
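If you do want to test it yourself, re-running velvetg by hand is cheap relative to re-hashing. A sketch, assuming VelvetOptimiser left its k=41 working directory as auto_data_41 (the directory name is an assumption; the parameter values are from the post, so check your own log):

```shell
# Hypothetical manual re-run of velvetg with the cutoff the log called
# "optimum"; auto_data_41 is an assumed directory name, not verified.
velvetg auto_data_41 -cov_cutoff 1.89 -exp_cov 10 -unused_reads yes
```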

2) This may have to do with how velvet was run. For example, '-shortPaired' assumes paired-end reads, and I don't think that would be the best option for long-insert mate-pair libraries. I've never used velvet with mate-pair data, so I can't say for sure what the command would be. (UPDATE: It looks like the velvetg option you want is '-shortMatePaired yes', which is not documented except on the program menu. See also this link for ideas on combining libraries with velvet.)
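To make that update concrete, here is a hedged sketch of a two-library run using the per-channel form of the flag. The file names are from the post, but the exact flag spellings (-shortMatePaired2, -ins_length2) and the 15 kb insert estimate (midpoint of the stated 12-18 kb) should be verified against your velvetg help output:

```shell
# Sketch: hash the PE library into channel 1 and the mate pairs into
# channel 2, then tell velvetg that channel 2 is a mate-pair library.
velveth out_dir 41 -fastq -shortPaired pe_merged -fastq -shortPaired2 mp_merged
velvetg out_dir -exp_cov auto -shortMatePaired2 yes \
    -ins_length 440 -ins_length2 15000
```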

3) Velvet will do scaffolding, so that is likely what you are seeing.

4) It's not uncommon to see results deteriorate at very high coverage, since increasing coverage adds more repetitive sequence, though this case is pretty extreme. It is hard to diagnose just by looking at the results, without knowing the number of reads in the unpaired file added to the second assembly. I've found that adding coverage improves things up to a point, and you could investigate this by subsampling your data.
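The subsampling idea can be sketched without special tooling; for real data seqtk sample is the usual choice, but the record-skipping logic is just this (the two-read FASTQ here is made up for illustration):

```shell
# Naive FASTQ subsampler: keep every 2nd read (FASTQ records are 4 lines).
# For real work prefer e.g.: seqtk sample -s100 reads.fastq 0.5
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n' \
  | awk 'NR % 8 >= 1 && NR % 8 <= 4'
# prints only the @r1 record
```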

5) I'm not sure of the best way to deal with this in velvet. One approach would be to assemble the libraries separately, but this often results in assemblies that are difficult to merge. Other programs, like SOAPdenovo or Ray, will let you specify the library sizes prior to assembly. Hope that helps.
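If you go the SSPACE route mentioned in the question, the library file is just a whitespace-separated table. A sketch with hypothetical file names (the exact columns vary between SSPACE versions, so check the README of your copy):

```shell
# Hypothetical SSPACE libraries.txt: name, read files, insert size,
# allowed insert-size error, and orientation (FR = paired-end, RF = mate pair).
cat > libraries.txt <<'EOF'
pe_lib  pe_R1.fastq  pe_R2.fastq  440    0.25  FR
mp_lib  mp_R1.fastq  mp_R2.fastq  15000  0.5   RF
EOF
```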

EDIT: To add context to my first statement, a 'typical' plant genome is highly repetitive and several gigabases in size. This usually leads to very high memory usage with velvet, and combined with VelvetOptimiser, assembling a large genome will likely not be possible.

If the genome size you are seeing is what you expect (~150 Mbp), though, then these programs are definitely appropriate and I would try to tune velvet/VelvetOptimiser further, especially with respect to my edit above on combining libraries (I would personally try AllPaths as well). It would also be good to know the expected genome size and repeat content.


Thank you @SES

I have some clarifications on my questions:

  1. Coverage cutoff: the optimiser has already identified the optimum cutoff, but the problem is that it used a different value for the final run.
  2. Insert size stats: my problem is that the optimiser output reports two different insert sizes for each paired-end library.
  3. Scaffolding: I agree with you that this is velvet doing scaffolding. My point was that the velvet manual says the default value for the scaffolding option is "no".
  4. Integrating a mate-pair library: the link you referred to is great. I will follow their advice about reversing the orientation of the mate pairs and then update here. I will also try AllPaths. The calculated expected genome size (based on the 100 bp PE library) is ~150 Mb. How can I predict the repeat content?
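For the orientation flip in point 4, reverse-complementing the mate-pair reads converts an RF (outward-facing) library into FR order; seqtk can do this (the file names here are hypothetical):

```shell
# Reverse-complement mate-pair reads so the assembler sees FR orientation.
seqtk seq -r mp_R1.fastq > mp_R1.rc.fastq
seqtk seq -r mp_R2.fastq > mp_R2.rc.fastq
```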
SES replied:
  1. I will have to run VelvetOptimiser and see what happens, because I'm not sure why you would see that. It could be that what you are seeing is the coverage cutoff for one hash length vs. another.
  2. This is normal if you have more than one input file, it will report the insert size for each file.
  3. From the Velvet manual, under the section "Scaffolding": "By default, Velvet will try to scaffold contigs that it cannot quite connect. This results in sequences of Ns in the contigs.fa file, which correspond to the estimated distance between two neighbouring contigs."
  4. I stand corrected, the option for mate pairs is mentioned in the Velvet manual under the section on velvetg. For finding repeat content, I recommend Transposome. I'm the author of that tool, so don't hesitate to send me a message or email if you have questions about the usage or installation.
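Following up on point 3: since scaffolding is on by default, the flag can be passed through VelvetOptimiser's -o string if gap-free contigs are wanted. A sketch based on the options already used in the post (velvetg's -scaffolding flag is documented in the Velvet manual):

```shell
# Disable velvetg scaffolding so contigs.fa contains no runs of Ns.
VelvetOptimiser.pl -s 35 -e 47 \
    -f '-fastq -shortPaired pe_merged -fastq -short s_se' \
    -o '-unused_reads yes -scaffolding no' -t 1
```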
