Question: VelvetOptimiser: optimum cov_cutoff, PE insert stats, 2 Insert Size Libraries, N's without scaffolding, Using 2 Insert Size Libraries
gravatar for drtamermansour
4.4 years ago by
United States
drtamermansour40 wrote:

I am trying to use VelvetOptimiser to do denovo assembly of a plant. I have 2 libraries; 100PE with insert size ~ 400bp and 150PE mate pair library with 12-18k insert size. I have several questions:

A) I tried to do initial assembly with only 1 library (the 100PE with insert size 440bp) like this: -s 35 -e 47 -f '-fastq -shortPaired pe_merged -fastq -short s_se' -a -o '-unused_reads yes' -t 1
The results were good (Final graph has 731536 nodes and n50 of 5769, max 55415, total 149725954, using 51729078/69351035 reads). The VelvetOptimiser identified K=41 and set the Expected coverage to 10.
First question: VelvetOptimiser took 6 itrations to declare the Optimum value of coverage cutoff = 1.89, however it ran the final velvetg with (-cov_cutoff 1.44) !!!! Any one can explain this? should I re-run velvetg with the optimum  coverage cutoff?
Second question: the Paired Library insert stats gave me 2 lines:
                   Paired-end library 1 has length: 438, sample standard deviation: 19
                   Paired-end library 1 has length: 441, sample standard deviation: 36
Why this is coming out as two different libraries ?? 
Third question: The final output has runs of "N's". According to Quast which is another software to assess the assembly, # N's per 100 kbp = 10271. I did not use the "-scaffolding" option and I do not see the VelvetOptimiser passing this to the velevtg. Can anyone explain why the velvet is generating this N's?

B) Then I tried to combine the 2 libraries like this: -s 31 -e 47 -f '-fastq -shortPaired pe_merged -fastq -shortPaired2 mp_merged -fastq -short s_se_combined' -a -o '-unused_reads yes' -t 1
The results came out VERY bad. The expected coverage even went down to 9. Here is the results of the final itaration:
Velvet hash value: 47
Roadmap file size: 7263105695
Total number of contigs: 387128
n50: 811
length of longest contig: 17283
Total bases in contigs: 158410389
Number of contigs > 1k: 36064
Total bases in contigs > 1k: 68344726
Paired Library insert stats:
Paired-end library 1 has length: 437, sample standard deviation: 21
Paired-end library 2 has length: 211, sample standard deviation: 105
Paired-end library 1 has length: 440, sample standard deviation: 57
Paired-end library 2 has length: 220, sample standard deviation: 138
Paired-end library 1 has length: 440, sample standard deviation: 62
Paired-end library 2 has length: 221, sample standard deviation: 149

So my Fourth question is how to do the genome assembly Using 2 Insert Size Libraries ? Is it better to finish assembly from the 1st stage and use another software (SSPACE) for integration of mate pair sequences?

Thank you  

ADD COMMENTlink modified 4.4 years ago • written 4.4 years ago by drtamermansour40
gravatar for SES
4.4 years ago by
Vancouver, BC
SES8.1k wrote:

First, I don't think velvet/VelvetOptimiser are appropriate for assembling a (typical) plant genome, especially with the type of data you have (but see below, I've updated my answer). AllPaths would probably be more appropriate but I will try to answer each of your questions.

1) VelvetOptimiser finds the best coverage cutoff and uses that with velvetg, so running velvetg manually likely won't change the result.

2) This may have to do with how velvet was run. For example, '-shortPaired' would assume paired-end reads but I don't think that would be the best option for long-insert mate pair libraries. I've never used velvet with mate pair data, so I can't say for sure what the command would be. (UPDATE: It looks like the velvetg command you want is '-shortMatePaired yes' which is not documented except on the program menu. See also, this link for ideas on combining libraries with velvet.)

3) Velvet will do scaffolding, so that is likely what you are seeing.

4) It's not uncommon to see the results deteriorate when you have very high coverage given that increasing coverage adds more repetitive sequences. Though, this is pretty extreme. It is actually hard to diagnose by just looking at the results and not knowing the number of reads in that other file of unpaired sequences added to the second assembly. I've found that adding coverage improves things to a point, and you could investigate this by subsampling your data. 

5) I'm not sure how the best way to deal with this in velvet. One approach would be to assemble the libraries separately, but this often results in assemblies that are difficult to merge. Other programs, like SOAPdenovo or Ray, will let you input the library sizes prior to assembly. Hope that helps.

EDIT: To add context to my first statement, a 'typical' plant genome is highly repetitive and several gigabases in size. This usually leads to very high memory usage with velvet, and combined with VelvetOptimiser, assembling a large genome will likely not be possible. 

If what you are seeing in terms of genome size is what you expect (~150 Mbp) though, then these are definitely appropriate programs and I would try to tune velvet/VelvetOptimiser further, especially with respect to my edit above on combining libraries (I would personally try AllPaths, as well). It would be good to know the expected genome size and repeat content, also.

ADD COMMENTlink modified 4.4 years ago • written 4.4 years ago by SES8.1k

Thank you @SES

I have some clarifications on my questions:

1) coverage cutoff: The optimizer has already defined the optimum cutoff but the problem is that it used another value for the final run !!

2) Insert size stats: My problem is that the optimizer output has 2 different insert size for each paired end library?

3) Scaffolding: I do agree with you that this is the Velvet doing scaffolding. My point was that the manual of velevet says that the default value for the scaffolding option is “no”

4) Integrating a mate pair  library: The link you referred to is great. I will follow their advise about reversing the orientation of mate pairs then  I will update here. Also I will try “AllPaths” . The calculated expected genome size (based on the 100PE library) is ~ 150MB. How can I predict the repeat content ? 

ADD REPLYlink written 4.4 years ago by drtamermansour40

1) I will have to run VelvetOptimiser and see what happens because I'm not sure why you would see that. It could be that what you are seeing is the coverage cutoff for one hash length vs. another.

2) This is normal if you have more than one input file, it will report the insert size for each file.

3) From the Velvet manual under the section "Scaffolding" it says, "By default, Velvet will try to scaffold contigs that it cannot quite connect. This results in sequences of N s in the contigs.fa file, which correspond to the estimated distance between two neighbouring contigs."

4) I stand corrected, the option for mate pairs is mentioned in the Velvet manual under the section on velvetg. For finding repeat content, I recommend Transposome. I'm the author of that tool, so don't hesitate to send me a message or email if you have questions about the usage or installation.

ADD REPLYlink written 4.4 years ago by SES8.1k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2492 users visited in the last hour