Question: using SOAPdenovo assembly
2.7 years ago, walaa.shaalan wrote:

When I run a SOAPdenovo assembly, there are a lot of parameters whose meaning I do not know: maximum read length, average insert size, read orientations, Kmer size, merge level, Kmer selection, max number of transcripts per locus, minimum contig length for scaffolding, and max Kmer setting. What do they refer to?


Please take a look at the enter link description here

written 2.7 years ago by seta

I think you are talking about the SOAPdenovo2 assembler, and here you can find the documentation. Read it first; if something in it is not clear, you can then ask specific questions and people will be able to help. As it stands, your question is too general and asks about a lot of parameters. Good luck.

modified 2.7 years ago • written 2.7 years ago by Medhat
2.7 years ago, Macspider (Vienna - BOKU) wrote:

Hi walaa.shaalan,

First, let me ask you: did you read the program's manual? If not, the link below points to the program's command-line options and their explanations:

http://soap.genomics.org.cn/soapdenovo.html#comm2

Before continuing this conversation, read it carefully! As a quick overview, here is what the parameters you mention mean:

  • Maximum read length: the length of the longest read in your dataset.
  • average insert size: you should know it from your sequencing experiment, or you can infer it with a trick. Map a subset of ~10,000 read pairs to your reference with bowtie, setting -I 0 -X 3000, then extract the positive values in the TLEN field (column 9) of the output SAM file and plot them. You will see a peak at your average insert size, and you can use that value as the parameter. For Illumina paired-end reads, for example, it might be somewhere between 400 and 800 bp (but it depends on the library).
  • read orientations: for example, Illumina paired-end reads are usually oriented FR, which means that the first read is forward and the second one is reverse. If you don't know what this means, read this: The fastest way to check read orientation for mate pair library
  • Kmer size: this is the length of the words (kmers) that the assembly algorithm takes from your reads. Small kmers give more overlaps and more noise; big kmers give fewer overlaps and less noise. It's a trade-off.
  • merge level: when the assembly program finds an overlap, it is very likely to merge the two sequences into a single longer sequence (contig). A value between 0 and 3 tells the program how aggressively to do this, from 0 (weakest) to 3 (strongest). If you have an allotetraploid genome I would suggest 1; if you have a plain diploid genome, 2 or 3 are good.
  • Kmer selection: I don't recall a parameter with that name, but if you mean KmerFreqCutoff, it is a threshold for discarding kmers by frequency: a kmer whose frequency in your kmer list is no larger than the threshold is discarded and not used for the assembly.
  • max number of transcripts per locus: I have never heard of this one.
  • minimum contig length for scaffolding: use this parameter carefully, because it can change your N50 value and at the same time you might lose information. Basically, contigs shorter than this length are left out of scaffolding.
  • max Kmer setting: don't know this one.
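
The insert-size trick from the list above can be sketched in a few lines of Python. This is only a minimal illustration, not part of SOAPdenovo or bowtie: the function names `positive_tlens` and `summarize_insert_sizes` and the 10 bp bin width are my own assumptions. It relies only on the SAM layout, where TLEN is the 9th tab-separated column.

```python
from collections import Counter
from statistics import median


def positive_tlens(sam_lines):
    """Collect positive TLEN values (SAM column 9) from alignment lines."""
    values = []
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip SAM header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        tlen = int(fields[8])
        if tlen > 0:  # each pair reports +TLEN once and -TLEN once
            values.append(tlen)
    return values


def summarize_insert_sizes(tlens, bin_width=10):
    """Return (median, start of the most populated bin) for the TLEN values.

    bin_width is an arbitrary choice for this sketch; the peak bin plays
    the role of the peak you would see in a histogram plot.
    """
    if not tlens:
        raise ValueError("no positive TLEN values found")
    bins = Counter((t // bin_width) * bin_width for t in tlens)
    # Most populated bin; ties are broken in favour of the smaller bin start.
    peak_start, _ = max(bins.items(), key=lambda kv: (kv[1], -kv[0]))
    return median(tlens), peak_start
```

To use it on a real file, you could pass an open file handle, e.g. `summarize_insert_sizes(positive_tlens(open("subset.sam")))`, where `subset.sam` is a hypothetical name for the bowtie output. If the median and the peak bin disagree strongly, look at the full distribution before picking a value.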

Hope it was useful, but next time before asking this you should first read the manual! It's all in there! ;)
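
To make the Kmer size and KmerFreqCutoff points above more concrete, here is a toy Python sketch of counting kmers in reads and discarding the rare ones. It is not how SOAPdenovo is implemented: the names `count_kmers` and `drop_rare_kmers` are hypothetical, reverse complements are ignored for simplicity, and dropping kmers with a count no larger than the cutoff is my reading of the manual, so treat the exact boundary as an assumption.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring (kmer) across all reads."""
    counts = Counter()
    for read in reads:
        # A read shorter than k contributes no kmers (empty range).
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def drop_rare_kmers(counts, cutoff):
    """Keep only kmers seen more than `cutoff` times.

    Assumption: kmers with frequency <= cutoff are discarded, so a
    cutoff of 0 keeps everything.
    """
    return {kmer: n for kmer, n in counts.items() if n > cutoff}


reads = ["ACGTAC", "CGTACG", "ACGTTT"]  # made-up toy reads
counts = count_kmers(reads, k=3)
kept = drop_rare_kmers(counts, cutoff=1)
print(sorted(kept))
```

In this toy input the kmers that appear only once come from the tail of the third read, which is the kind of low-frequency signal (often sequencing errors) that a frequency cutoff is meant to remove.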
