using soap de novo assembly
4.7 years ago

When I run a SOAPdenovo assembly, there are a lot of parameters whose meaning I do not know. They are: maximum read length, average insert size, read orientations, Kmer size, merge level, Kmer selection, max number of transcripts per locus, minimum contig length for scaffolding, and max Kmer setting. What does each of them refer to?

Assembly • 4.2k views

Please take a look at the enter link description here


I think you are talking about the SOAPdenovo2 assembler, and here you can find the documentation. Read it first; if something in it is not clear, you can then ask specific questions and people will be able to help. As it stands, your question is too general and asks about a lot of parameters at once. Good luck.

4.7 years ago
Macspider ★ 3.4k

Hi walaa.shaalan,

First let me ask you: did you read the manual of the program? If not, this link will take you to the program's command-line options and their explanations:

http://soap.genomics.org.cn/soapdenovo.html#comm2

Before continuing this conversation, read it carefully! Just as a quick overview for the sake of your knowledge, the parameters you mention mean:

  • Maximum read length: the length of the longest read in your dataset.
  • average insert size: you must know it from your sequencing experiment, or you can estimate it with a trick. Map a subset of ~10,000 read pairs or so to your reference with bowtie, setting -I 0 -X 3000, then take the positive values in the TLEN field (column 9) of the output SAM file and plot them. You will see a peak where your average insert size is, and you can use that value as the parameter. If you have Illumina paired-end reads, let's say, it might be somewhere between 400 and 800 (but it depends on the library).
  • read orientations: for example, Illumina paired-end reads are usually oriented FR, which means that the first read of the pair is forward and the second one is reverse. If you don't know what this means, read this: The fastest way to check read orientation for mate pair library
  • Kmer size: this is the length of the words (substrings) that the assembly algorithm will extract from your reads. Small kmers give more overlaps and more noise; big kmers give fewer overlaps and less noise. It's a trade-off.
  • merge level: when the assembly programs find an overlap, they are very likely to merge the two sequences together to form a single longer sequence (contig). A value between 0 and 3 will tell the program how heavy to go on this, from 0 to hero let's say. If you have an allotetraploid genome I would suggest 1, if you have a plain diploid genome 2 or 3 are good.
  • Kmer selection: I don't recall a parameter with that name, but if you mean KmerFreqCutoff, then it is a threshold to cut off kmers depending on frequency: a kmer whose frequency in your reads is at or below the threshold will be discarded and not used for the assembly.
  • max number of transcripts per locus: I have never heard of this one.
  • minimum contig length for scaffolding: this is a parameter that you should use carefully because it can change your N50 value, and at the same time you might lose information. Basically, contigs shorter than this length will be left out of scaffolding.
  • max Kmer setting: don't know this one.
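
The insert-size trick from the list above can be sketched in a few lines of Python. This is only a minimal sketch: it assumes a plain-text SAM file produced by mapping a subset of read pairs, and the function names `insert_sizes` and `summarize` and the 3000 bp cap (matching -X 3000) are made up for illustration. It keeps the positive TLEN values (column 9) and reports the median and the most frequent value, which is roughly where the peak of the plot would be.

```python
import statistics
from collections import Counter


def insert_sizes(sam_lines, max_insert=3000):
    """Collect positive TLEN values (SAM column 9) from an iterable of SAM lines."""
    sizes = []
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip SAM header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete alignment record
        tlen = int(fields[8])
        # Each pair appears twice (+TLEN and -TLEN); keep only the positive copy.
        if 0 < tlen <= max_insert:
            sizes.append(tlen)
    return sizes


def summarize(sizes):
    """Return the count, median and most frequent insert size."""
    mode = Counter(sizes).most_common(1)[0][0]
    return {"n": len(sizes), "median": statistics.median(sizes), "mode": mode}


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python insert_size.py subset.sam
    with open(sys.argv[1]) as handle:
        print(summarize(insert_sizes(handle)))
```

The value you get this way is an estimate; if the distribution is broad or has two peaks, look at the histogram before trusting a single number.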

Hope it was useful, but next time before asking this you should first read the manual! It's all in there! ;)
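
P.S. To make it a bit more concrete, here is a rough sketch of where several of these parameters end up in a SOAPdenovo2 run, as far as I remember the manual: max read length, average insert size and orientation go into the config file (`max_rd_len`, `avg_ins`, `reverse_seq`), while kmer size, merge level, KmerFreqCutoff and minimum contig length are command-line options (`-K`, `-M`, `-d`, `-L`). The helper names, file names and default values below are placeholders I made up; check every key and flag against the manual for your version before running anything.

```python
def build_config(max_read_len, avg_insert, reverse_seq, fq1, fq2):
    """Return the text of a one-library SOAPdenovo2 config file.

    reverse_seq=0 is meant for forward-reverse (FR) paired-end libraries,
    reverse_seq=1 for reverse-forward mate-pair libraries.
    """
    lines = [
        f"max_rd_len={max_read_len}",   # maximum read length
        "[LIB]",
        f"avg_ins={avg_insert}",        # average insert size
        f"reverse_seq={reverse_seq}",   # read orientation
        "asm_flags=3",                  # use reads for contigs and scaffolds
        f"q1={fq1}",
        f"q2={fq2}",
    ]
    return "\n".join(lines) + "\n"


def build_command(config_path, prefix, kmer=31, merge_level=1,
                  kmer_freq_cutoff=0, min_contig_len=None,
                  binary="SOAPdenovo-63mer"):
    """Return the argument list for a full ('all') assembly run."""
    if not 0 <= merge_level <= 3:
        raise ValueError("merge level must be between 0 and 3")
    cmd = [binary, "all", "-s", config_path, "-o", prefix,
           "-K", str(kmer),               # kmer size
           "-M", str(merge_level),        # merge level
           "-d", str(kmer_freq_cutoff)]   # KmerFreqCutoff
    if min_contig_len is not None:
        cmd += ["-L", str(min_contig_len)]  # min contig length for scaffolding
    return cmd


if __name__ == "__main__":
    # Placeholder file names and values, only for illustration.
    print(build_config(100, 500, 0, "reads_1.fq", "reads_2.fq"))
    print(" ".join(build_command("lib.config", "asm_out", kmer=31)))
```

The command is built as a list so you can hand it straight to `subprocess.run` once you are happy with the values.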
