When I run a SOAPdenovo assembly there are a lot of parameters whose meaning I don't know. They are: Maximum read length, average insert size, read orientations, Kmer size, merge level, Kmer selection, max number of transcript per locus, minimum contig length for scaffolding, and max Kmer setting. What do they refer to?
First let me ask you: did you read the manual of the program? If not, this link will take you to the program's command-line options and their explanations:
Before continuing this conversation, read it carefully! As a quick overview, here is what the parameters you mention mean:
- Maximum read length: the length of the longest read in your dataset.
- average insert size: you should know it from your sequencing experiment, or you can estimate it with a trick. Map a subset of ~10,000 read pairs or so to your reference with bowtie, setting -I 0 -X 3000 (minimum and maximum insert size), then grep the positive values in the TLEN field (column 9) of the output SAM file and plot them. You will see a peak where your average insert size is, and you can use that value as the parameter. If you have Illumina reads, let's say, it might be somewhere between 400 and 800 (but it depends on the library).
- read orientations: for example, Illumina paired-end reads are usually oriented FR, which means that the first read is forward and the second one is reverse. If you don't know what this means, read this: The fastest way to check read orientation for mate pair library
- Kmer size: this is the length of the words that the assembly algorithm will cut from your reads. Small kmers give more overlaps and more noise; big kmers give fewer overlaps and less noise. It's a trade-off.
- merge level: when the assembly programs find an overlap, they are very likely to merge the two sequences together to form a single longer sequence (contig). A value between 0 and 3 will tell the program how heavy to go on this, from 0 to hero let's say. If you have an allotetraploid genome I would suggest 1, if you have a plain diploid genome 2 or 3 are good.
- Kmer selection: I don't recall a parameter with that name, but if you mean KmerFreqCutoff, then it is a threshold for discarding kmers by frequency: any kmer in your kmer list whose frequency is lower than the threshold will be discarded and not used for the assembly.
- max number of transcript per locus: I have never heard of this one.
- minimum contig length for scaffolding: this is a parameter that you should use carefully because it can modify your N50 value, and at the same time you might lose information. Basically, contigs smaller than this size will be discarded.
- max Kmer setting: don't know this one.
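For the insert size trick above, here is a minimal Python sketch of the "grep the positive TLEN values and look for the peak" step. It assumes you already have an uncompressed SAM file from the bowtie run; the function names (positive_tlens, summarize), the 3000 bp cap and the 10 bp bin width are just my illustrative choices, not anything from SOAPdenovo or bowtie.

```python
import statistics
from collections import Counter

def positive_tlens(sam_lines, max_insert=3000):
    """Collect positive TLEN values (SAM column 9) from alignment lines."""
    values = []
    for line in sam_lines:
        if line.startswith("@"):          # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:               # not a complete alignment record
            continue
        tlen = int(fields[8])             # TLEN is the 9th column
        # Keep only the positive value of each pair (the mate carries the
        # same number with a minus sign) and ignore implausibly large ones.
        if 0 < tlen <= max_insert:
            values.append(tlen)
    return values

def summarize(values, bin_width=10):
    """Return (median, start of the most populated bin) as a rough peak estimate."""
    bins = Counter((v // bin_width) * bin_width for v in values)
    peak_bin = max(bins, key=bins.get)
    return statistics.median(values), peak_bin
```

You would use it roughly like this: open the SAM file, pass the file handle to positive_tlens(), then call summarize() on the result. If the median and the peak bin are close to each other, that value is a reasonable guess for the average insert size; if they disagree a lot, plot the full distribution before trusting either number.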
Hope it was useful, but next time before asking this you should first read the manual! It's all in there! ;)
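P.S. If the kmer size and frequency cutoff points are still abstract, this toy Python sketch may help. It only illustrates the idea (cut reads into words of length k, count them, drop the rare ones); it is not how SOAPdenovo implements it, the function names are made up, and the exact boundary of the real cutoff (strictly lower, or lower-or-equal) is something to check in the manual.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every word of length k found in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def apply_cutoff(counts, min_freq):
    """Discard kmers whose frequency is lower than min_freq.

    Assumption: 'lower than' follows the description in this post;
    the real program may treat the boundary differently.
    """
    return {kmer: n for kmer, n in counts.items() if n >= min_freq}

# Two toy reads that differ only in the last base (think sequencing error).
reads = ["ACGTACGT", "ACGTACGA"]
counts = count_kmers(reads, 4)
kept = apply_cutoff(counts, 2)
```

In this toy case the kmer ACGA, which only exists because of the differing last base, is seen once and gets removed by a cutoff of 2, while the kmers shared by both reads survive.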