I'm having a problem which I'm having difficultly pinpointing so sorry if I'm not being too clear.
I generated simulated reads using Art at different read lengths and fold coverage. I'll include an example of the command line I used to make sure that I did that step correctly. I generated a number of reads for lengths, 50, 75, 100 and 250 each with fold coverages of 20, 49, 60 , 80 and 100. Basically I'll be getting a load of read data at the end of the month which I'll have to assemble. So I wanted to have some idea how to do this.
In this example, -i is the input ref genome, -l is the length of the reads, -f should be the coverage, -m the insert size, -s the standard deviation of read length, -o is the output name, and -na should suppress sam files and aln files as in this case I only want the reads.
art_illumina -i /Ref/E.coli.O104:H4.fasta -l 100 -f 100 -m 200 -s 10 -o ec_pe_l100_f100_ -na
The simulated seemed to run successfully so I was happy.
Q Version 2.1.8 (Mar 8, 2014)
Contact: Weichun Huang at firstname.lastname@example.org
Warning: your simulation will not output any ALN or SAM file with your parameter settings!
Paired-end sequencing simulation
Total CPU time used: 147.16
The random seed for the run: 1401371304
Parameters used during run
Read Length: 100
Genome masking 'N' cutoff frequency: 1 in 100
Fold Coverage: 100X
Mean Fragment Length: 200
Standard Deviation: 10
Profile Type: Combined
First Read: EMP100R1 (built-in profile)
Second Read: EMP100R2 (built-in profile)
FASTQ Sequence Files:
the 1st reads: ec_pe_l100_f100_1.fq
the 2nd reads: ec_pe_l100_f100_2.fq
I've used these reads in the assemblies which are running at the moment. However I also read about quake error correction tool which is supposed to correct errors in your read files, leading to better assemblies. Which I also decided to test. Quake requires a fold coverage of at least 15x which I should have. I used the following command to run quake. The k value was calculated according to the formula they provide, k=log4*[Genomesize], where my genome size is rough 5Mbp as it's E. coli, which gave me 15. -u just says not to throw away uncorrected reads so it will maintain the pairs. And -f is a file that contians the names of my reads on the same line as they're paired end.
quake.py -f ec_pe_l100_f100_list -u -k 15
However whenever I run quake, it just keeps exiting with the error :
Optimization of distribution likelihood function to choose k-mer cutoff failed. Very likely you have set the value of k too high or not provided adequate coverage (>15x). Inspect the k-mer counts for a clear separation of the error and true k-mer distributions.
And when I check the log file I get :
Error in est.cov(cov) :
Cannot identify reliable trusted k-mer coverage peak. Please verify the integrity of your dataset (e.g. by running kmer_hist.r). If there
is no clear trusted k-mer distribution, the coverage is too low to correct these reads. If there is, please e-mail email@example.com
and choose the cutoff by hand in the meantime
So I ran this script and I got a graph where there is no clear coverage. I tried again with a lower k-value of 8 which the histogram seems to suggest, and the program fails again with the a different error: OSError: [Errno 7] Argument list too long
I guess I'm not sure if it's a problem with quake or the simulated reads. I tried it on a real reads dataset I downloaded from the sra and the quake program works fine. Has anyone encountered similar problems that could guide me in the right direction? Does anyone know how to ensure the reads are at the coverage they say they are? I'd really appreciate any advice you have and please let me know if you need me to provide additional information.
Summary : Generated simulated reads with art, and quake won't work on them. Unsure if it's a quake problem or a problem with the reads.