This is somewhat of a follow-up from a previous post (https://www.biostars.org/p/395813/). I was recently given 4 paired-end .fastq files (where each read has about 150 bases), each from a different strain of bacteria. My assignment is to 1) assemble each strain into its full circle and 2) perform a pairwise comparison between the four strains to discover insertions, deletions, and duplications between them.
Based on suggestions from the previous post, I tentatively have the following pipeline in mind: 1) Run fastqc, 2) Create contigs using SPAdes, 3) Fully align contigs into their "full circle or full linear alignment" using MUMmer and examine for insertions, deletions, and duplications between the files.
The fastq output looks good. So, I ran
spades.py -k 21,33,55,77 --careful -1 read1.fastq -2 read2.fastq on each sample. I then used R to examine the contigs.fasta files in the main output directory (I think this means the output from using a kmer of 77). Now, I am really lost trying to determine if the output looks decent (i.e. ready for input to MUMmer for comparison between the sequences). Here were my findings:
1) Across the four samples, the number of contigs ranged between 40 and 100.
2) The 5-number-summary of the contig lengths were as follows:
## Sample Min Q1 Med Q3 Max ## 1 Sample 1 78 155.0 520.5 14542.5 199284 ## 2 Sample 2 78 237.5 277.0 8684.5 498771 ## 3 Sample 3 78 247.0 363.0 34922.0 364001 ## 4 Sample 4 78 240.5 268.0 1889.0 364018
3) The 5-number summary of the contig coverage were as follows:
## Sample Min Q1 Med Q3 Max ## 1 Sample 1 0.253165 15.477241 19.47476 33.14796 847 ## 2 Sample 2 0.415254 0.805012 20.91779 35.29753 771 ## 3 Sample 3 0.770833 0.907975 28.25000 33.77679 1120 ## 4 Sample 4 0.544141 0.782492 0.96732 46.50450 1262
My first question is: Do any of these values seem worrisome? (i.e. the large range in the number of contigs? or, certain contig coverage being less than 1? or, discrepancies in the contig coverage between samples, especially Sample 4 having a Q3 for contig coverage less than 1? etc.?)
My second question is: related to the MUMmer process (where I hope to identify differences between the sequences). I have read the MUMmer vignette a few times but am still unclear about:
A) Whether a contig.fasta file is appropriate as input (as each contig.fasta file contains 40-100 contigs in my case, rather than an "fully assembled genome" - assuming there is a difference)?
B) Whether it may be better simply to do the entire assembly process in MUMmer? Rather than align the contigs in SPAdes and then import into MUMmer for an additional alignment?
C) How I can determine how long my sequences are? For instance, MUMmer suggests only using run-mummer1 on small sequences (which they define as <10Mbp). Right now, all I have are contigs. Is it possible for me to know the sequence lengths for the 4 samples without them being fully aligned (assuming a set of contigs is not "full alignment")?
D) How I can determine how similar the sequences are? This can help me decide whether to use NUCmer, PROmer, run-mummer1, or run-mummer3. For now, as a default, I plan to simply run all.
Suggestions to any of these topics would be greatly appreciated. I am having difficult moving forward possibly due to preconceived misunderstandings about the difference between a set of contigs and a fully-aligned genome, a lack of me finding resources demonstrating the transition between SPAdes to MUMmer, and a lack of my understanding of determining the degree of similarity between these contig files.