Assemble bacterial .fastq files and find differences (SPAdes followed by MUMmer)
1
0
Entering edit mode
2.1 years ago
LRStar ▴ 190

This is somewhat of a follow-up from a previous post (https://www.biostars.org/p/395813/). I was recently given 4 paired-end .fastq files (where each read has about 150 bases), each from a different strain of bacteria. My assignment is to 1) assemble each strain into its full circle and 2) perform a pairwise comparison between the four strains to discover insertions, deletions, and duplications between them.

Based on suggestions from the previous post, I tentatively have the following pipeline in mind: 1) Run fastqc, 2) Create contigs using SPAdes, 3) Fully align contigs into their "full circle or full linear alignment" using MUMmer and examine for insertions, deletions, and duplications between the files.

The fastq output looks good. So, I ran spades.py -k 21,33,55,77 --careful -1 read1.fastq -2 read2.fastq on each sample. I then used R to examine the contigs.fasta files in the main output directory (I think this means the output from using a kmer of 77). Now, I am really lost trying to determine if the output looks decent (i.e. ready for input to MUMmer for comparison between the sequences). Here were my findings:

1) Across the four samples, the number of contigs ranged between 40 and 100.
2) The 5-number-summary of the contig lengths were as follows:

##     Sample Min    Q1   Med      Q3    Max
## 1 Sample 1  78 155.0 520.5 14542.5 199284
## 2 Sample 2  78 237.5 277.0  8684.5 498771
## 3 Sample 3  78 247.0 363.0 34922.0 364001
## 4 Sample 4  78 240.5 268.0  1889.0 364018


3) The 5-number summary of the contig coverage were as follows:

##     Sample      Min        Q1      Med       Q3  Max
## 1 Sample 1 0.253165 15.477241 19.47476 33.14796  847
## 2 Sample 2 0.415254  0.805012 20.91779 35.29753  771
## 3 Sample 3 0.770833  0.907975 28.25000 33.77679 1120
## 4 Sample 4 0.544141  0.782492  0.96732 46.50450 1262


My first question is: Do any of these values seem worrisome? (i.e. the large range in the number of contigs? or, certain contig coverage being less than 1? or, discrepancies in the contig coverage between samples, especially Sample 4 having a Q3 for contig coverage less than 1? etc.?)

My second question is: related to the MUMmer process (where I hope to identify differences between the sequences). I have read the MUMmer vignette a few times but am still unclear about:

A) Whether a contig.fasta file is appropriate as input (as each contig.fasta file contains 40-100 contigs in my case, rather than an "fully assembled genome" - assuming there is a difference)?

B) Whether it may be better simply to do the entire assembly process in MUMmer? Rather than align the contigs in SPAdes and then import into MUMmer for an additional alignment?

C) How I can determine how long my sequences are? For instance, MUMmer suggests only using run-mummer1 on small sequences (which they define as <10Mbp). Right now, all I have are contigs. Is it possible for me to know the sequence lengths for the 4 samples without them being fully aligned (assuming a set of contigs is not "full alignment")?

D) How I can determine how similar the sequences are? This can help me decide whether to use NUCmer, PROmer, run-mummer1, or run-mummer3. For now, as a default, I plan to simply run all.

Suggestions to any of these topics would be greatly appreciated. I am having difficult moving forward possibly due to preconceived misunderstandings about the difference between a set of contigs and a fully-aligned genome, a lack of me finding resources demonstrating the transition between SPAdes to MUMmer, and a lack of my understanding of determining the degree of similarity between these contig files.

contig bacteria spades mummer SNP • 903 views
3
Entering edit mode
2.1 years ago
Joe 19k

### 1)

Higher numbers of contigs from short read sequencing is to be expected. SPAdes automatically optimised the kmer size to give you what it considers the best assemblies. 40-100 contigs doesn't sound outlandish to me. I'm guessing if this is assignment data, they've given you reasonably solid FASTQ data.

If your coverage numbers are based on what SPAdes puts in the fasta headers, that isn't sequence coverage, it's kmer coverage, so a low number is not necessarily worrisome.

My advice would be to ditch R and use some proper sequencing QC programs like Qualimap and Quast. You will need other indicative numbers such as the N50 and average length etc, to say much at all.

### 2)

A) A contigs.fasta should be an appropriate input for an aligner like mummer, however the alignments might be trash without first sorting the contigs by order. For this, I would suggest finding a reference closed genome (if you know what the organism for the fastqs is), and using progessiveMauve to reorder the contigs. You can then concatenate them and feed this in to mummer. Alternatively (and possibly better), is to use mummer to align each of your query genomes against the reference, and call variants that way (which could be considerably more accurate and easy to interpret, assuming you can indeed get such a reference). Ideally, you'd have perfectly complete and closed data, but thats not an option with short reads alone.

From the mummer webpage:

MUMmer can also align incomplete genomes; it can easily handle the 100s or 1000s of contigs from a shotgun sequencing project, and will align them to another set of contigs or a genome using the NUCmer program included with the system.

B) Unless I'm forgetting something about mummer you couldn't do the whole process in mummer even if you wanted to. SPAdes is an assembler, not an aligner. Running mummer on short reads would achieve nothing really.

C) Quast should give you sequence lengths etc, but you can calculate this on a contig by contig basis and sum the results easily in the shell. For a bacteria, its highly unlikely you'll have >10Mb, so I wouldn't worry about that. The number you get wont be perfect, as some of the contigs in your file will likely be junk, but it should get you a number to within a few hundred thousand base pairs.

D) You need to clarify what you mean by "different strains of bacteria" to answer this. Are these literally strains of the same species? If so, they are likely close enough for nucmer to give good results. If they are different species, you may need to experiment; start with nucmer and work upwards. I'm unconvinced about alignment from 6-frame translations as I don't know how good mummer's gene predictor is and this could give a lot of false positives.

Finally, you will probably want to consider re-mapping your reads to your assemblies with bwa/bowtie or similar, then calling variants with VarScan or BreSeq (if they're sufficiently close you could do this with a reference sequence, and avoid the need for assembly etc at all). This will check all of the reads for each assembly and see if there are variations versus the called base in the final assembly.

0
Entering edit mode

Thank you for your helpful response @Joe. My advisor is looking into finding out more about the samples but believes they are likely from the same strain. As a result, I blasted some of the contigs from the sample and downloaded a possible reference genome (I'll call it h.fasta here).

I tried/researched several of your advice here.

1) I ran Quast on the contigs output from SPAdes and indeed, as you said, it looks like contig reordering is necessary. So, I wanted to use Mauve software to reorder the contigs against the h.fasta reference file. I tried to do that following these instructions (http://darlinglab.org/mauve/user-guide/reordering.html) by inputing the h.fasta reference file and the SPAdes output (contigs.fasta) file. Unfortunately, I received an error:

Exception FileNotOpened thrown from
Unknown() in gnFileSource.cpp 67
Called by Unknown()
Exited with error code: 136


2) I guess at this point, I just want to see whether this reference h.fasta file seems legitimate for each of the four samples. I decided to then use BWA as follows:

bwa index h.fasta
bwa mem ./h.fasta ./S1_R1.fastq ./S1_R2.fastq > ./S1.sam (Repeat for all four samples)


Do you know of a quick way for me to check from these mapping results, whether the reference h.fasta may be appropriate for all four of these samples? It seems that QUAST also accepts .bed files and that I can continue BWA --> .SAM pathway above to also create a .bed file (http://seqanswers.com/forums/showthread.php?t=76018). So, should I create .bed files (BWA --> .sam --> .bam --> .bed) and then use that as input to QUAST? If not, what is a simple way for me to check the alignment quality of these four samples with the reference?

Hopefully, once I have that figured out, I can also try VarScan as you suggested or the viewer in Muave (which you also referred to). I apologize again if my questions are basic. I feel that, even after reading vignettes, I still stumble upon how to connect different software and pipelines. Thank you for any suggestions.

1
Entering edit mode

For point 1), I’m not sure what’s happening with that error. I’m not that great with Java, so it could be anything AFAIK. Are you using a recent version? (Mauve isn’t my favourite tool, but I don’t know a better contig mover personally).

For 2), I would simply check all the mapping stats with Qualimap. If you have high numbers of mapped reads (95%+ or similar), then its likely a good reference. Quast is more for assembly analysis, Qualimap is for mapping quality (which you just need an indexed & sorted SAM/BAM for).

You may need to permit BWA to have some relaxed mapping settings if you’re trying to identify variants (allow one or more mismatches etc). This is really a trial and error process though, so I would just experiment and see what the numbers look like.