Neophyte in WGS. Advice for a hypothetical pipeline.
3.1 years ago
lugonauta

Hi everyone, I'm new to WGS; in fact, this is my first time! I'm trying to sequence a Lactobacillus fermentum strain. I would like to deposit it in GenBank, so I need a robust and reliable final product, but it is not necessary to close the genome (maybe a draft genome composed of 100-200 contigs or so; I don't know if there is a maximum allowed). I sent the strain to a sequencing service (they use MiSeq and HiSeq platforms), and they sent me back the FASTQ files (paired-end). I'm looking for some advice regarding the WGS strategy I'm following:

• SPAdes for de novo assembly
• QUAST for quality assessment of the assembly
• Mauve for reference-guided ordering of contigs
• manual joining of contigs, if possible, based on the Mauve FASTA output (I don't know how to do this, or whether there is suitable, user-friendly software for it)

I obtained 521 contigs from SPAdes. QUAST gave me the following quality report (all statistics are based on contigs of size >= 500 bp):

Statistics without reference    Galaxy269__SPAdes_on_data_264…
contigs                         126
contigs (>= 0 bp)               521
contigs (>= 1000 bp)            111
contigs (>= 5000 bp)            71
contigs (>= 10000 bp)           55
contigs (>= 25000 bp)           27
contigs (>= 50000 bp)           8
Largest contig                  141241
Total length                    2030998
Total length (>= 0 bp)          2080970
Total length (>= 1000 bp)       2021028
Total length (>= 5000 bp)       1907111
Total length (>= 10000 bp)      1778686
Total length (>= 25000 bp)      1313173
Total length (>= 50000 bp)      619159
N50                             34972
N75                             19093
L50                             18
L75                             37
GC (%)                          51.88
Mismatches
N's                             0
N's per 100 kbp                 0
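
For readers unfamiliar with the N50/L50 and N75/L75 rows, here is a minimal Python sketch of how such values are usually defined: sort the contigs from longest to shortest and walk down the list until the running total reaches the chosen fraction of the assembly length. The function name `nx_lx` and the toy lengths are illustrative assumptions, not part of QUAST or of the data above.

```python
def nx_lx(lengths, fraction=0.5):
    """Return (Nx, Lx) for a list of contig lengths.

    Nx is the length of the contig at which the running total, taken from
    the longest contig downwards, first reaches `fraction` of the total
    assembly length. Lx is how many contigs were needed to get there.
    """
    ordered = sorted(lengths, reverse=True)
    target = sum(ordered) * fraction
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= target:
            return length, count
    raise ValueError("no contigs given")


# Hypothetical toy lengths, not the assembly reported above.
toy = [100, 80, 60, 40, 20]
print(nx_lx(toy))        # (N50, L50)
print(nx_lx(toy, 0.75))  # (N75, L75)
```

Because the values depend on the total length, applying a minimum contig length before computing them (as the report does at 500 bp) can shift the result slightly.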


I have two main questions:

• Is it recommended to filter out small contigs? Where should I put the threshold? If I filter out contigs <1000 nt, I retain only 111 contigs (a reasonable number to deposit in GenBank?).

• Could I join contigs by hand? For instance, by looking for overlapping regions at the ends of two adjacent contigs (after ordering them with Mauve).

Sorry about my poor knowledge of the field; maybe some of these questions are silly.

I would appreciate some help! Thanks in advance :)

Why not map your small contigs to a closely related reference genome to see what you would exclude?

I would also do structural and functional annotation of the contigs using, e.g., Prokka, InterProScan or Blast2GO before excluding any short contigs. Why not look at the papers in Genome Announcements or at current assemblies in GenBank to see how small the smallest contigs are? I would tend towards excluding contigs <200 bp or <500 bp rather than <1000 bp.
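
As a rough illustration of the filtering step, here is a minimal Python sketch that separates contigs by a length threshold and keeps the short ones aside, so they can be mapped or annotated before anything is thrown away. The function names (`read_fasta`, `split_by_length`) and the toy sequences are assumptions made for the example; they do not come from SPAdes or any other tool mentioned here.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_by_length(records, min_len=500):
    """Split records into (kept, dropped) around a minimum contig length."""
    kept, dropped = [], []
    for header, seq in records:
        (kept if len(seq) >= min_len else dropped).append((header, seq))
    return kept, dropped


# Toy input with a tiny threshold so the example fits on screen.
demo = io.StringIO(">c1\nACGTACGTAC\n>c2\nACG\n>c3\nACGTA\nCGT\n")
kept, dropped = split_by_length(read_fasta(demo), min_len=5)
print([h for h, _ in kept])     # contigs at or above the threshold
print([h for h, _ in dropped])  # contigs to inspect before discarding
```

With a real assembly, `demo` would be replaced by something like `open("contigs.fasta")` (a hypothetical file name), and the dropped records could be written to their own FASTA file for mapping against the reference.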

I would only hand-assemble contigs if they really clearly fit a reference genome exactly, e.g., with 0 bp gaps between two exactly aligned contigs. That'll be a lot of work though!
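
To make the idea of checking contig ends concrete, here is a minimal Python sketch that looks for an exact overlap between the end of one contig and the start of the next, assuming both are already in the same orientation (for example after ordering against a reference). The names `end_overlap` and `join_if_overlapping`, the `min_len` default, and the toy sequences are all illustrative assumptions. An exact match is only a hint: short-read assemblies often break at repeats, so an apparent overlap should still be checked against the reference or the reads.

```python
def end_overlap(left, right, min_len=20):
    """Length of the longest exact match between the end of `left` and
    the start of `right`; returns 0 if it is shorter than `min_len`."""
    max_len = min(len(left), len(right))
    for size in range(max_len, min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def join_if_overlapping(left, right, min_len=20):
    """Merge two contigs on an exact end overlap, or return None."""
    size = end_overlap(left, right, min_len)
    if size == 0:
        return None
    return left + right[size:]


# Toy sequences with a small min_len so the overlap is easy to see.
a = "ACGTTGCAAGGCT"
b = "AAGGCTTTACG"
print(end_overlap(a, b, min_len=5))
print(join_if_overlapping(a, b, min_len=5))
```

This deliberately ignores reverse-complemented contigs and inexact overlaps; handling either would need an aligner rather than a string comparison.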

OK, I will try mapping the small contigs against a reference genome (using Mauve) before excluding them. Anyway, would you recommend any software to re-assemble contigs into scaffolds? Thanks

No. Unless you have long-range information, you're not going to get far. Otherwise, a complete reassembly might be useful. Failing that, the gold standard now is long reads, followed by correction of the contigs with Illumina reads using Pilon / Racon.