To start, say I have 5 samples (S1, S2 ... S5), each with a duplicate (S1a, S1b, S2a, S2b ... S5b). They are paired-end 2x150 bp FASTQ files sequenced on the NextSeq 500 (Nextera) platform.
Below is the plan for the metagenomics analysis:
1. QC: use bbduk and Trimmomatic for quality control
2. FastQC to check quality
3. Assemble with MEGAHIT
4. Align and map reads with bwa/bbmap
5. Binning
6. Use Prodigal to call genes on the assembled contigs for functional annotation
7. Quantify the annotated genes of the metagenome and export them to a TSV file
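For step 7, the output I have in mind is a gene-by-sample count table. Here is a minimal sketch of that layout, assuming the per-gene read counts are already available as a nested dictionary; the function name, gene IDs and numbers are made up for illustration and do not come from any particular tool.

```python
import csv
import io


def write_count_table(counts, samples, handle):
    """Write a gene-by-sample count table as TSV.

    counts:  dict mapping gene ID -> dict mapping sample name -> read count.
    samples: column order; a sample missing for a gene is written as 0.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["gene_id"] + list(samples))
    for gene_id in sorted(counts):
        row = [counts[gene_id].get(sample, 0) for sample in samples]
        writer.writerow([gene_id] + row)


# Made-up example values, only to show the layout of the table.
example_counts = {
    "contig_1_gene_1": {"S1": 12, "S2": 0},
    "contig_1_gene_2": {"S1": 3},
}
buffer = io.StringIO()
write_count_table(example_counts, ["S1", "S2"], buffer)
print(buffer.getvalue(), end="")
```

In a real run the handle would be an open file rather than an in-memory buffer.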
And here are the questions:
1. I know I will have to concatenate the duplicate samples (S1a and S1b together), but can I concatenate all 10 files together before the MEGAHIT assembly and still separate the samples afterwards, so that I know which sample each quantified gene comes from?
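To make question 1 concrete, this is a rough sketch of what I am considering: co-assemble all files in one MEGAHIT run (its -1/-2 options accept comma-separated lists, so no physical concatenation is needed), then map each replicate back to the co-assembly separately so the counts stay tied to their sample. The file naming pattern (S1a_R1.fastq.gz and so on), the output names and the helper functions are assumptions for illustration, and the code only builds the command lines without running them.

```python
import re
from collections import defaultdict


def group_by_sample(r1_files):
    """Group R1 files by sample, treating a trailing a/b as the replicate label.

    Assumes hypothetical names like S1a_R1.fastq.gz with no directory part.
    """
    groups = defaultdict(list)
    for path in sorted(r1_files):
        match = re.match(r"(S\d+)[ab]_R1\.fastq\.gz$", path)
        if match is None:
            raise ValueError(f"unexpected file name: {path}")
        groups[match.group(1)].append(path)
    return dict(groups)


def megahit_command(groups, out_dir):
    """Build one co-assembly command with comma-separated read lists."""
    r1 = [path for sample in sorted(groups) for path in groups[sample]]
    r2 = [path.replace("_R1.", "_R2.") for path in r1]
    return ["megahit", "-1", ",".join(r1), "-2", ",".join(r2), "-o", out_dir]


def mapping_commands(groups, contigs):
    """Build one `bwa mem` command per replicate against the co-assembly.

    The contigs must be indexed first (`bwa index`), and `bwa mem` writes
    SAM to stdout, so each command still needs its output redirected.
    """
    commands = {}
    for sample in sorted(groups):
        for r1 in groups[sample]:
            r2 = r1.replace("_R1.", "_R2.")
            replicate = r1.split("_R1.")[0]
            commands[replicate] = ["bwa", "mem", contigs, r1, r2]
    return commands


files = ["S1a_R1.fastq.gz", "S1b_R1.fastq.gz", "S2a_R1.fastq.gz"]
groups = group_by_sample(files)
print(" ".join(megahit_command(groups, "coassembly")))
for replicate, command in mapping_commands(groups, "contigs.fa").items():
    print(replicate, " ".join(command))
```

Per-sample counts would then come from summing the replicate-level mappings for each sample.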
2. The reason I want to concatenate the samples is that I get slightly higher mapping rates with larger samples. What is the usual mapping rate of reads back to a de novo assembly? I am only getting mapping rates of ~30-45%. Is that normal, and how can I improve it?
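By mapping rate I mean mapped reads divided by total reads. As a sketch of how that number can be derived, the snippet below parses text in the default `samtools flagstat` layout; using flagstat here is an assumption for illustration, and the counts in the example are made up rather than taken from my data.

```python
import re


def mapping_rate(flagstat_text):
    """Return mapped / total from default-format `samtools flagstat` text.

    flagstat counts alignment records, so secondary and supplementary
    alignments are included in both numbers.
    """
    total = re.search(r"^(\d+) \+ \d+ in total", flagstat_text, re.MULTILINE)
    mapped = re.search(r"^(\d+) \+ \d+ mapped \(", flagstat_text, re.MULTILINE)
    if total is None or mapped is None:
        raise ValueError("could not find the 'in total' or 'mapped' line")
    total_count = int(total.group(1))
    if total_count == 0:
        raise ValueError("no reads in flagstat output")
    return int(mapped.group(1)) / total_count


# Made-up flagstat excerpt, only to show the expected input layout.
example = """1000 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
380 + 0 mapped (38.00% : N/A)
"""
rate = mapping_rate(example)
print(f"mapping rate: {rate:.2%}")
```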
3. I downloaded a script that removes all contigs shorter than 1000 bp right after assembly. Should I run it before or after mapping the reads? (Generally, keeping only contigs >1 kbp may make binning draft genomes easier.)
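For reference, the filtering step amounts to something like the sketch below. This is not the script I downloaded, just a minimal plain-Python equivalent; the function names and the tiny example sequences are made up for illustration.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_contigs(in_handle, out_handle, min_length=1000):
    """Copy contigs of at least min_length bp; return how many were kept."""
    kept = 0
    for header, sequence in read_fasta(in_handle):
        if len(sequence) >= min_length:
            out_handle.write(f">{header}\n{sequence}\n")
            kept += 1
    return kept


# Tiny made-up example with a lowered threshold so the effect is visible.
contigs = io.StringIO(">contig_a\nACGT\nACGT\n>contig_b\nAC\n")
filtered = io.StringIO()
kept = filter_contigs(contigs, filtered, min_length=5)
print(kept, "contig(s) kept")
print(filtered.getvalue(), end="")
```

Whichever order is right, the reads would need to be mapped against the same contig set that goes into binning.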
4. Any recommendations for programs that can annotate a metagenome against the KEGG, COG and CAZy databases? The web-based tools cannot handle the large size of my samples.
Thank you in advance! I am new to metagenomics analysis, so feel free to correct me if I am wrong! :)