Hi all,
I am a newbie to metagenomics and often find it confusing to work out how to analyse my data. I used an Illumina NextSeq (2 x 150 bp) to sequence a microbial community.
I used FastQC and Trimmomatic for quality control, and I assembled the reads with IDBA-UD, setting mink=20 and maxk=100 for constructing the de Bruijn graph.
There are a lot of output files (contig-20.fa, contig-40.fa, ..., contig-100.fa, contig.fa and scaffold.fa). I would like to do functional annotation and maybe binning later.
Here are the questions:
- Which file(s) should I use? I have the log file showing the statistics, but I don't know which criteria I should base the choice on.
- What programs do you suggest for functional annotation?
- I intend to use MetaBAT for binning, but it needs a BAM file. How can I generate one?
Thanks for taking the time to read my question. If anything needs clarifying, please let me know.
Cheers and many thanks
Alan
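On the BAM question above: a BAM file for binning is usually made by mapping the trimmed reads back to the assembly, then sorting and indexing the alignments. Below is a minimal Python sketch that only builds the shell commands, assuming BWA and samtools (1.3 or later) are installed. The read file names and the thread count are hypothetical placeholders, and Bowtie2 would work as the mapper just as well.

```python
import shlex


def mapping_commands(assembly, reads_1, reads_2, out_bam, threads=4):
    """Build shell commands that map paired reads to an assembly.

    Returns three command strings: index the assembly, map and sort
    into a BAM file, then index the BAM file.
    """
    # Quote every path so names with spaces do not break the shell command.
    a, r1, r2, bam = (shlex.quote(p) for p in (assembly, reads_1, reads_2, out_bam))
    return [
        f"bwa index {a}",
        # bwa mem writes SAM to stdout; samtools sort reads it from stdin ("-").
        f"bwa mem -t {threads} {a} {r1} {r2} | samtools sort -@ {threads} -o {bam} -",
        f"samtools index {bam}",
    ]


if __name__ == "__main__":
    # Hypothetical file names; replace them with your own trimmed reads.
    for command in mapping_commands(
        "contig.fa", "trimmed_R1.fastq.gz", "trimmed_R2.fastq.gz", "contig.sorted.bam"
    ):
        print(command)
```

The printed commands can be pasted into a terminal one at a time. The sorted BAM file is what the binning step uses to get per-contig coverage.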
Thank you Asaf.
I asked in the IDBA Google group, and someone there suggested using contig.fa because "Scaffold file gonna have lots of Ns (not useful for alignment)" https://groups.google.com/forum/#!topic/hku-idba/D8D46jDjXHE . She suggested that contig.fa is better for running BLAST.
Is scaffold.fa better than contig.fa for binning?
Should I use contig.fa for annotation?
Cheers
Alan
You can check how many N's you actually have in your data. I don't think it really matters for annotation (you'll get partial proteins anyway). For binning, though, scaffolds might be more useful.
Thanks again. How can I check? Is it in the log file? Also, how do you determine the quality of the assembly? I got an N50 of ~1000. Is that too low? If so, how can I improve the quality of the assembly?
It's pretty low... An average protein-coding gene is about 1000 bp long, so half of the assembly will contain fragmented proteins. You can check the number of N's in the sequences themselves, and you can also compare the N50 of the contigs with that of the scaffolds. You could try assembling with metaSPAdes 3.9.0; it should give better results.
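To make the N count and N50 comparison concrete, here is a minimal Python sketch that reads a FASTA file and reports the number of sequences, total length, N50, number of N's and N's per 100 kbp. It assumes plain uncompressed FASTA, and the function names are made up for illustration. The numbers may differ a little from QUAST, which by default ignores contigs shorter than 500 bp.

```python
import sys


def parse_fasta(lines):
    """Yield one sequence string per FASTA record."""
    chunks = []
    started = False
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if started:
                yield "".join(chunks)
            chunks = []
            started = True
        elif started:
            chunks.append(line)
    if started:
        yield "".join(chunks)


def n50(lengths):
    """Length L such that sequences of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def assembly_stats(lines):
    """Summarise an assembly given an iterable of FASTA lines."""
    seqs = list(parse_fasta(lines))
    lengths = [len(s) for s in seqs]
    total = sum(lengths)
    n_count = sum(s.upper().count("N") for s in seqs)
    return {
        "sequences": len(seqs),
        "total_bp": total,
        "n50": n50(lengths),
        "n_count": n_count,
        "n_per_100kbp": 100000 * n_count / total if total else 0.0,
    }


if __name__ == "__main__":
    # Usage: python assembly_stats.py contig.fa scaffold.fa
    for path in sys.argv[1:]:
        with open(path) as handle:
            print(path, assembly_stats(handle))
```

Running it on both contig.fa and scaffold.fa gives the two N50 values side by side, plus how many N's the scaffolding step introduced.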
Yeah that was just from the log file of IDBA-UD.
I just ran QUAST on scaffold.fa and it reports an N50 of 1942, while the N50 of contig.fa is 1540. These are the results from the assembly with mink=20 and maxk=100.
I ran IDBA-UD again with mink=100 and maxk=121, and QUAST shows that the N50 of scaffold.fa rises to 3334. However, at the same time the number of contigs dropped roughly 10-fold (from 52k to 6.1k).
N's per 100 kbp ranged from 32 to 38 for scaffold.fa.
I will try running IDBA again with mink=60 and maxk=124 to see what I can get.
I wouldn't recommend raising mink. I still suggest running SPAdes.
Thanks for all the suggestions.
Hi Asaf,
Is it necessary to map my reads before I use Prodigal for protein prediction?
Or is mapping reads only necessary for downstream binning?
Cheers
Alan