Hi guys! I am an undergraduate student in bioinformatics, and I am starting to study and perform metagenomics/microbiome analysis.
I have some data generated on an Illumina HiSeq. My reads vary between 1 and 151 bp. Below I will briefly explain what I've done so far: I cleaned my data by running FastQC on each sample to identify overrepresented sequences, then removed them with Trimmomatic, together with the SLIDINGWINDOW:4:15 option for extra cleaning.
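For reference, the trimming step described above could look something like this. This is only a sketch: the jar version and all file names are placeholders, and the SLIDINGWINDOW:4:15 setting is the one mentioned in the post.

```shell
# Hedged sketch of a paired-end Trimmomatic run with SLIDINGWINDOW:4:15.
# File names and the jar version are placeholders, not from the original post.
java -jar trimmomatic-0.39.jar PE \
    sample_R1.fastq.gz sample_R2.fastq.gz \
    sample_R1.paired.fastq.gz sample_R1.unpaired.fastq.gz \
    sample_R2.paired.fastq.gz sample_R2.unpaired.fastq.gz \
    SLIDINGWINDOW:4:15
```

SLIDINGWINDOW:4:15 scans each read with a 4-base window and cuts once the average quality within the window drops below 15.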
I didn't set any minimum length for my reads; I am assuming that the more data the better, BUT I'm not sure about that. So, is there any minimum read length for metagenomic analysis?
Thanks in advance.
Did you remove over-represented sequences as identified by FastQC? Why? You should remove adapters and other contaminants, but not necessarily over-represented sequences: these may simply come from the most abundant organisms in your dataset, not from contamination of any kind.
You should describe your data in more detail (amplicon? shotgun metagenomics?) and explain what you mean by "perform metagenomic analysis". But, as already pointed out, 1 bp "reads" are useless. I would use 35 bp as a bare minimum, and even that is already quite short.
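In Trimmomatic, a length cutoff like the 35 bp bare minimum suggested above is applied with the MINLEN step. A minimal sketch with placeholder filenames (single-end mode for brevity):

```shell
# Hedged sketch: MINLEN:35 drops any read shorter than 35 bp after trimming.
# Filenames are placeholders; the post's SLIDINGWINDOW:4:15 setting is kept.
java -jar trimmomatic-0.39.jar SE \
    sample.fastq.gz sample.trimmed.fastq.gz \
    SLIDINGWINDOW:4:15 MINLEN:35
```

Steps are applied in the order given, so MINLEN should come last, after all trimming steps have shortened the reads.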
That is interesting; do you use Trimmomatic too? I mean, it ships FASTA files with adapter sequences for some Illumina kits (MiSeq, HiSeq, etc.).
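Those bundled adapter files are used through Trimmomatic's ILLUMINACLIP step. A hedged sketch, assuming a TruSeq3 paired-end kit and placeholder filenames; the path to the adapters/ directory depends on where Trimmomatic is installed:

```shell
# Hedged sketch: adapter removal with a bundled adapter FASTA, then
# quality trimming and a length filter. Filenames are placeholders.
# ILLUMINACLIP parameters: <fasta>:<seed mismatches>:<palindrome clip
# threshold>:<simple clip threshold>.
java -jar trimmomatic-0.39.jar PE \
    sample_R1.fastq.gz sample_R2.fastq.gz \
    out_R1.paired.fastq.gz out_R1.unpaired.fastq.gz \
    out_R2.paired.fastq.gz out_R2.unpaired.fastq.gz \
    ILLUMINACLIP:adapters/TruSeq3-PE.fa:2:30:10 \
    SLIDINGWINDOW:4:15 MINLEN:35
```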
The data comes from shotgun metagenomics. I tried to upload some images in my answer to Buffo, but I don't think it's working. Anyway, the majority of my reads are indeed over 100 bp with a Phred score of 30+.
I generally use bbduk.sh with the bundled list of adapter sequences. Sometimes I use UniVec to remove contaminants, or remove some obvious contaminants from specific projects - e.g., remove human reads from shotgun {soil,water,insect gut} metagenomics. But I don't remove the sequences flagged as "over-represented" by FastQC.
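The bbduk.sh adapter-trimming workflow mentioned above might look like the following sketch. The filenames are placeholders, and the ref= path assumes the adapters.fa file bundled in BBTools' resources/ directory; the k-mer parameters shown are commonly used defaults for adapter trimming, not values taken from the post:

```shell
# Hedged sketch: BBDuk adapter trimming against the bundled adapter list.
# ktrim=r  trims to the right of an adapter match;
# k=23     uses 23-mers for matching, mink=11 allows shorter matches
#          at read ends, hdist=1 allows one mismatch;
# tpe/tbo  keep read pairs trimmed to equal length / trim by overlap.
bbduk.sh in1=sample_R1.fastq.gz in2=sample_R2.fastq.gz \
    out1=clean_R1.fastq.gz out2=clean_R2.fastq.gz \
    ref=resources/adapters.fa \
    ktrim=r k=23 mink=11 hdist=1 tpe tbo
```

A second pass with ref=UniVec (or a host genome) and the same k-mer options can be used for the contaminant-removal step described above.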