Question: Read length for metagenomics analysis
magnuskerber wrote:

Hi guys! So, I am an undergraduate student in bioinformatics and I am starting to perform and study metagenomics/microbiome analysis.

I have some data generated on an Illumina HiSeq. My reads range from 1 to 151 bp. Here is a short summary of what I've done so far: I cleaned my data by using FastQC on each sample to identify overrepresented sequences. I removed them using Trimmomatic, together with the SLIDINGWINDOW:4:15 option for extra cleaning.
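
For readers unfamiliar with the option, the idea behind SLIDINGWINDOW:4:15 can be sketched roughly as follows: scan the read from the 5' end with a 4-base window and cut once the mean quality inside the window drops below 15. This is a simplified illustration in Python, not Trimmomatic's actual implementation (its exact cut position and its handling of very short reads may differ). The function name `sliding_window_trim` is hypothetical, and the qualities are assumed to be already decoded Phred scores.

```python
def sliding_window_trim(quals, window=4, min_mean=15):
    """Return the number of leading bases to keep.

    Simplified sketch: the read is cut at the start of the first window
    whose mean quality is below `min_mean`. Reads shorter than the window
    are returned untouched (an assumption of this sketch).
    """
    if len(quals) < window:
        return len(quals)
    for start in range(len(quals) - window + 1):
        win = quals[start:start + window]
        if sum(win) / window < min_mean:
            return start
    return len(quals)


# Example: good bases followed by a low-quality tail
keep = sliding_window_trim([30, 30, 30, 30, 30, 2, 2, 2])
```

Note that a step like this can shorten reads down to a handful of bases, which is why it is usually followed by a minimum-length filter.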

I didn't set any minimum length for my reads. I am assuming that the more data the better, BUT I'm not sure about that. So, is there a minimum read length for metagenomic analysis?

Thanks in advance.

Tags: metagenomic, reads, assembly, gene
modified 13 days ago by Buffo • written 13 days ago by magnuskerber

> I cleaned my data by using FastQC on each sample to identify overrepresented sequences.

Did you remove the over-represented sequences identified by FastQC? Why? You should remove adapters and other contaminants, but not necessarily over-represented sequences: these may come from the more abundant organisms in your dataset, not contamination of any kind.

You should describe your data in more detail (amplicon? shotgun metagenomics?) and explain what you mean by "perform metagenomic analysis". But, as already pointed out, 1 bp "reads" are useless. I would use 35 bp as a bare minimum, but even this length is already too short.

written 13 days ago by h.mon

> Did you remove the over-represented sequences identified by FastQC? Why? You should remove adapters and other contaminants, but not necessarily over-represented sequences: these may come from the more abundant organisms in your dataset, not contamination of any kind.

That is interesting. Do you use Trimmomatic too? I mean, it provides some FASTA files with adapters for some Illumina kits (MiSeq, HiSeq, etc.).

> You should describe your data in more detail (amplicon? shotgun metagenomics?) and explain what you mean by "perform metagenomic analysis". But, as already pointed out, 1 bp "reads" are useless. I would use 35 bp as a bare minimum, but even this length is already too short.

The data comes from shotgun metagenomics. I tried to upload some images in my answer to Buffo, but I don't think it's working. Anyway, the majority of my reads are indeed over 100 bp with a Phred score of 30+.

modified 13 days ago by h.mon • written 13 days ago by magnuskerber

I generally use bbduk.sh with the bundled list of adapter sequences. Sometimes I use UniVec to remove contaminants, or remove some obvious contaminants from specific projects - e.g., remove human reads from shotgun {soil,water,insect gut} metagenomics. But I don't remove the sequences flagged as "over-represented" by FastQC.

written 13 days ago by h.mon

JC wrote:

A lot depends on what the next analysis is; the minimum size will depend on your aligner or k-mer classifier. But in general, anything shorter than 25 bases cannot be informative.

The problem is that if the sequence is too short, there is a high probability of getting a random hit, and you will be wasting computing time.
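
To put a rough number on the random-hit argument: under the simplifying assumption of a random reference with equal base frequencies, a given sequence of length k is expected to occur about N / 4^k times by chance among N reference positions. The sketch below is only this back-of-the-envelope calculation. The function name and the 1 Gbp reference size are illustrative assumptions, and real genomes are not random, so treat the results as orders of magnitude.

```python
def expected_random_hits(read_len, ref_positions):
    """Expected exact chance matches of one read on one strand of a random reference."""
    return ref_positions / 4 ** read_len


# Hypothetical reference of one billion positions
for k in (10, 15, 25, 35):
    print(k, expected_random_hits(k, 1_000_000_000))
```

With these assumptions, a 15-base sequence is expected to match about once by chance, while a 25-base sequence matches by chance less than once in a million reads.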

written 13 days ago by JC

Yes, I thought about that. The thing is, I didn't find any defined number for read size. I will use MEGAHIT for assembly. Do you use 25 bp on a regular basis?

written 13 days ago by magnuskerber

To paraphrase what JC said: a read of any length is useless if it can't overlap with another read, because that's the only way to extend it. With that in mind, short overlaps are not helpful either, because they can occur by chance. The minimal informative overlap is dependent on the complexity of the assembled (meta)genome, but 25 is a reasonable cut-off. This means that a 25-nucleotide read would barely be long enough to reliably overlap with other reads, but wouldn't really extend them very much.
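
The overlap idea can be illustrated with a tiny sketch: find the longest suffix of one read that exactly matches a prefix of another, and ignore it if it is shorter than a chosen minimum. This is only a toy example in Python with a hypothetical function name and exact matching. Assemblers such as MEGAHIT work on de Bruijn graphs of k-mers rather than on pairwise read overlaps, so this shows the principle, not how the assembler does it.

```python
def suffix_prefix_overlap(a, b, min_len=25):
    """Length of the longest suffix of `a` equal to a prefix of `b`.

    Returns 0 if no exact overlap of at least `min_len` bases exists.
    """
    shortest_allowed = max(min_len, 1)
    for n in range(min(len(a), len(b)), shortest_allowed - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0
```

A 25-base read can overlap another read by at most 25 bases, so with a 25-base minimum it can only confirm sequence that is already there, which is the point made above.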

written 13 days ago by Mensur Dlakic

Mensur Dlakic wrote:

You have so few reads below 70 bp that it really doesn't matter. Let me put it this way: making a cut-off at either 50 or 70 bp will definitely not hurt your assembly. I don't think you need to go above 70.
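
If you want to apply such a cut-off yourself, a minimal sketch in plain Python could look like the following. It assumes uncompressed, single-end FASTQ with exactly 4 lines per record, and the function name is made up for illustration. For paired-end data the mates must be kept in sync, which dedicated trimmers (for example Trimmomatic's MINLEN step) take care of.

```python
from itertools import islice


def filter_fastq_by_length(in_path, out_path, min_len=70):
    """Copy records whose sequence has at least `min_len` bases.

    Returns a (kept, total) tuple of record counts.
    """
    kept = total = 0
    with open(in_path) as fin, open(out_path, "w") as fout:
        while True:
            record = list(islice(fin, 4))  # header, sequence, '+', qualities
            if len(record) < 4:
                break  # end of file (or a truncated last record)
            total += 1
            if len(record[1].strip()) >= min_len:
                fout.writelines(record)
                kept += 1
    return kept, total
```

Comparing `kept` and `total` for a cut-off of 50 versus 70 shows directly how few reads are affected by the choice.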

written 13 days ago by Mensur Dlakic

Ok, thank you Mensur. I'll make a 70 bp cut then. My data has good quality, so that is a good thing. I have a question about removing the overrepresented sequences; in my case they are around 50 bp (I know that a 70 bp cut will remove them, but I'm asking out of curiosity). Here is an example of the overrepresented sequences found in a sample:

[image: over-represented]

Ok, some are adapters. But, for example, when I BLAST the third sequence from the top I get something like this:

[image: Screenshot-from-2019-08-09-12-24-03]

I get many matches, to different organisms. So, how should I handle these over-represented sequences? Wouldn't removing all of them prior to the analysis help me avoid false positives? What is your opinion on this matter?

PS: Sorry about the long comment, but Biostars lets me comment only 5 times/day, so I'm using my resources the best way I can.

written 11 days ago by magnuskerber

You need to clean up your reads first: use Trimmomatic/cutadapt to remove adapters before doing anything else.

written 11 days ago by JC

Trim adapters first (always). You may check the overrepresented k-mers after trimming adapters, but be careful. If I am not mistaken, the latest versions of FastQC have the over-represented k-mer module disabled by default, as the authors considered it to be potentially misleading.

What makes you think the bacteria from the over-represented sequences are false positives? You have shotgun metagenomics data, after all, and these sequences may belong to the dominant bacteria in your samples.

written 11 days ago by h.mon

I will echo what JC and h.mon told you: make sure to remove the adapters. A tool that has worked well for me and hasn't been mentioned here is AdapterRemoval. Other than that, I would not remove any over-represented sequences. Specifically, there is nothing surprising in finding a large number of rRNA sequences, as many bacterial species have half a dozen or so rrn operon copies. Any conserved piece of rRNA will be present in many if not all bacterial species in your sample, and possibly in 3-8 copies. Again, unless you have separate evidence that some of these over-represented sequences are non-bacterial repeats, I would not exclude them.

written 11 days ago by Mensur Dlakic

Thank you all, your answers are helping me a lot. Yes, I will always trim my data prior to analysis. I got the whole idea about the over-represented sequences: they're not necessarily a bad thing. My library was generated using the TruSeq kit from Illumina. I got the adapter sequences from their website:

Read1: AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
Read2: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT

So, I trimmed a sample using those adapter sequences. The FastQC adapter content results are shown below:

[image: Screenshot-from-2019-08-09-16-00-50]

FastQC flags this plot with a green sign, which is good. But some reads appear to have a little bit of contamination, less than 10%. Is that concerning? I guess it's not, but just checking.
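
One way to sanity-check the remaining adapter signal is to count how many reads still contain the start of the adapter. The sketch below uses the first 13 bases shared by the two sequences quoted above (AGATCGGAAGAGC) and exact matching only, so it will miss adapters with sequencing errors and partial adapters at the very end of a read. The function name is hypothetical and the input is assumed to be any iterable of read sequences.

```python
ADAPTER_PREFIX = "AGATCGGAAGAGC"  # shared start of the two TruSeq sequences quoted above


def adapter_fraction(sequences, adapter=ADAPTER_PREFIX):
    """Fraction of reads that contain an exact copy of `adapter`."""
    total = hits = 0
    for seq in sequences:
        total += 1
        if adapter in seq.upper():
            hits += 1
    return hits / total if total else 0.0
```

Running this on the reads before and after trimming gives a simple before/after comparison that does not depend on how a plot is scaled.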

written 11 days ago by magnuskerber

Buffo wrote:

Do you consider a read of 1 nucleotide in length useful? Amplicon sequencing may cause some redundancy because you are sequencing a highly conserved region. So you could plot a histogram of the read length distribution and make a decision based on that.
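
A read length distribution like this can be tabulated with a few lines of Python. The sketch below assumes uncompressed FASTQ with exactly 4 lines per record; both function names are hypothetical.

```python
from collections import Counter
from itertools import islice


def read_length_counts(fastq_path):
    """Count reads per length in an uncompressed 4-line-per-record FASTQ file."""
    counts = Counter()
    with open(fastq_path) as fh:
        # Sequence lines are lines 2, 6, 10, ... (index 1, then every 4th line)
        for seq in islice(fh, 1, None, 4):
            counts[len(seq.strip())] += 1
    return counts


def fraction_below(counts, cutoff):
    """Fraction of reads shorter than `cutoff` bases."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    short = sum(n for length, n in counts.items() if length < cutoff)
    return short / total
```

Calling `fraction_below` with a few candidate cut-offs (for example 25, 50 and 70) shows how many reads each choice would discard.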

written 13 days ago by Buffo

> Do you consider a read of 1 nucleotide in length useful?

Hahah, of course not. About the highly conserved region: the data is not only 16S, I'm not sure if you were referring to that.

> So you could plot a histogram of the read length distribution and make a decision based on that.

So, this is interesting. Below is an example of the FastQC length and quality plots after trimming.

[image: Length distribution]

[image: Quality distribution]

modified 13 days ago by h.mon • written 13 days ago by magnuskerber

I have added the correct links to the images; please check "How to add images to a Biostars post".

written 13 days ago by h.mon

Thanks h.mon, those are the images. So, I think I'll use a cutoff of 50 bp. Is that OK based on the plots? Or can I go a little bit higher, like 70-90?

written 13 days ago by magnuskerber