Question: Read length for metagenomics analysis
0
3 months ago by
magnuskerber0 wrote:

Hi guys! So, I am an undergraduate student in bioinformatics and I am starting to perform and study metagenomics/microbiome analysis.

I have some data generated on an Illumina HiSeq. My reads vary between 1 and 151 bp. Below I briefly explain what I've done so far: I cleaned my data by running FastQC on each sample to identify overrepresented sequences, then removed them with Trimmomatic, together with the SLIDINGWINDOW:4:15 option for extra cleaning.

I didn't set any minimum length for my reads. I am assuming that the more data the better, but I'm not sure about that. So, is there a minimum read length for metagenomic analysis?

Thanks in advance.

metagenomic reads assembly gene • 339 views
ADD COMMENTlink modified 3 months ago by Buffo1.7k • written 3 months ago by magnuskerber0

I cleaned my data, by using FastQC on each sample to identify overrepresented sequences.

Did you remove over-represented sequences as identified by FastQC? Why? You should remove adapters and other contaminants, but not necessarily over-represented sequences: these may represent the more abundant organisms in your dataset, not contamination of any kind.

You should explain your data in more detail (amplicon? shotgun metagenomics?) and what you mean by "perform metagenomic analysis". But, as already pointed out, 1 bp "reads" are useless. I would use 35 bp as a bare minimum, but even that is already too short.

ADD REPLYlink written 3 months ago by h.mon28k

Did you remove over-represented sequences as identified by FastQC? Why? You should remove adapters and other contaminants, but not necessarily over-represented sequences: these may represent the more abundant organisms in your dataset, not contamination of any kind.

That is interesting, do you use Trimmomatic too? I mean, it provides some FASTA files with adapters for some Illumina kits (MiSeq, HiSeq, etc.).

You should explain your data in more detail (amplicon? shotgun metagenomics?) and what you mean by "perform metagenomic analysis". But, as already pointed out, 1 bp "reads" are useless. I would use 35 bp as a bare minimum, but even that is already too short.

The data comes from Shotgun Metagenomics, I tried to upload some images in my answer to Buffo but I don't think it's working. Anyways, the majority of my reads are indeed over 100bp with a phred score of 30+.

ADD REPLYlink modified 3 months ago by h.mon28k • written 3 months ago by magnuskerber0

I generally use bbduk.sh with the bundled list of adapter sequences. Sometimes I use UniVec to remove contaminants, or remove some obvious contaminants from specific projects - e.g., remove human reads from shotgun {soil,water,insect gut} metagenomics. But I don't remove the sequences flagged as "over-represented" by FastQC.

ADD REPLYlink written 3 months ago by h.mon28k
3
3 months ago by
JC9.1k
Mexico
JC9.1k wrote:

A lot depends on what the next analysis is: the minimum size will depend on your aligner or k-mer classifier. But in general, anything shorter than 25 bases cannot be informative.

The problem is that if the sequence is too short, there is a high probability of getting a random hit, and you will be wasting computing time.
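
To put rough numbers on the random-hit argument, here is a minimal sketch. It assumes a uniform random sequence model with equal base frequencies (real metagenomes are compositionally biased) and exact matches only, so the expected number of chance matches of a read of length k in a database of N bases is about N / 4^k, doubled if both strands are searched. The function name and the 1 Gbp database size are illustrative, not taken from any tool.

```python
def expected_random_hits(read_len: int, db_size: int, both_strands: bool = True) -> float:
    """Expected number of exact chance matches of a read of length read_len
    in db_size bases of random sequence (uniform base frequencies assumed)."""
    positions = max(db_size - read_len + 1, 0)  # possible start positions
    strands = 2 if both_strands else 1
    return strands * positions / 4 ** read_len


if __name__ == "__main__":
    db_size = 1_000_000_000  # hypothetical 1 Gbp reference collection
    for k in (10, 15, 20, 25, 35):
        hits = expected_random_hits(k, db_size)
        print(f"{k:>3} bp: {hits:.3g} expected chance hits")
```

Aligners and classifiers tolerate mismatches, so real chance-hit rates are higher than this exact-match estimate; the point is only how quickly the number drops as reads get longer.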

ADD COMMENTlink written 3 months ago by JC9.1k

Yes, I thought about that. The thing is, I didn't find any defined number for read size. I will use MEGAHIT for assembly. Do you use 25 bp on a regular basis?

ADD REPLYlink written 3 months ago by magnuskerber0

To paraphrase what JC said: a read of any length is useless if it can't overlap with another read, because that's the only way to extend it. With that in mind, short overlaps are not helpful either, because they can occur by chance. The minimal informative overlap is dependent on the complexity of the assembled (meta)genome, but 25 is a reasonable cut-off. This means that a 25-nucleotide read would barely be long enough to reliably overlap with other reads, but wouldn't really extend them very much.

ADD REPLYlink written 3 months ago by Mensur Dlakic2.2k
3
3 months ago by
Mensur Dlakic2.2k
USA
Mensur Dlakic2.2k wrote:

You have so few reads below 70 bp that it really doesn't matter. Let me put it this way: making a cut-off at either 50 or 70 bp will definitely not hurt your assembly. I don't think you need to go above 70.
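
One way to check how little a given cut-off costs is to count the fraction of reads that would survive it. The sketch below is a minimal illustration, assuming plain 4-line-per-record FASTQ with no blank lines; the function names are made up for this example. In practice the cut itself would be done by the trimmer (Trimmomatic, for instance, has a MINLEN step).

```python
import io
from typing import Iterator, TextIO, Tuple

FastqRecord = Tuple[str, str, str]  # (header, sequence, quality)


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (blank lines are not supported here)
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def fraction_retained(handle: TextIO, min_len: int) -> float:
    """Fraction of reads whose length is at least min_len."""
    total = kept = 0
    for _, seq, _ in read_fastq(handle):
        total += 1
        kept += len(seq) >= min_len
    return kept / total if total else 0.0


if __name__ == "__main__":
    # Toy data only; replace with open("sample.fastq") for a real file.
    toy = "@r1\nACGT\n+\nIIII\n@r2\nACGTACGTAC\n+\nIIIIIIIIII\n"
    for cutoff in (5, 50, 70):
        print(cutoff, fraction_retained(io.StringIO(toy), cutoff))
```

For paired-end data, keep in mind that dropping one mate affects the pair, so a per-file count like this is only a first look.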

ADD COMMENTlink written 3 months ago by Mensur Dlakic2.2k

Ok, thank you Mensur. I'll make a 70 bp cut then. My data is of good quality, so that is a good thing. I have a question about removing overrepresented sequences; in my case they are around 50 bp (I know that a 70 bp cut will remove them anyway, but I'm asking out of curiosity). Here is an example of overrepresented sequences found in a sample:

over-represented

Ok, some are adapters. But for example when I BLAST the third sequence from top-down I get something like this:

Screenshot-from-2019-08-09-12-24-03

I get many matches to different organisms. So, how should I handle these over-represented sequences? Wouldn't removing all of them prior to the analysis help me avoid false positives? What is your opinion on this matter?

PS.: Sorry about the long comment. But biostars let me comment only 5 times/day so I'm using my resources the best way I can.

ADD REPLYlink written 3 months ago by magnuskerber0
2

You need to clean up your reads first: use Trimmomatic or cutadapt to remove adapters before doing anything else.

ADD REPLYlink written 3 months ago by JC9.1k
2

Trim adapters first (always). You may check the overrepresented k-mers after trimming adapters, but be careful. If I am not mistaken, recent versions of FastQC have the k-mer module disabled by default, as the authors considered it to be potentially misleading.

What makes you think the bacteria from the over-represented sequences are false positives? You have shotgun metagenomics data, after all, and these sequences may belong to the dominant bacteria from your samples.

ADD REPLYlink written 3 months ago by h.mon28k
2

I will echo what JC and h.mon told you: make sure to remove the adapters. A tool that has worked well for me and hasn't been mentioned here is AdapterRemoval. Other than that, I would not remove any over-represented sequences. Specifically, there is nothing surprising in finding a large number of rRNA sequences, as many bacterial species have half a dozen or so rrn operon copies. Any conserved piece of rRNA will be present in many if not all bacterial species in your sample, and possibly in 3-8 copies. Again, unless you have separate evidence that some of these over-represented sequences are non-bacterial repeats, I would not exclude them.

ADD REPLYlink written 3 months ago by Mensur Dlakic2.2k

Thank you all, your answers are helping me a lot. Yes, I will always trim my data prior to analysis. I got the whole idea about the over-represented sequences: they're not necessarily a bad thing. My library was generated using the TruSeq kit from Illumina. I got the adapter sequences from their website:

Read1: AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
Read2: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT

So, I trimmed a sample using those adapter sequences. The FastQC adapter content results are shown below:

Screenshot-from-2019-08-09-16-00-50

FastQC flags this plot with a green sign, which is good. But some reads still appear to have a little bit of contamination, less than 10%. Is that a concern? I guess it's not, but just checking.
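
A quick independent check of residual adapter content can be sketched as below. This is a rough illustration, not how FastQC computes its plot: it only looks for an exact match to the first 12 bases of the Read1 adapter quoted above, so it misses adapters with sequencing errors and partial adapters at the 3' end. The function name and the 12-base seed length are assumptions made for this example.

```python
from typing import Iterable

# Read1 adapter as quoted in the post above.
TRUSEQ_READ1 = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"


def fraction_with_adapter(seqs: Iterable[str], adapter: str, seed_len: int = 12) -> float:
    """Fraction of sequences containing an exact match to the adapter's first seed_len bases."""
    seed = adapter[:seed_len].upper()
    total = hits = 0
    for seq in seqs:
        total += 1
        hits += seed in seq.upper()
    return hits / total if total else 0.0


if __name__ == "__main__":
    # Toy reads only; the first one carries the start of the adapter.
    reads = ["ACGTAGATCGGAAGAGCAC", "ACGTACGTACGT"]
    print(fraction_with_adapter(reads, TRUSEQ_READ1))
```

Running it on the sequence lines of the FASTQ before and after trimming gives a simple sanity check that the trimming step did what was intended.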

ADD REPLYlink written 3 months ago by magnuskerber0
0
3 months ago by
Buffo1.7k
Buffo1.7k wrote:

Do you consider a read of 1 nucleotide in length useful? Amplicon sequencing may cause some redundancy because you are sequencing a highly conserved region. So you may plot a histogram of the read length distribution and make a decision based on that.
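
The read length distribution can also be tabulated directly from the FASTQ file. The sketch below is a minimal example, assuming plain 4-line-per-record FASTQ; the function name and the 10 bp bin width are arbitrary choices for illustration.

```python
import io
from collections import Counter
from typing import Dict, TextIO


def length_histogram(handle: TextIO, bin_width: int = 10) -> Dict[int, int]:
    """Count reads per length bin; each key is the lower bound of its bin."""
    counts: Counter = Counter()
    for i, line in enumerate(handle):
        if i % 4 == 1:  # the sequence is the 2nd line of each 4-line record
            length = len(line.rstrip("\n"))
            counts[(length // bin_width) * bin_width] += 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    # Toy data only; replace with open("sample.fastq") for a real file.
    toy = (
        "@r1\nACGT\n+\nIIII\n"
        "@r2\nACGTACGTAC\n+\nIIIIIIIIII\n"
        "@r3\nACGTACGTACGT\n+\nIIIIIIIIIIII\n"
    )
    for lower, n in length_histogram(io.StringIO(toy)).items():
        print(f"{lower:>4}-{lower + 9:<4} {n}")
```

If nearly all reads fall in the top bins, the exact position of the cut-off matters little.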

ADD COMMENTlink written 3 months ago by Buffo1.7k

Do you consider a read of 1 nucleotide in length useful?

Hahah, of course not. About the highly conserved region: the data is not only 16S, I'm not sure if you were referring to that.

So you may plot a histogram of the read length distribution and make a decision based on that.

So, this is interesting. Below is an example of the FastQC length and quality plots after trimming.

Length distribution

Quality distribution

ADD REPLYlink modified 3 months ago by h.mon28k • written 3 months ago by magnuskerber0
1

I have added the correct links to the images; please check "How to add images to a Biostars post".

ADD REPLYlink written 3 months ago by h.mon28k

Thanks h.mon, those are the images. So, I think I'll use a cutoff of 50 bp. Is that OK based on the plots? Or can I go a little bit higher, like 70-90?

ADD REPLYlink written 3 months ago by magnuskerber0