Question

Read length for metagenomics analysis

1

4.7 years ago

magnuskerber ▴ 20

Hi guys! So, I am an undergraduate student in bioinformatics and I am starting to perform and study metagenomics/microbiome analysis.

I have some data generated on an Illumina HiSeq. My read lengths range from 1 to 151 bp. Here is briefly what I have done so far: I ran FastQC on each sample to identify overrepresented sequences, then removed them with Trimmomatic, together with the SLIDINGWINDOW:4:15 option for extra cleaning.

I didn't set a minimum length for my reads. I am assuming that more data is better, BUT I am not sure about that. So, is there a minimum read length for metagenomic analysis?

Thanks in advance.

gene assembly metagenomic reads • 3.2k views

ADD COMMENT • link updated 4.7 years ago by Buffo ★ 2.4k • written 4.7 years ago by magnuskerber ▴ 20

1

I cleaned my data, by using FastQC on each sample to identify overrepresented sequences.

Did you remove over-represented sequences as identified by FastQC? Why? You should remove adapters and other contaminants, but not necessarily over-represented sequences: these may represent the more abundant organisms in your dataset, not contamination of any kind.

You should explain your data in more detail (amplicon? shotgun metagenomics?) and what you mean by "perform metagenomic analysis". But, as already pointed out, 1bp "reads" are useless. I would use, as a bare minimum, 35bp reads, but this length is already too short.

ADD REPLY • link 4.7 years ago by h.mon 35k
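
The minimum-length idea above can be sketched in a few lines of Python. This is only an illustration, assuming plain 4-line FASTQ records and paired files in the same read order; the function names and the default of 50 are made up for the example. In practice Trimmomatic's MINLEN step applies the same kind of filter.

```python
from itertools import islice


def parse_fastq(handle):
    """Yield (header, sequence, quality) from 4-line FASTQ records."""
    handle = iter(handle)  # make sure islice consumes the input as it goes
    while True:
        chunk = list(islice(handle, 4))
        if len(chunk) < 4:
            return  # end of input (or a truncated last record)
        header, seq, _plus, qual = (line.rstrip("\r\n") for line in chunk)
        yield header, seq, qual


def filter_pairs(records_r1, records_r2, min_len=50):
    """Keep a read pair only if both mates are at least min_len bases long."""
    for rec1, rec2 in zip(records_r1, records_r2):
        if len(rec1[1]) >= min_len and len(rec2[1]) >= min_len:
            yield rec1, rec2
```

Dropping the whole pair keeps the R1 and R2 files in sync, which most assemblers expect; whether to keep the surviving mate as a single-end read is a separate choice.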

0

Did you remove over-represented sequences as identified by FastQC? Why? You should remove adapters and other contaminants, but not necessarily over-represented sequences: these may represent the more abundant organisms in your dataset, not contamination of any kind.

That is interesting. Do you use Trimmomatic too? I mean, it provides some FASTA files with adapters for some Illumina kits (MiSeq, HiSeq etc.).

You should explain your data in more detail (amplicon? shotgun metagenomics?) and what you mean by "perform metagenomic analysis". But, as already pointed out, 1bp "reads" are useless. I would use, as a bare minimum, 35bp reads, but this length is already too short.

The data comes from Shotgun Metagenomics, I tried to upload some images in my answer to Buffo but I don't think it's working. Anyways, the majority of my reads are indeed over 100bp with a phred score of 30+.

ADD REPLY • link updated 4.7 years ago by h.mon 35k • written 4.7 years ago by magnuskerber ▴ 20

1

I generally use bbduk.sh with the bundled list of adapter sequences. Sometimes I use UniVec to remove contaminants, or remove some obvious contaminants from specific projects - e.g., remove human reads from shotgun {soil,water,insect gut} metagenomics. But I don't remove the sequences flagged as "over-represented" by FastQC.

ADD REPLY • link 4.7 years ago by h.mon 35k

score 4 · Answer 1 · 2019-08-07

4

4.7 years ago

JC 13k

A lot depends on what the next analysis is: the minimum length will depend on your aligner or k-mer classifier. But in general, anything shorter than 25 bases cannot be informative.

The problem here is that if a sequence is too short, there is a high probability of a random hit, and you will be wasting computing time.

ADD COMMENT • link 4.7 years ago by JC 13k
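
To put a rough number on the "random hit" argument, here is a back-of-the-envelope sketch. It assumes exact matching against a reference of independent, uniformly distributed bases, which real genomes and real aligners do not satisfy, so treat the values as orders of magnitude only. The function name and the 1 Gbp reference size are invented for the example.

```python
def expected_random_hits(read_len, db_bases, both_strands=True):
    """Expected number of exact chance matches of one random read.

    Assumes a random reference with equal base frequencies (A, C, G, T
    each 0.25), so a given position matches with probability 0.25**read_len.
    """
    positions = max(db_bases - read_len + 1, 0)  # possible start positions
    strands = 2 if both_strands else 1
    return strands * positions * 0.25 ** read_len


# Hypothetical 1 Gbp reference: a 15-mer is expected to hit by chance
# about twice, while a 25-mer almost never does.
for k in (15, 20, 25, 35):
    print(k, expected_random_hits(k, 1e9))
```

Aligners that tolerate mismatches raise these chance-hit rates, which is one reason the practical cut-offs suggested in this thread are above 25.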

0

Yes, I thought about that. The thing is, I didn't find any defined number for read length. I will use MEGAHIT for assembly. Do you use 25bp on a regular basis?

ADD REPLY • link 4.7 years ago by magnuskerber ▴ 20

0

To paraphrase what JC said: a read of any length is useless if it can't overlap with another read, because that's the only way to extend it. With that in mind, short overlaps are not helpful either, because they can occur by chance. The minimal informative overlap is dependent on the complexity of the assembled (meta)genome, but 25 is a reasonable cut-off. This means that a 25-nucleotide read would barely be long enough to reliably overlap with other reads, but wouldn't really extend them very much.

ADD REPLY • link 4.7 years ago by Mensur Dlakic ★ 27k
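
The overlap argument can be illustrated with a toy exact-match function. This is a sketch of the reasoning only: MEGAHIT builds de Bruijn graphs from k-mers rather than computing pairwise read overlaps, and the function name and sequences below are made up.

```python
def suffix_prefix_overlap(a, b, min_overlap=25):
    """Length of the longest suffix of `a` that exactly equals a prefix of `b`.

    Returns 0 if the best overlap is shorter than min_overlap, since
    shorter overlaps are too likely to occur by chance.
    """
    for n in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


shared = "ACGTTGCAAGCTTAGGCTAACCGGTTAACG"  # 30 bases shared by both reads
left = "T" * 10 + shared
right = shared + "C" * 10

overlap = suffix_prefix_overlap(left, right)
extension = len(right) - overlap  # new bases that `right` adds to `left`
print(overlap, extension)
```

A read that is exactly as long as the minimum overlap gives an extension of zero: it can confirm a sequence but cannot grow it. The k-mer equivalent is that a read shorter than the k-mer size contributes no k-mers to the graph.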

score 4 · Answer 2 · 2019-08-07

4

4.7 years ago

Mensur Dlakic ★ 27k

You have so few reads below 70 bp that it really doesn't matter. Let me put it this way: making a cut-off at either 50 or 70 bp will definitely not hurt your assembly. I don't think you need to go above 70.

ADD COMMENT • link 4.7 years ago by Mensur Dlakic ★ 27k
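
One way to check that a cut-off "doesn't matter" is to compute how much data it would discard. A minimal sketch, assuming the read lengths have already been collected into a list; the function name and the example lengths are invented and are not the poster's data.

```python
def loss_at_cutoff(lengths, cutoff):
    """Return (fraction of reads, fraction of bases) shorter than cutoff."""
    total_reads = len(lengths)
    total_bases = sum(lengths)
    dropped = [n for n in lengths if n < cutoff]
    read_frac = len(dropped) / total_reads if total_reads else 0.0
    base_frac = sum(dropped) / total_bases if total_bases else 0.0
    return read_frac, base_frac


# Made-up distribution: mostly full-length reads with a short tail.
lengths = [151] * 95 + [60] * 3 + [20] * 2
for cutoff in (50, 70):
    reads_lost, bases_lost = loss_at_cutoff(lengths, cutoff)
    print(cutoff, f"{reads_lost:.1%} of reads", f"{bases_lost:.1%} of bases")
```

Because short reads carry few bases, the fraction of bases lost is usually much smaller than the fraction of reads lost.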

0

Ok, thank you Mensur. I'll make a 70bp cut then. My data is of good quality, so that is a good thing. I have a question about removing the overrepresented sequences. In my case they are around 50bp (I know that a 70bp cut will remove them, but I'm asking out of curiosity). Here is an example of overrepresented sequences found in a sample:

Ok, some are adapters. But for example when I BLAST the third sequence from top-down I get something like this:

[Image: BLAST result for that sequence]

I get many matches, with different organisms. So, how should I handle these over-represented sequences? Wouldn't removing all of them prior to the analysis help me avoid false positives? What is your opinion on this matter?

PS.: Sorry about the long comment. Biostars lets me comment only 5 times/day, so I'm using my resources the best way I can.

ADD REPLY • link 4.7 years ago by magnuskerber ▴ 20

3

You need to clean up your reads first: use Trimmomatic/cutadapt to remove adapters before doing anything else.

ADD REPLY • link 4.7 years ago by JC 13k

3

Trim adapters first (always). You may check the overrepresented k-mers after trimming adapters, but be careful. If I am not mistaken, recent versions of FastQC have the k-mer module (Kmer Content) disabled by default, as the authors considered it to be potentially misleading.

What makes you think the bacteria from the over-represented sequences are false positives? You have shotgun metagenomics data, after all, and these sequences may belong to the dominant bacteria from your samples.

ADD REPLY • link 4.7 years ago by h.mon 35k

3

I will echo what JC and h.mon told you: make sure to remove the adapters. A tool that has worked well for me and hasn't been mentioned here is AdapterRemoval. Other than that, I would not remove any over-represented sequences. Specifically, there is nothing surprising in finding a large number of rRNA sequences, as many bacterial species have half a dozen or so rrn operon copies. Any conserved piece of rRNA will be present in many if not all bacterial species in your sample, and possibly in 3-8 copies. Again, unless you have separate evidence that some of these over-represented sequences are non-bacterial repeats, I would not exclude them.

ADD REPLY • link 4.7 years ago by Mensur Dlakic ★ 27k

0

Thank you all, your answers are helping me a lot. Yes, I will always trim my data prior to analysis. I got the whole idea about the over-represented sequences: they're not necessarily a bad thing. My library was generated using the TruSeq kit from Illumina. I got the adapter sequences from their website:

Read1: AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
Read2: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT

So, I trimmed a sample using those adapter sequences, and the FastQC adapter content results are shown here:

FastQC flags this plot green, which is good. But some reads appear to have a little adapter contamination, less than 10%. Is that a concern? I guess not, but just checking.

ADD REPLY • link 4.7 years ago by magnuskerber ▴ 20
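
For readers wondering what 3' adapter trimming does mechanically, here is a deliberately simplified sketch using the Read1 adapter quoted above. It only handles exact matches; Trimmomatic, cutadapt, bbduk.sh and AdapterRemoval all tolerate mismatches and use more careful scoring, so this is not how any of them is implemented. The function name and `min_overlap` default are made up.

```python
TRUSEQ_R1 = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"  # Read1 adapter from the post


def trim_adapter(read, adapter=TRUSEQ_R1, min_overlap=3):
    """Remove an exact 3' adapter match and everything after it."""
    # Case 1: the full adapter occurs somewhere inside the read.
    idx = read.find(adapter)
    if idx != -1:
        return read[:idx]
    # Case 2: the read ends partway through the adapter, so only an
    # adapter prefix is present at the 3' end of the read.
    for n in range(min(len(read), len(adapter) - 1), min_overlap - 1, -1):
        if read.endswith(adapter[:n]):
            return read[: len(read) - n]
    return read


insert = "ACGT" * 10
print(trim_adapter(insert + TRUSEQ_R1 + "ATCTCG") == insert)  # full adapter
print(trim_adapter(insert + TRUSEQ_R1[:8]) == insert)         # partial adapter
```

The `min_overlap` setting is a trade-off: very short adapter prefixes at the read end are indistinguishable from genuine sequence, so trimmers leave some of them in place.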

h.mon · Answer 3 · 2019-08-07

1

4.7 years ago

Buffo ★ 2.4k

Do you consider useful a read of 1 nucleotide in length? Amplicon sequencing may cause some redundancy due to the fact that you are sequencing a highly conserved region. So you may perform a histogram of the read length distribution and take a decision based on that.

ADD COMMENT • link 4.7 years ago by Buffo ★ 2.4k
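
The read length histogram suggested above is easy to compute directly from a FASTQ file. A minimal sketch, assuming standard 4-line records; the function name and the file name in the usage comment are hypothetical.

```python
from collections import Counter
from itertools import islice


def read_length_histogram(handle):
    """Count reads per length from an iterable of FASTQ lines."""
    counts = Counter()
    # The sequence is the 2nd line of every 4-line record.
    for seq in islice(handle, 1, None, 4):
        counts[len(seq.rstrip("\r\n"))] += 1
    return counts


# Usage with a gzipped file (hypothetical file name):
#   import gzip
#   with gzip.open("sample_R1.fastq.gz", "rt") as fh:
#       hist = read_length_histogram(fh)
#   for length in sorted(hist):
#       print(length, hist[length])
```

FastQC's "Sequence Length Distribution" module reports the same information as a plot; the raw counts are handy when you want to test specific cut-offs.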

0

Do you consider useful a read of 1 nucleotide in length?

Hahah, of course not. About the highly conserved region: the data is not only 16S, I'm not sure if that is what you were referring to.

So you may perform a histogram of the read length distribution and take a decision based on that.

So, this is interesting. Below is an example of the FastQC length and quality plots after trimming.

Length distribution

Quality distribution

ADD REPLY • link updated 4.7 years ago by h.mon 35k • written 4.7 years ago by magnuskerber ▴ 20

1

I have added the correct links to the images, please check How to add images to a Biostars post.

ADD REPLY • link 4.7 years ago by h.mon 35k

0

Thanks h.mon, those are the images. So, I think I'll use a cutoff of 50bp. Is that ok based on the graphics? Or can I go a little bit higher, like 70-90?

ADD REPLY • link 4.7 years ago by magnuskerber ▴ 20