Question

How to choose a metagenomics functional annotation approach: reads-based vs. contig-based


12 months ago

JyiYeung • 0

Hi, I'm new to the field of environmental metagenomics and I want to do functional annotation of my shotgun metagenome data. In the literature, I came across two types of pipeline for metagenomic functional annotation: reads-based vs. contig-based. Both have pros and cons, and I could not determine which one to use. In a reads-based pipeline, the BLAST search is run directly on the clean reads, which may mean more of the data is used. But the reads are short, mostly 150 bp, which is shorter than many target genes. Will this affect the accuracy or efficiency of the final BLAST results?

For a contig-based pipeline, a large share of the sequences from natural samples might not be assembled, so a lot of data could be lost. If I use it, how should I evaluate the assembly result? With something like CheckM? What evaluation threshold is reasonable?

I would really appreciate any responses. Sorry if this is a naive question. Thank you!

metagenomics function annotation reads contigs • 877 views


score 2 · Answer 1 · 2023-04-27

Hi,

It really depends on your data and what you are working on (eukaryotes or prokaryotes). However, I will try to answer your question.

Actually, you already answered this yourself in your question. The read-based approach is not very robust, especially for functional annotation, because the reads are very short. Think about it: you translate the nucleotides into amino acids to match against the database. You have ~150 bp, and when you translate it you get 150/3 = 50 aa at most, and you do not know which part of the protein the read comes from. Do you think that is enough?
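To make the length argument concrete, here is a minimal Python sketch of a six-frame translation, which is roughly what a translated search has to consider for each read. The function names are my own and only for illustration; the codon table is the standard genetic code, and any codon containing an ambiguous base is written as X.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(read):
    """Translate a read in three forward and three reverse frames."""
    read = read.upper()
    frames = []
    for strand in (read, reverse_complement(read)):
        for offset in range(3):
            frames.append(translate(strand[offset:]))
    return frames


def max_peptide_length(read_length, offset=0):
    """Upper bound on amino acids obtainable from one reading frame."""
    return (read_length - offset) // 3


if __name__ == "__main__":
    read = "ATG" * 50  # a toy 150 bp read
    for peptide in six_frame_translations(read):
        print(len(peptide), peptide[:10])
```

So a 150 bp read gives at most 50 aa in the best frame and 49 aa in the shifted frames, and a stop codon inside the read shortens the usable fragment further.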

However, if you have ultra-low-coverage data and de novo assembly does not give you a reasonable result, then you might not have a chance to go on to the downstream analysis.

For the read-based approach, the options that I know of are:

1- blastx

2- mmseqs

3- FragGeneScan - gene prediction on short reads

In the contig-based approach, you can assess your assembly by:

1- Mapping the reads back to the assembly to see the coverage, so you can tell how much information you lost.

2- QUAST, for basic statistics

3- Single-copy orthologs, but only for MAGs, not for the whole assembly. You cannot really evaluate a metagenomic assembly with single-copy genes, because you will see many duplicates, which is very normal in a metagenomic assembly.
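As a rough illustration of the first two checks, the sketch below computes N50 from a list of contig lengths and the fraction of reads that mapped back to the assembly. The function names are hypothetical, and I am assuming you already have the contig lengths and the read counts (for example from your mapper's summary or something like samtools flagstat); it is not a replacement for QUAST.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half of the assembly."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0  # empty assembly


def fraction_mapped(mapped_reads, total_reads):
    """Share of reads that map back to the assembly (0.0 if there are no reads)."""
    if total_reads == 0:
        return 0.0
    return mapped_reads / total_reads


if __name__ == "__main__":
    # Toy numbers, not from a real dataset.
    print("N50:", n50([100, 200, 300, 400]))
    print("Mapped fraction:", fraction_mapped(600, 1000))
```

A low mapped fraction means that a large part of the community is missing from the contigs, which is exactly the loss the question is worried about.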

I would say that the contig-based approach is safer than the read-based approach for the functional annotation.

Hope it helps.