Question

Checking mutation frequency for every base in metagenome amplicon sequencing

0

Entering edit mode

7.0 years ago

wanderingstefan ▴ 30

Hey all!

I have 2x251 bp MiSeq amplicon sequencing data of a gene from environmental metagenomes. The amplicon, a part of a bacterial topoisomerase gene, has a size of 350bp. After merging the reads, I want to map them to a reference sequence and check the mutation frequency for each base relative to the reference. My first idea would be to create a MSA from the reference and the reads and then just go through every position in each entry of the resulting alignment file. However, since I have around 1 million sequences per sample, I fear that both creating and parsing the alignment file would be really slow...Does someone have experience with this/knows useful tools or can think of another way of doing this?

Thanks for your input!

*Edit**

Sorry if I was unclear, I am quite new to metagenomics and my understanding/definition of terms is still developing. I have amplicon sequencing data from several E. coli dominated metagenomes. The amplicon is a subregion of a bacterial topoisomerase. I want to check the mutation frequency at different positions for each of the genomes, which have been subjected to different treatments. So I want to compare each read to the reference sequence and see how frequent mutations at the specific site are under the different conditions. I want to do this for nucleotide sequences first, later eventually translate the reads and do it for amino acid positions. Hope that clarifies it a bit, sorry again for being unclear.

sequencing alignment gene amplicon • 2.2k views

ADD COMMENT • link 7.0 years ago by wanderingstefan ▴ 30

2

Entering edit mode

Merge, scan/trim to remove adapter contamination, map with bbmap and the call variants using callvariants.sh. All using BBMap suite.

ADD REPLY • link 7.0 years ago by GenoMax 142k

0

Entering edit mode

Can't you just map the reads to the reference genome and make a vcf of it. You only need to know the differences between your sample and the references, right (not of every base I assume).

ADD REPLY • link 7.0 years ago by Benn 8.3k

score 1 · Answer 1 · 2017-04-25

1

Entering edit mode

7.0 years ago

WouterDeCoster 47k

You are following a quite unconventional approach which I wouldn't recommend for amplicon sequencing.

After merging the reads

You shouldn't merge reads. Just leave them as a pair
Just use an aligner such as bwa mem to map your reads to the reference genome. Perfect tool for the job.
Do variant calling using GATK or samtools/bcftools

ADD COMMENT • link 7.0 years ago by WouterDeCoster 47k

1

Entering edit mode

Big this! This is a "solved" (in as much as anything in bioinformatics is solved) problem. Don't re-invent the wheel. Spend some time reading the basic literature around the topic to get a feel for the different tools out there. What toolchain to use does depend on the nature of your reference. If you're working with humans in particular, or mammals in general, go with the GATK. If you have a different organism you may need to get into more boutique territory.

ADD REPLY • link 7.0 years ago by DG 7.3k

0

Entering edit mode

Hey, thanks for your reply. Why would you recommend not to merge the reads? The size of my amplicon is around 350 bp, so merged forward and reverse reads should cover it completely. Data is microbial metagenomes btw.

ADD REPLY • link 7.0 years ago by wanderingstefan ▴ 30

1

Entering edit mode

It would have been wise to add that info about the metagenome in your initial question.

To answer a question here, seems to become more and more a lottery. Did I guessed the question right?

ADD REPLY • link 7.0 years ago by Benn 8.3k

0

Entering edit mode

True, edited the question

ADD REPLY • link 7.0 years ago by wanderingstefan ▴ 30

0

Entering edit mode

Okay, important information... try to be as complete as possible when asking questions!

Disclaimer: I'm working on human genetics, so I might make mistakes with regard to metagenomics.

I guess your first step would be to find out which species are present in the sample? Then you can align your reads to those references and do variant calling...

About merging reads: there is absolutely no advantage for merging, and all or most aligners can handle two fastq files (one for each direction).

ADD REPLY • link 7.0 years ago by WouterDeCoster 47k

0

Entering edit mode

This amplicon sequencing and if the reads are overlapping (as indicated by OP) then merging them makes sense.

ADD REPLY • link 7.0 years ago by GenoMax 142k

0

Entering edit mode

Okay, I guess it won't hurt, but would you say it has an added value? For variant calling strand-bias is important and having reads from both directions support a mutation is a plus, no?

ADD REPLY • link 7.0 years ago by WouterDeCoster 47k

0

Entering edit mode

Depends on where the variant is. In this case a 150 bp overlap would be expected between R1/R2 and any real variants in that region will be supported by both reads. For the rest of the region there would be support from only one read/strand. It is not clear from original post but it appears that only some regions of the metagenome are being amplified.

ADD REPLY • link 7.0 years ago by GenoMax 142k

score 1 · Answer 2 · 2017-04-25

1

Entering edit mode

7.0 years ago

Benn 8.3k

If you have metagenomics data, you can try to do de novo assembly (e.g. MIRA or trinity). Most assemblers can handle paired-end data so you don't have to merge anything at all.

After assembly you can do MSA with the contigs. You can use ClustalW for this. If your sequences are very alike, you can use ADOMA ADOMA: A Command Line Tool to Modify ClustalW Multiple Alignment Output to show only the differences in your alignment.

ADD COMMENT • link 7.0 years ago by Benn 8.3k

0

Entering edit mode

Hey, thanks for your answer. If I have 2x251 bp reads and my amplicon is 350bp long, would assembly not be an overkill? Theoretically one should then be able through 'assemble' every amplicon just through merging two paired reads or?

ADD REPLY • link 7.0 years ago by wanderingstefan ▴ 30

0

Entering edit mode

How many different species do you expect in your sample? I hope less then the number of reads, right? With assembly you kinda 'merge' all the exact similar reads (with the assumption that they are of the same species).

I don't know if this is true for your gene of course.

ADD REPLY • link 7.0 years ago by Benn 8.3k

0

Entering edit mode

Yes, I expect very few species in these samples, E.coli being the great majority. The gene in question is quite conserved between species.

ADD REPLY • link 7.0 years ago by wanderingstefan ▴ 30

0

Entering edit mode

Then I would first try to assemble it into contigs, and use the contigs for MSA.

ADD REPLY • link 7.0 years ago by Benn 8.3k

0

Entering edit mode

The gene in question is quite conserved between species.

Won't the assembly produce a single contig in that case?

We are still missing important information about what is being done here. How many genomes are there?

ADD REPLY • link 7.0 years ago by GenoMax 142k