Question: Checking mutation frequency for every base in metagenome amplicon sequencing
0
gravatar for wanderingstefan
2.1 years ago by
wanderingstefan30 wrote:

Hey all!

I have 2x251 bp MiSeq amplicon sequencing data of a gene from environmental metagenomes. The amplicon, a part of a bacterial topoisomerase gene, has a size of 350bp. After merging the reads, I want to map them to a reference sequence and check the mutation frequency for each base relative to the reference. My first idea would be to create a MSA from the reference and the reads and then just go through every position in each entry of the resulting alignment file. However, since I have around 1 million sequences per sample, I fear that both creating and parsing the alignment file would be really slow...Does someone have experience with this/knows useful tools or can think of another way of doing this?

Thanks for your input!

*Edit**

Sorry if I was unclear, I am quite new to metagenomics and my understanding/definition of terms is still developing. I have amplicon sequencing data from several E. coli dominated metagenomes. The amplicon is a subregion of a bacterial topoisomerase. I want to check the mutation frequency at different positions for each of the genomes, which have been subjected to different treatments. So I want to compare each read to the reference sequence and see how frequent mutations at the specific site are under the different conditions. I want to do this for nucleotide sequences first, later eventually translate the reads and do it for amino acid positions. Hope that clarifies it a bit, sorry again for being unclear.

ADD COMMENTlink modified 2.1 years ago • written 2.1 years ago by wanderingstefan30
2

Merge, scan/trim to remove adapter contamination, map with bbmap and the call variants using callvariants.sh. All using BBMap suite.

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by genomax67k

Can't you just map the reads to the reference genome and make a vcf of it. You only need to know the differences between your sample and the references, right (not of every base I assume).

ADD REPLYlink written 2.1 years ago by Benn6.8k
1
gravatar for WouterDeCoster
2.1 years ago by
Belgium
WouterDeCoster39k wrote:

You are following a quite unconventional approach which I wouldn't recommend for amplicon sequencing.

After merging the reads

  1. You shouldn't merge reads. Just leave them as a pair
  2. Just use an aligner such as bwa mem to map your reads to the reference genome. Perfect tool for the job.
  3. Do variant calling using GATK or samtools/bcftools
ADD COMMENTlink written 2.1 years ago by WouterDeCoster39k
1

Big this! This is a "solved" (in as much as anything in bioinformatics is solved) problem. Don't re-invent the wheel. Spend some time reading the basic literature around the topic to get a feel for the different tools out there. What toolchain to use does depend on the nature of your reference. If you're working with humans in particular, or mammals in general, go with the GATK. If you have a different organism you may need to get into more boutique territory.

ADD REPLYlink written 2.1 years ago by Dan Gaston7.1k

Hey, thanks for your reply. Why would you recommend not to merge the reads? The size of my amplicon is around 350 bp, so merged forward and reverse reads should cover it completely. Data is microbial metagenomes btw.

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by wanderingstefan30
1

It would have been wise to add that info about the metagenome in your initial question.

To answer a question here, seems to become more and more a lottery. Did I guessed the question right?

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by Benn6.8k

True, edited the question

ADD REPLYlink written 2.1 years ago by wanderingstefan30

Okay, important information... try to be as complete as possible when asking questions!

Disclaimer: I'm working on human genetics, so I might make mistakes with regard to metagenomics.

I guess your first step would be to find out which species are present in the sample? Then you can align your reads to those references and do variant calling...

About merging reads: there is absolutely no advantage for merging, and all or most aligners can handle two fastq files (one for each direction).

ADD REPLYlink written 2.1 years ago by WouterDeCoster39k

This amplicon sequencing and if the reads are overlapping (as indicated by OP) then merging them makes sense.

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by genomax67k

Okay, I guess it won't hurt, but would you say it has an added value? For variant calling strand-bias is important and having reads from both directions support a mutation is a plus, no?

ADD REPLYlink written 2.1 years ago by WouterDeCoster39k

Depends on where the variant is. In this case a 150 bp overlap would be expected between R1/R2 and any real variants in that region will be supported by both reads. For the rest of the region there would be support from only one read/strand. It is not clear from original post but it appears that only some regions of the metagenome are being amplified.

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by genomax67k
1
gravatar for Benn
2.1 years ago by
Benn6.8k
Netherlands
Benn6.8k wrote:

If you have metagenomics data, you can try to do de novo assembly (e.g. MIRA or trinity). Most assemblers can handle paired-end data so you don't have to merge anything at all.

After assembly you can do MSA with the contigs. You can use ClustalW for this. If your sequences are very alike, you can use ADOMA ADOMA: A Command Line Tool to Modify ClustalW Multiple Alignment Output to show only the differences in your alignment.

ADD COMMENTlink written 2.1 years ago by Benn6.8k

Hey, thanks for your answer. If I have 2x251 bp reads and my amplicon is 350bp long, would assembly not be an overkill? Theoretically one should then be able through 'assemble' every amplicon just through merging two paired reads or?

ADD REPLYlink written 2.1 years ago by wanderingstefan30

How many different species do you expect in your sample? I hope less then the number of reads, right? With assembly you kinda 'merge' all the exact similar reads (with the assumption that they are of the same species).

I don't know if this is true for your gene of course.

ADD REPLYlink written 2.1 years ago by Benn6.8k

Yes, I expect very few species in these samples, E.coli being the great majority. The gene in question is quite conserved between species.

ADD REPLYlink written 2.1 years ago by wanderingstefan30

Then I would first try to assemble it into contigs, and use the contigs for MSA.

ADD REPLYlink written 2.1 years ago by Benn6.8k

The gene in question is quite conserved between species.

Won't the assembly produce a single contig in that case?

We are still missing important information about what is being done here. How many genomes are there?

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by genomax67k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 802 users visited in the last hour