Question: How to reduce cross mapping alignments with paired-end reads ?
0
gravatar for David
2.2 years ago by
David160
David160 wrote:

Hi,

I´m trying to align paired-end reads (illumina) coming from a metagenomics experiment to a custom database of bacterial species.

I´m using bwa-mem (althought could change program if required) to align reads to my genomes. It works pretty well but i have some reads mapping to two different bacterial species. I know this is normal and occurs because those genomes a very close related.

In my case i have Species-A mapped by 2200 reads where as close related speciesB is mapped by 800reads. Only speciesA is present in my sample so clearly speciesB is a false positive.

Is there any way to try to reduce false positives by tuning bwa in order to minimize the impact of cross mapping reads ??

thanks , david

bwa dnaseq genome • 1.1k views
ADD COMMENTlink modified 2.2 years ago • written 2.2 years ago by David160
1

In case you are absolutely sure than B is not in the sample, why not removing B from the database?

ADD REPLYlink written 2.2 years ago by ATpoint26k

Agree with ATPoint , why do you align against Species B ref if it's not possible to find it ?

You can still change your parameters to filter reads with Mismatch penalty , but you will got less aligned reads in global mapping.

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by Titus910

I´m testing my pipeline with a known MOCK but the idea is to apply the same pipeline to unknown metagenomic samples so removing species from my database makes no sense.

ADD REPLYlink written 2.2 years ago by David160
1
gravatar for genomax
2.2 years ago by
genomax75k
United States
genomax75k wrote:

BBSplit from BBMap is designed for this type of application. Check this SA thread for information. You can choose how to handle the multi-mapping reads (ones that map across genomes). We have used this tool with a pool of 20 species with great results.

ADD COMMENTlink modified 2.2 years ago • written 2.2 years ago by genomax75k
0
gravatar for Kevin Blighe
2.2 years ago by
Kevin Blighe52k
Kevin Blighe52k wrote:

The choice of aligner should be made based on the read lengths that you've got. BWA mem is optimised for reads > 70bp. If you've got shorter reads, then you should consider bowtie/bowtie2 or the older BWA algorithm.

If you wanted to attempt to increase specificity of mapping for the species that you know was in your sample(s), then you could eliminate reads that are below a certain length (say, 70bp), which can be done using cutadapt or Trim Galore!, and then perform the alignment. Longer read lengths result in more specific matching, of course.

Also it would be useful to run FastQC on your reads in order to see how the base quality looks across the reads. If it's poor base quality, then they'll map to dozens of species even outside your database due to false base calls. Generally, you should only have bases with Phred-scaled qualities > 20 or 30 at the read ends.

With bowtie/bowtie2, you can also add a useful set of parameters to ensure that only uniquely-mapped reads are retained. Taking it direct from my notes:

Use -m 1 and --best options to only keep uniquely matched reads (and the one with the 'best' MAPQ)

A few things to consider and try.

Kevin

ADD COMMENTlink written 2.2 years ago by Kevin Blighe52k
1

Thanks for the update kevin,

It´s a 2x150bp experiment. I´ve done all the above trimming and QC on the reads prior to mapping them on the database.

Thanks for the bowtie2 notes , i might consider give it a try.

david

ADD REPLYlink written 2.2 years ago by David160

Did you check if your pairs map on the same ref ?

ADD REPLYlink written 2.2 years ago by Titus910
0
gravatar for David
2.2 years ago by
David160
David160 wrote:

Yes, they both map, genomes are very close so i suspect there is no real solution ???

ADD COMMENTlink written 2.2 years ago by David160

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.

This comment should go up against @Titus's.

ADD REPLYlink written 2.2 years ago by genomax75k

Yes there is plenty of solution :

  • You can make a filter on the % of coverage of your genomes ( in your case you will get only this gene covered on your Species B reference isn't it ? )

  • You can select for each ref of your species only specific part etc ...

The solution will depends on what you want at the end.

By the way i m not sure "false positive" in your title are appropriate if your sequences are identical...

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by Titus910

Yes , that´s what i thought, working with a certain % coverage, it makes a lot of sense ??

Is there any database that list all ncbi genome lengths for bacterial ?? or should i create it myself?

Agree with the title, might not be appropriate

ADD REPLYlink written 2.2 years ago by David160

Well it make sense if in your genome you have only this gene covered , depends on the genome size so in % coverage etc ...

I thinks genome lengths bacterial are available (for the most popular). If you have fasta you can simply calculate the length :)

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by Titus910
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1886 users visited in the last hour