Question: How to reduce cross-mapping alignments with paired-end reads?
David (150) wrote 18 months ago:

Hi,

I'm trying to align paired-end reads (Illumina) from a metagenomics experiment to a custom database of bacterial species.

I'm using bwa-mem (although I could change programs if required) to align reads to my genomes. It works pretty well, but I have some reads mapping to two different bacterial species. I know this is normal and occurs because those genomes are very closely related.

In my case, Species A is mapped by 2200 reads, whereas the closely related Species B is mapped by 800 reads. Only Species A is present in my sample, so Species B is clearly a false positive.

Is there any way to reduce false positives by tuning bwa to minimize the impact of cross-mapping reads?

Thanks, David

bwa dnaseq genome • 849 views

If you are absolutely sure that B is not in the sample, why not remove B from the database?

Reply by ATpoint (14k), written 18 months ago

Agree with ATpoint: why align against the Species B reference if it cannot be found in the sample?

You can still change your parameters to filter reads, for example with the mismatch penalty, but you will get fewer aligned reads overall.

Reply by Titus (770), written 18 months ago, modified 18 months ago

I'm testing my pipeline with a known mock community, but the idea is to apply the same pipeline to unknown metagenomic samples, so removing species from my database makes no sense.

Reply by David (150), written 18 months ago

genomax (64k, United States) wrote 18 months ago:

BBSplit from BBMap is designed for this type of application. Check this SA thread for information. You can choose how to handle the multi-mapping reads (ones that map across genomes). We have used this tool with a pool of 20 species with great results.

Kevin Blighe (39k, Republic of Ireland) wrote 18 months ago:

The choice of aligner should be made based on the read lengths that you've got. BWA mem is optimised for reads > 70bp. If you've got shorter reads, then you should consider bowtie/bowtie2 or the older BWA algorithm.

If you wanted to attempt to increase specificity of mapping for the species that you know was in your sample(s), then you could eliminate reads that are below a certain length (say, 70bp), which can be done using cutadapt or Trim Galore!, and then perform the alignment. Longer read lengths result in more specific matching, of course.
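
As a rough illustration of the length filter described above, here is a minimal Python sketch. It assumes plain 4-line FASTQ records and two mate files in the same read order; the function names and the 70 bp default are placeholders for illustration, not part of cutadapt or Trim Galore!. A pair is kept only if both mates pass, so the two files stay in sync.

```python
from itertools import islice


def read_fastq(lines):
    """Yield (header, sequence, plus, quality) tuples from 4-line FASTQ records."""
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if len(record) < 4:
            return  # end of input (or a truncated final record)
        yield tuple(record)


def filter_pairs_by_length(r1_lines, r2_lines, min_len=70):
    """Keep a read pair only if BOTH mates are at least min_len bases long.

    min_len=70 is an illustrative threshold, not a recommendation.
    """
    for rec1, rec2 in zip(read_fastq(r1_lines), read_fastq(r2_lines)):
        if len(rec1[1]) >= min_len and len(rec2[1]) >= min_len:
            yield rec1, rec2
```

In practice a dedicated trimmer is the better choice; the sketch only shows what "drop pairs with a short mate" means.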

Also, it would be useful to run FastQC on your reads to see how the base quality looks across the reads. If the base quality is poor, the reads will map to dozens of species, even outside your database, due to false base calls. Generally, you should only have bases with Phred-scaled qualities > 20 or 30 at the read ends.

With bowtie, you can also add a useful set of parameters to ensure that only uniquely mapped reads are retained (note that -m and --best are options of the original bowtie; bowtie2 does not provide them). Taking it directly from my notes:

Use the -m 1 and --best options to keep only uniquely matched reads (and the 'best' alignment for each)
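
For aligners without such a switch (as far as I know, bwa-mem has none), a common workaround is to filter on mapping quality after alignment, since reads that align equally well to several places are normally given a very low MAPQ. The sketch below works on SAM text lines; the function name and the threshold of 10 are assumptions chosen for illustration.

```python
def filter_sam_by_mapq(sam_lines, min_mapq=10):
    """Yield SAM header lines plus mapped alignments with MAPQ >= min_mapq.

    MAPQ is the 5th SAM column; FLAG bit 0x4 marks an unmapped read.
    min_mapq=10 is an illustrative cutoff, not a tuned value.
    """
    for line in sam_lines:
        if line.startswith("@"):
            yield line  # keep header lines untouched
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # skip malformed records
        flag = int(fields[1])
        mapq = int(fields[4])
        if flag & 0x4:
            continue  # unmapped read
        if mapq >= min_mapq:
            yield line
```

This does roughly what `samtools view -q 10` does on a BAM file. Note that it filters each alignment on its own, so one mate of a pair can be kept while the other is dropped.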

A few things to consider and try.

Kevin


Thanks for the update, Kevin.

It's a 2x150 bp experiment. I've already done all of the trimming and QC described above on the reads prior to mapping them to the database.

Thanks for the bowtie notes; I might give it a try.

David

Reply by David (150), written 18 months ago

Did you check whether both reads of each pair map to the same reference?
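
One way to check this from SAM output is to compare each read's reference (RNAME, column 3) with its mate's reference (RNEXT, column 7, where "=" means "same as RNAME"). The sketch below is a minimal example; the function name and category labels are made up for illustration, and it counts per reference sequence, so a genome split over several contigs or plasmids would need an extra contig-to-species mapping.

```python
from collections import Counter


def count_pair_placement(sam_lines):
    """Count pairs whose mates map to the same or to different references.

    Only the first mate of each pair (FLAG 0x40) is inspected so every pair
    is counted once; secondary (0x100) and supplementary (0x800) alignments
    are ignored.
    """
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # malformed record
        flag = int(fields[1])
        if not (flag & 0x1 and flag & 0x40):
            continue  # not paired, or not the first mate
        if flag & 0x900:
            continue  # secondary or supplementary alignment
        if flag & 0x4 or flag & 0x8:
            counts["unmapped_mate"] += 1  # read or its mate is unmapped
        elif fields[6] == "=" or fields[6] == fields[2]:
            counts["same_reference"] += 1
        else:
            counts["different_reference"] += 1
    return counts
```

A large "different_reference" count would suggest that the mates are being split between the closely related genomes.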

Reply by Titus (770), written 18 months ago

David (150) wrote 18 months ago:

Yes, they both map. The genomes are very close, so I suspect there is no real solution?


Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.

This comment should go up against @Titus's.

Reply by genomax (64k), written 18 months ago

Yes, there are plenty of solutions:

  • You can filter on the % coverage of your genomes (in your case, only this gene will be covered on your Species B reference, won't it?)

  • For each species reference, you can keep only the species-specific parts, etc.

The solution will depend on what you want in the end.

By the way, I'm not sure "false positive" in your title is appropriate if your sequences are identical...
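
The % coverage filter from the first bullet can be sketched as follows. The example assumes per-position depth lines in the three-column, tab-separated layout that `samtools depth` prints (reference, position, depth) and a dictionary of reference lengths; the function names and the 0.5 cutoff are illustrative assumptions, not tested thresholds.

```python
from collections import defaultdict


def breadth_of_coverage(depth_lines, ref_lengths, min_depth=1):
    """Return {reference: fraction of positions with depth >= min_depth}.

    depth_lines: iterable of "ref<TAB>pos<TAB>depth" strings.
    ref_lengths: {reference: length in bp}.
    """
    covered = defaultdict(int)
    for line in depth_lines:
        ref, _pos, depth = line.rstrip("\n").split("\t")[:3]
        if int(depth) >= min_depth:
            covered[ref] += 1
    return {
        ref: covered.get(ref, 0) / length
        for ref, length in ref_lengths.items()
        if length > 0
    }


def refs_passing(breadth, min_breadth=0.5):
    """List references whose covered fraction reaches min_breadth (assumed cutoff)."""
    return sorted(ref for ref, frac in breadth.items() if frac >= min_breadth)
```

A genome that is truly present should be covered fairly evenly, whereas a cross-mapped relative tends to show reads piled up on a few shared regions and a low covered fraction.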

Reply by Titus (770), written 18 months ago, modified 18 months ago

Yes, that's what I thought; working with a certain % coverage makes a lot of sense.

Is there any database that lists all NCBI bacterial genome lengths, or should I create it myself?

I agree about the title; it might not be appropriate.

Reply by David (150), written 18 months ago

Well, it makes sense if only this gene is covered in your genome; it depends on the genome size, hence working in % coverage, etc.

I think bacterial genome lengths are available (for the most popular ones). If you have the FASTA files, you can simply calculate the lengths :)
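
Calculating the lengths from a FASTA file takes only a few lines. This is a minimal sketch: the function name is arbitrary, and it assumes each record's name is the first whitespace-delimited token after ">".

```python
def fasta_lengths(lines):
    """Return {sequence name: length in bases} for an iterable of FASTA lines."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)  # sequence may span several lines
    return lengths
```

For a file on disk (the file name here is hypothetical), something like `with open("genomes.fasta") as fh: lengths = fasta_lengths(fh)` should work. The lengths are per sequence, so multi-contig genomes need to be summed per species.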

Reply by Titus (770), written 18 months ago, modified 18 months ago