No mismatch allowed setting in Bowtie2 (-N 0 does not help)
1
0
Entering edit mode
2.9 years ago
Ankit ▴ 290

Hi,

Does anyone know the setting of the bowtie2 parameter to not to allow any mismatch at all?

-N 0 does not work, maybe because it controls only seed length but not entire read length. Also -L i.e. seed length value does not go beyond 35, So I cannot set it to equal to read length (50 in my case). These results in several of the mismatches in my sam output even with -N 0 and -L 22 (default).

I also did not understand the role of --score-min especially, Constant (C) function. There is no example in the bowtie2 manual for constant function. Is it possible with --score-min to control mismatches?

It would be a great help if somebody has experience on this and can suggest me something.

Thanks

Bowtie2 mismatch --score-min constant alignment • 2.4k views
1
Entering edit mode
2.9 years ago

You may set a very high --mp (mismatch penalty)

--score-min is very well explained in the manual. See section "Valid alignments meet or exceed the minimum score threshold" http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml

0
Entering edit mode

Mismatch penalty has MAX,MIN values. If I set MAX as 10000, what should be the MIN value or it doesn't matter?

Also if I understood the concept correctly, by increasing the mismatch penalty will decrease the alignment score drastically (highly negative) and that cause rejection of the alignment. Is it like that?

In this link, it was suggested to use --score-min C,0,0 to disallow any mismatch. Since in end-to-end mode negative values arise because of mismatch and 0 shows the no mismatch, I guess setting the minimum threshold as 0, leads to rejection of all the reads with alignment score AS:i<0.

I am not sure about the function how C,0,0 works because in the manual example is given for L and G, For S (sqrt) I found in other forum but no clue about Constant (C). I assume C,0,0 is just C=0. and no function just a constant term setting.

Thanks

1
Entering edit mode

MAX, MIN is for penalty according to Quality of base. You may set both to same and a very high number

Also if I understood the concept correctly, by increasing the mismatch penalty will decrease the alignment score drastically (highly negative) and that cause rejection of the alignment. Is it like that?

correct.

In this link, it was suggested to use --score-min C,0,0 to disallow any mismatch. Since in end-to-end mode negative values arise because of mismatch and 0 shows the no mismatch, I guess setting the minimum threshold as 0, leads to rejection of all the reads with alignment score AS:i<0.

This is correct too, and this will also abolish any gaps whatsoever.

However, I am not sure what you want to achieve by doing an end-to.end match without any mismatch/gap. This will be equivalent to "searching" or "grepping" your Query string in Genome without taking any considertion of Qualtiy of bases.

I am not sure about the function how C,0,0 works because in the manual example is given for L and G, For S (sqrt) I found in other forum but no clue about Constant (C). I assume C,0,0 is just C=0. and no function just a constant term setting.

Again you are right. C,0,0 is just their fancy way to say constant function with constant term and zero coefficient, i.e 0(the constant term) + 0(coefficient ) * (1, i.e. the constant function) = 0

0
Entering edit mode

Thank you, Santosh for the important suggestions.

However, I am not sure what you want to achieve by doing an end-to.end match without any mismatch/gap

By not allowing mismatch at all, I want to perform allele-specific analysis of hybrid mouse cells derived from two different mouse strains and paternal and maternal allele are different from each other by millions of SNPs. So I want to align the hybrid mESC reads to both parents ref genome differently so that I can discriminate between the two alleles. Therefore, I need to completely disallow mismatching keeping in mind full read should align. I know it is possible by local alignment also because I did this before by bowtie1. But now my data has a read length >50 bp and I need to switch to Bowtie2.

I am not able to use online software for allele-specific analysis because I don't have a VCF SNPs. I tried calling from BAM of whole genome data. But it gets abruptly terminated.

So I thought I should use the strategy to use coordinates of SNPs (which I got from Supp data of paper) and intersect my reads on it.

Or if you have a better idea please suggest.

Thanks

1
Entering edit mode

Hi Ankit, thanks for the explanation.

Therefore, I need to completely disallow mismatching keeping in mind full read should align.

This way, you will lose a lot of data because this is just a fixed string search, and you know when the base quality is poor, end-to-end match is not possible without mismatch.

You may try QuasR - It runs mostly in automated fashion and needs only the SNPs coordinates (section 6.5)

https://bioconductor.org/packages/3.8/bioc/vignettes/QuasR/inst/doc/QuasR.html#65_allele-specific_analysis

0
Entering edit mode

You may also get the VCF of SNPs if you know the strain

ftp://ftp-mouse.sanger.ac.uk/current_snps/strain_specific_vcfs/

0
Entering edit mode

It seems MAX,MIN is only shown in manual (--mp MX,MN). The downloaded bowtie2 version I have is showing --mp <int>. In that case only one value is required. Well I set 10000 and running with it.

0
Entering edit mode

You may set a very high --mp (mismatch penalty)

I tried --mp 10000. It actually did not remove or rejected any aligned read with a mismatch.

It might be because the minimum threshold was still higher (close to 0). Because I think even if you decrease alignment score highly negative it will not reject the read.

I need to set --score-min 0 or higher to reject the read. In --local mode, the --score-min value can go to positive values (because of match bonus add). So there is no way I can fix the --score-min to 0 or any specific value. Only in the --end-to-end mode, the max --score-min value can be achieved is 0.

So that is the only option I left with and that driven me to opt for the end-to-end mode in bowtie2.

Correct me if I am overthinking.

Thanks

1
Entering edit mode

I tried --mp 10000. It actually did not remove or rejected any aligned read with a mismatch.

It might be because the minimum threshold was still higher (close to 0). Because I think even if you decrease alignment score highly negative it will not reject the read.

This is very surprising! From manual The default in --end-to-end mode is L,-0.6,-0.6 and the default in --local mode is G,20,8. => The score min in end-to-end for 100bp read is -0.6+(-0.6*100) = -60.6, which is much higher than your --mp penalty of 1000. Even if you add --ma <int> in local mode, which defaults to 2 per base, you will gain 200 more, but still not enough to overcome the penalty.

I need to set --score-min 0 or higher to reject the read. In --local mode, the --score-min value can go to positive values (because of match bonus add).

This is not correct. score_min is a minimum alignment score to get a valid alignment, and it can be changed by the parameter --score_min only. In local mode, the positive values of match bonus changes (or increase) the alignment score so as to make it greater than score_min and consequently to make a valid alignment

0
Entering edit mode

Hi,

My command: bowtie2 -p 10 --mp 10000 --no-1mm-upfront -q -x /home/ref_av/mm9_bowtie2/mm9 -U my_trimmed.fastq -S out.sam

From manual, about the mismatch penalty,

--mp MX,MN
Sets the maximum (MX) and minimum (MN) mismatch penalties, both integers. A number less than or equal to MX and greater than or equal to MN is subtracted from the alignment score for each position where a read character aligns to a reference character, the characters do not match, and neither is an N.


Now let's say there is 1 mismatch i.e NM:i:1, So if I set --mp 10000, the alignment score will be -10000 ie. AS:i:-10000.

This is very surprising! From manual The default in --end-to-end mode is L,-0.6,-0.6 and the default in --local mode is G,20,8.

So for 50 bp read, --score-min = -30.6, The AS:-10000 should be rejected. as per definition of valida alignment,

For an alignment to be considered “valid” (i.e. “good enough”) by Bowtie 2, it must have an alignment score no less than the minimum score threshold.


Problem is in my sam file, of course, I do not see AS:i:-10000, but I do see AS:i:-5 or AS:-1 or AS:i:-28, which are the results of mismatch and still persists inside sam file. Ideally, if I set --mp 10000, these reads were not supposed to be here. Am I thinking right way?

This seems --mp is not doing anything to AS:i?

0
Entering edit mode

Ok, a couple of tests comes to my mind to debug this problem:

1. Run with --mp 10000,10000 and also use --ignore-quals (just to be on safer side)

As I told, the MX and MN are to account for read quality. If the bases that not matching are of low Q, then it will get score close to MN. By using --ignore-quals , you always will use the maximum value whatever the baseQ be.

1. Remove --no-1mm-upfront . I am not sure if it is useful here.

2. If nothing works, post a couple of aligned reads from SAM-file.

0
Entering edit mode

Hi Santosh, Thanks for suggestion,

So I ran the pipeline.

Here is the command:

bowtie2 -p 10 --mp 10000,10000 --ignore-quals -x /home/ref_av/mm9_bowtie2/mm9 -U my_trimmed.fastq -S my.mpsant.sam


The problem is still there:

The reads containing mismatch are still present in output sam file:

    SRR******.83    0   chr12   44771981    1   50M *   0   0   CCTCGTGTTTTGTTACGGGATCGTGGGTTCGAGTCCCACCTCGTGCAGAG  BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF<F  AS:i:    0  XS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:50 YT:Z:UU

SRR********.45  16  chr5    78066692    0   13M1D1I36M  *   0   0   AGTACAGGGGAACACCAGGGCCAAGAAGCGGGAGTGGGTGGGTAGAGGAG  FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBBBB<  **-AS:i:-16**   XS:i:-16    XN:i:0  XM:i:0  XO:i:2  XG:i:2  NM:i:2  MD:Z:13^G36 YT:Z:UU

SRR******.72    16  chr2    98507289    0   18M1D1I31M  *   0   0   AAGGACCTGGAATATGGCAAGAAAACTGAAAATCACGGAAAATGAGAAAT  FFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBBBBB  **-AS:i:-16**   XS:i:-16    XN:i:0  XM:i:0  XO:i:2  XG:i:2  NM:i:2  MD:Z:18^G31 YT:Z:UU

SRR******.87    16  chr2    98504099    16  17M1D33M    *   0   0   TTTTCCGTGATTTTCAGTTTCTCGCCATATTCCAGGTCTTACAGTGTGCA  FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBBBBB  **-AS:i:-8**    XS:i:-24    XN:i:0  XM:i:0  XO:i:1  XG:i:1  NM:i:1  MD:Z:17^T33 YT:Z:UU


Please suggest what could be the reason behind this.

Am I doing something wrong?

Thanks

Ankit

1
Entering edit mode

Hi Ankit,

First of all, I have moved my comment as an answer, as it was getting longer and messier in the comment section.

Your SAM is ok. The problem is that you are giving huge penalty for mismatch, but keeping default for Insertion and deletion. So bowtie is trying to find mismatches as a combination of insertion and deletion. eg:

MD:Z:13^G36 : Here there are 13 matching bases, then there is one deletion(^) and then one insertion(G) and then 36 matching bases. In effect, you query has a mismatch as (G), but bowtie treats it as consecutive insertion and deletion to avoid huge penalty on mismatches. You can also see the same thing by CIGAR string and NM

13M1D1I36M = 13M 1D 1I 36M says that you have 13(aligned sequences == match or mismatch of single bp) + 1(Deletion) + 1(Insertion) + 36(Match).

NM = 2, means edit distance = 2. It means how many changes you need to do to make the query equal to the REF (which is 1INSERTION + 1DEL = 2 changes).

You can also see that the score AS is -16 because by default:

 --rdg <int1>,<int2>
Sets the read gap open (<int1>) and extend (<int2>) penalties. A read gap of length N gets a penalty of <int1> + N * <int2>. Default: 5, 3.

--rfg <int1>,<int2>
Sets the reference gap open (<int1>) and extend (<int2>) penalties. A reference gap of length N gets a penalty of <int1> + N * <int2>. Default: 5, 3.


The default to open a gap in REFERENCE/READ is 5 and extend is 3. So if you have insertion + deletion, it will get a score of -(5+3) = -8

Now in end-to-end (default) mode, the score_min is L,-0.6,-0.6 => -0.6 + (-0.6) * 50 = -30.6 and as your AS is lower than this, it is reported.

you have several options to stop this, like give high penalty to Gap open and extend (--rdg and --rfg) or change --score_min (lets make it positive), then any negative penalty will not able to overcome the score_min. I'll suggest to make score_min a positive value and use --end-to-end alignment flag explicitly. I hope your understand the basics and can explore further, what will best suit your need

0
Entering edit mode

Hi Santosh, Thanks for the reply.

like give high penalty to Gap open and extend (--rdg and --rfg) or change --score_min (lets make it positive),

So If there is a choice between setting --score-min or --rdg , --rfg, which one is good to choose.

As I have discussed before, I have set --score-min C,0,0. Why that parameter or way was wrong? The only problem was I was losing lot of reads but at least I was able to get rid of the mismatched reads. This basically fixed the threshold of valid alignment to 0. That leads to rejection of my all the reads which contains penalty (AS:i:<0).

The default to open a gap in REFERENCE/READ is 5 and extend is 3. So if you have insertion + deletion, it will get a score of -(5+3) = -8

Now in default mode --score-min:-30.6. It seems like --mp higher values will not influence AS until I change the --rfg, --rdg ?? Or --mp is not the right parameter for my specific requirement ??

The questions arise what should be the values. (Default: 5, 3 gives -8). Should I try (50,30 which can give -80). Is it just a random choice to keep it sufficiently negative ?

Thanks

Ankit

0
Entering edit mode

So If there is a choice between setting --score-min or --rdg , --rfg, which one is good to choose.

BTW, to quickly debug the problem, create a new fastq of only of those 4 reads and align them.

As I have discussed before, I have set --score-min C,0,0. Why that parameter or way was wrong? The only problem was I was losing lot of reads but at least I was able to get rid of the mismatched reads. This basically fixed the threshold of valid alignment to 0. That leads to rejection of my all the reads which contains penalty (AS:i:<0).

So you already tried this! This may happen. Because you are asking no gap(INDEL) or mismatch in the alignment, and also an end-to-end match, it is possible that most of your reads are not passing the very stringent criteria. (Unless you have a counterexample to show that a read should have been reported as valid alignment, but discarded because of AS <0)

Should I try (50,30 which can give -80). Is it just a random choice to keep it sufficiently negative ?

You may try that. However, as you have already tried with score_min and only few reads are passing the criteria, I believe the problem is not of the penalty. My guess is that there are very few complete 50bp read which match perfectly to your REF.

You should also check for any adapter contamination, and trim the reads for even a 1-bp match with adapter, because in the end-to-end mode, it will be impossible to align perfectly even if you have one extra base which is not from the species and hence not matching. Moreover, you should double check that the sequencing quality is good (fastqc), because with such a stringent criteria, you will find no match if the bases are not sequenced to the highest quality.

Last but not the least: An alternative way to your problem is that you may align the read in the normal way without any special param, and filter on NM (NM:i:0). These are the reads (NM:i:0) which map without any mismatch/indel.

0
Entering edit mode

Hi Santosh,

You should also check for any adapter contamination you should double check that the sequencing quality is good (fastqc)

So I checked the quality there is no adapter contamination which I removed prior to starting analysis.

An alternative way to your problem is that you may align the read in the normal way without any special param, and filter on NM (NM:i:0). These are the reads (NM:i:0) which map without any mismatch/indel.

This I already tried. Ultimately I will lose all the reads which contain even a single mismatch either by NM:i:0 or by --score-min C,0,0.

I was thinking for another approach. Do you know any software which can allow mismatch at the rest of genome but not at my region of interest or coordinates. I know SnpSplit, it creates N- masked genome. Problem is, the VCF file for the specific mouse strains does not included the strain of our interest. So I thought an alternative.

Thanks again !

0
Entering edit mode

Do you know any software which can allow mismatch at the rest of genome but not at my region of interest or coordinates.

QuasR does similar thing (or even better). It will create alternative genome by replacing the variant paternal and maternal SNPs in the REF. Then it maps to these genomes, and take the best scoring match. It works in mostly automated fashion after initial setup

https://bioconductor.org/packages/3.8/bioc/vignettes/QuasR/inst/doc/QuasR.html#65_allele-specific_analysis

0
Entering edit mode

Hi Santosh,

Thank you for the suggestion.

So I ran QuasR pieline for Allele Specfic analysis. The output is not so useful, because it generated a combined bam file for alignment on two different genome.So in the end there is no way I can discriminate between the two alleles and only I can count total reads aligned on two different genomes (as generated by QuasR by incorporating SNPs) that too for a feature and not the individual peak. QuasR reported reads aligned on R (reference allele), A (alternative allele), U (unclassifiable, reads shared b/w both the genome).

R: the read aligned better to the reference genome
U: the read aligned equally well to both genomes (unknown origin)
A: the read aligned better to the alternative genome

My output:
width Sample1_R Sample1_U Sample1_A Sample2_R Sample2_U Sample2_A Sample3_R Sample3_U

ENSMUSG00000015202.7   1500         5        31         5         7        18         5         5        17
ENSMUSG00000064065.8   1500         6        11         6         6        19         7         6        15
ENSMUSG00000078488.1   1500         0         3         1         1        16         0         3        13
ENSMUSG00000019775.10  1500         4        22         4         2        24         3         0        10
ENSMUSG00000019774.8   1500        13        27         7         6        18         3         4        20
ENSMUSG00000019773.7   1500        11        36         2         2        10         0         6        22
Sample3_A Sample4_R Sample4_U Sample4_A
ENSMUSG00000015202.7          3         0        26         6
ENSMUSG00000064065.8          8         6        19         4
ENSMUSG00000078488.1          2         2        22         1
ENSMUSG00000019775.10         4         4         7         0
ENSMUSG00000019774.8          2         0        10         6
ENSMUSG00000019773.7          0         3         3         0
`

As you will notice, I have only the counts on the feature and not on the peaks. However a single feature might have more than 1 peak. Representing it in single counts is not reliable approach and not useful for my downstream processing.

Also the wiggle file, so generated carry combined peaks of both the alleles.

I was thinking if some way I can extract only the peak count as well as different bam or wiggle files for reads aligned on two different genomes.

SNPsplit however splitted the reads (output: bam) on the basis of SNPs file given and N-maked genome created. .

Thanks AV