Question: Tophat multiple or unique mapping criteria
0
gravatar for maple964
2.0 years ago by
maple9640
maple9640 wrote:

Dear all,

I am very confusing while using mapping RNA-seq to Arabidopsis genome.

Basically, I want to know what to consider while setting threshold about mapping quality. I have followed several infomation like this, http://biofinysics.blogspot.tw/2014/05/how-does-bowtie2-assign-mapq-scores.html.

  1. What to consider while choosing multiple mapping or unique mapping criteria?

However, I failed to find a proper function in Tophat.

  1. Is it possible to set the threshold 95% of alignment identity?

  2. What does the mode of --very-fast or --very-sensitive mean in tophat or bowtie2 manual?

Thank you all in advance

rna-seq • 1.0k views
ADD COMMENTlink modified 2.0 years ago by Macspider2.9k • written 2.0 years ago by maple9640
3
gravatar for Macspider
2.0 years ago by
Macspider2.9k
Vienna - BOKU
Macspider2.9k wrote:

Tophat uses bowtie2 as mapping engine. In the tophat2 help:

  Scoring options
    --b2-mp                        <int>,<int> [ default: 6,2              ]
    --b2-np                        <int>       [ default: 1                ]
    --b2-rdg                       <int>,<int> [ default: 5,3              ]
    --b2-rfg                       <int>,<int> [ default: 5,3              ]
    --b2-score-min                 <func>      [ default: L,-0.6,-0.6      ]

What to consider while choosing multiple mapping or unique mapping criteria?

Remember that these algorithms won't find the best mapping position at the first try all the time. Through the -k parameter you can set how many mapping "attempts" the program has to do before giving up, and for example allowing for only 1 tryout doesn't imply you'll find the best solution. If the mapping seed fits perfectly in one spot but the surrounding is not what is supposed to be, with -k 1 you will retrieve only that. Perhaps there was a mapping position somewhere else with 1 mismatch in the seed that was extending the alignment better, but you didn't consider it because you asked for only one mapping attempt per read.

Another thing to point out: you probably want to ask the algorithm to try, I don't know, say 5 times before giving up, and then from the output file you select the primary alignment by excluding the secondary ones (samtools view -F 0x0100). Among 5 mapping attempts, there is most likely the best one which will be annotated as "primary".

Is it possible to set the threshold 95% of alignment identity?

The --b2-score-min parameter sets the minimum alignment score that you can allow a read to reach before discarding a mapping. The two -0.6 work like this, and can be changed: the read length is multiplied by the third number, and then the second number is added. If your reads are 91 nt long, you will have ( 91 x -0.6 ) + ( -0.6 ) = -55.2

The --b2-mp defines the mismatch penalty: the first number is used when the quality of the base is good, the second when it's bad. If you trimmed the reads before, you might as well use --ignore-quals so the program will only use the first value (6) and it gets easier to get what you want. In this case you can have at most 55.2 / 6 = 9.2 ~ 9 mismatches in one read.

--b2-rdg and --b2-rfg define the read and reference gap opening and extension penalty. In this case you can have a gap which is at most ( 55.2 - 5 ) / 3 = 16.73 ~ 16 nt long.

So all in all you can have either 9 mismatches or 16 nt gap or a combination of the two. With reads that are 91 nt long, having 9 mismatches represents ~10% of the read length and therefore it would be ~90% seq identity.

You can't set a particular percentage of sequence identity, that is done easily in BLAT if you need it, but you can refine your mapping strategy way more with this power.

What does the mode of --very-fast or --very-sensitive mean in tophat or bowtie2 manual?

If you have a closer look, these modes are basically tweaking the parameters that I just described you in such a way that the alignment becomes faster (i.e. not trying "hard" to map a read) or sensitive (i.e. trying "hard").

ADD COMMENTlink written 2.0 years ago by Macspider2.9k

Dear Macspider, Thank you for your reply. I have some question.

Why use 55.2 / 6 ? and knowing 9 mismatches in one read. and ( 55.2 - 5 ) / 3 ?

What are the "floor" and "MIN" meaning from the equation "MN + floor( (MX-MN)(MIN(Q, 40.0)/40.0) ) " ? Any example? (I couldn't find it online)

ADD REPLYlink written 24 months ago by maple9640
1

What is the floor function? https://en.wikipedia.org/wiki/Floor_and_ceiling_functions

What is the min function? https://en.wikipedia.org/wiki/Maxima_and_minima

How do they work in this equation: let's say you set MX=6 and MN=2. This means that the maximum gap penalty is 6 while the minimum is 2. What regulates the penalty is the mapping quality of the position. If you set also --ignore-quals, it will always use MX=6 regardless of the quality. If you don't use that flag, qualities will count. How? they are phred scores, therefore you can assume they are in a range of [0;42]. For each mismatch, you have to ask yourself: how much of the gap penalty should I subtract from the score for this one? Is it a good mismatch, or a bad quality one? To translate this into numbers, on a scale from MN=2 to MX=6, how much is this mismatch worth? The more it is a good one, the more you subtract from the score. The less reliable it is, the less you subtract from the alignment score to avoid biasing the alignment with a false positive mismatch.

Therefore:

  • for each mismatch position take the mapping quality (Q) in that position: let's say you have one which is Q=30
  • select the minimum value between 30 and 40 (sometimes you might have 41 or 42 as Q, in that case you take 40)
  • divide this value by 40: in this case you obtain 3/4 = 0.75
  • define the distance between a complete mismatch penalty (MX) and the bad quality mismatch penalty (MN). In this case: 6-2=4
  • Multiply 4 * 0.75 = 3 and then MN+3 = 2+3 = 5 this is your gap penalty for this position
  • Subtract 5 from your current alignment score
  • If the score has gone under the threshold that you set with L,-0.6,-0.6 (which in this example case is -55.2) break the alignment
ADD REPLYlink written 24 months ago by Macspider2.9k
1

Why use 55.2 / 6 ? and knowing 9 mismatches in one read. and ( 55.2 - 5 ) / 3 ?

Penalties are completely arbitrary, you have to find your ones. You can set a scoring function that allows the score to go only until -10, instead of -55.2, and then assign -1 for each mismatch and -1 for each gap, you will get the same as if you allowed the score to reach -100 and assigned -10 for gaps and mismatches. It's a matter of fine tuning.

ADD REPLYlink written 24 months ago by Macspider2.9k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 959 users visited in the last hour