Question: Paired-end merging and statistical output
2.4 years ago, jomo018 wrote:
  1. I am looking for a paired end merging utility similar to FLASH or PEAR that also outputs statistical information, mainly number of pair mismatches.

  2. Is this type of information available from a SAM file after alignment with BOWTIE, BWA or some other aligner?

paired ends alignment overlap

Hi, I think the number of pair mismatches (if you are talking about your merged reads) is a consequence of your alignment, so it comes from step 2 :) You can get this information simply by comparing the number of aligned reads between the merged and the unmerged reads.

written 2.4 years ago by Titus

I am looking for mismatches at base resolution, not read resolution. Pairs can be merged (or aligned concordantly) even if some overlapping bases disagree. I am looking for the number or rate of these mismatching bases.

written 2.4 years ago by jomo018
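To make "mismatching bases in the overlap" concrete, here is a minimal Python sketch. It assumes the overlap length is already known and that mate 2 is given as sequenced (so it is reverse-complemented before comparison). The function names are made up for illustration; this is not how FLASH, PEAR, or BBMerge find or score overlaps.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (N stays N)."""
    table = str.maketrans("ACGTN", "TGCAN")
    return seq.upper().translate(table)[::-1]


def overlap_mismatches(read1, read2, overlap_len):
    """Count disagreeing bases in an overlap of known length.

    Assumption: the last `overlap_len` bases of read1 cover the same
    positions as the first `overlap_len` bases of the reverse-complemented
    read2. Positions where either mate reports N are skipped.
    """
    tail = read1.upper()[-overlap_len:]
    head = reverse_complement(read2)[:overlap_len]
    return sum(1 for a, b in zip(tail, head) if a != b and "N" not in (a, b))


if __name__ == "__main__":
    # Toy pair with a 4-base overlap; the mates disagree at one position.
    print(overlap_mismatches("ACGTACGTAC", "TTAACCGTCC", 4))
```

Summing this count over all pairs, and dividing by the total number of overlapping bases, would give the per-base mismatch rate asked about here.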

Do you mean variant calling?

written 2.4 years ago by Titus

You can call it variant calling, where one mate declares a different base than the other.

written 2.4 years ago by jomo018
2.4 years ago, Brian Bushnell (Walnut Creek, USA) wrote:

BBMerge can do this, with the "ecco" flag, which rather than combining the reads, just does error-correction via overlap:

bbmerge.sh in=reads.fq ecco mix out=corrected.fq
Total time: 1.890 seconds.

Pairs:                  1000000
Joined:                 182539          18.254%
Ambiguous:              817461          81.746%
No Solution:            0               0.000%
Too Short:              0               0.000%
Errors Corrected:       6994

Avg Insert:             159.9
Standard Deviation:     21.5
Mode:                   187

Insert range:           100 - 191
90th percentile:        186
75th percentile:        178
50th percentile:        164
25th percentile:        145
10th percentile:        128

In this example reads must be interleaved. @Brian: Is that a requirement?

written 2.4 years ago by genomax

I am just reading the manual. They also allow in1 and in2 paired inputs.

written 2.4 years ago by jomo018

Yep, the syntax can also be:

bbmerge.sh in1=r1.fq in2=r2.fq ecco mix out1=corrected1.fq out2=corrected2.fq

...but I normally show the interleaved version of the command for conciseness.

written 2.4 years ago by Brian Bushnell

Is "Errors Corrected" the number of pairs corrected rather than the number of bases corrected?

written 2.4 years ago by jomo018

No, it is the total number of bases corrected.

written 2.4 years ago by Brian Bushnell
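Since "Errors Corrected" counts bases, the summary can be turned into a rough rate. The Python sketch below parses the count lines exactly as they appear in the output pasted in the answer and divides corrected bases by joined pairs. The names `SAMPLE`, `parse_counts`, and `errors_per_joined_pair` are made up for illustration, and using joined pairs as the denominator is an assumption; a true per-base rate would also need the number of overlapping bases, which this summary does not report.

```python
import re

# Count lines copied from the BBMerge summary shown in the answer.
SAMPLE = """\
Pairs:                  1000000
Joined:                 182539          18.254%
Ambiguous:              817461          81.746%
Errors Corrected:       6994
"""

# Only the integer count fields are matched; percentages are ignored.
PATTERN = re.compile(
    r"^(Pairs|Joined|Ambiguous|Errors Corrected):\s+(\d+)(?:\s|$)",
    re.MULTILINE,
)


def parse_counts(text):
    """Return a dict mapping each summary field name to its integer count."""
    return {name: int(value) for name, value in PATTERN.findall(text)}


def errors_per_joined_pair(counts):
    """Corrected bases divided by joined pairs (assumed denominator)."""
    return counts["Errors Corrected"] / counts["Joined"]


if __name__ == "__main__":
    counts = parse_counts(SAMPLE)
    print(counts)
    print(round(errors_per_joined_pair(counts), 4))
```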

Thank you, Brian. Can you clarify the tag trimq=xx (as opposed to qtrim...)? For example, suppose you have a low-quality base in the middle of a read: is it considered an N or something else?

written 2.4 years ago by jomo018

"qtrim" tells the program which end to trim, while "trimq" specifies the quality threshold. Only ends can be trimmed (and for merging, the right end is the one that mainly matters). "qtrim=r trimq=15" will trim bases from the right end such that the trimmed region has an average quality below 15, while the remaining region has an average quality of at least 15. For example, if the last 5 base qualities were "20, 0, 17, 19, 16", then the last 4 bases would be trimmed, as that region has an average quality below 15 (note that 0 is an N), but the 20 would not be trimmed. It's a little hard to calculate by eye because the scores are first transformed to the probability scale (where 16 is roughly a 2.5% chance of error) before being averaged.

Low-quality bases are not considered N. Rather, if two reads mismatch at a location and one is higher quality than the other, the base with the higher quality is assumed correct and the resulting quality score is the higher quality minus the lower one.

modified 2.4 years ago • written 2.4 years ago by Brian Bushnell
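The arithmetic in the reply above can be checked with a short Python sketch. It covers only the two calculations described there: averaging qualities on the probability scale, and resolving a mismatch in favour of the higher-quality base. It is not BBMerge's actual code, the function names are invented, and the tie case (equal qualities) is not covered in the thread, so the placeholder used for it is purely an assumption.

```python
import math


def phred_to_prob(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def mean_quality(quals):
    """Average qualities on the probability scale, then convert back to Phred."""
    mean_p = sum(phred_to_prob(q) for q in quals) / len(quals)
    return -10 * math.log10(mean_p)


def resolve_mismatch(base1, q1, base2, q2):
    """Keep the higher-quality base; new quality is higher minus lower."""
    if q1 == q2:
        # Assumption: ties are not described in the thread; use a placeholder.
        return "N", 0
    if q1 > q2:
        return base1, q1 - q2
    return base2, q2 - q1


if __name__ == "__main__":
    quals = [20, 0, 17, 19, 16]
    print(mean_quality(quals[1:]))   # candidate region to trim: last 4 bases
    print(mean_quality(quals[:1]))   # remaining region: the single Q20 base
    print(resolve_mismatch("A", 30, "C", 12))
```

For the example qualities, the last four bases average well below 15 because the single Q0 base dominates on the probability scale, while the remaining Q20 base stays above the threshold, which matches the trimming outcome described above.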