samtools calmd and original base quality
1
0
Entering edit mode
16 months ago
octpus616 ▴ 100

Hi,

I'm currently trying to use samtools calmd to calculates MD and NM tags for my bam files, I noticed that the typical usage mentioned in the documentation (http://www.htslib.org/doc/samtools-calmd.html) is

samtools calmd -bAr aln.bam > aln.baq.bam

and the params -b -A -r means:

OPTIONS
-A
When used jointly with -r this option overwrites the original base quality.

-e
Convert a the read base to = if it is identical to the aligned reference base. Indel caller does not support the = bases at the moment.

-u
Output uncompressed BAM

-b
Output compressed BAM

-C INT
Coefficient to cap mapping quality of poorly mapped reads. See the mpileup command for details. [0]

-r
Compute the BQ tag (without -A) or cap base quality by BAQ (with -A).

-E
Extended BAQ calculation. This option trades specificity for sensitivity, though the effect is minor.

--no-PG
Do not add a @PG line to the header of the output file.

-@, --threads INT
Number of input/output compression threads to use in addition to main thread [0].

I noticed that rewriting base quality is mentioned here, what does base quality refer to here? Is it mapping Quality? If so, why does the calculation of the NM value affect the judgment of mapping Quality?

NGS bam samtools • 1.5k views
ADD COMMENT
2
Entering edit mode
15 months ago
jkbonfield ★ 1.2k

That quoted usage is an example and has a clear description about applying BAQ. It's not a recommandation for how it should be used in most scenarios. I don't really know anyone who uses BAQ in this way, so the example in the man page is probably very out-dated.

Also Calmd is frankly a basket-case of a subcommand. It appears to have been the catch-all for many features, mostly completely unrelated to calculating the MD tag. It even has an undocumented -N option which can be used to turn off the MD/NM calculation bit, so it doesn't do what the command claims to do! All things considered, I find it a bit of a baffling set of decisions, but we can't really change CLI without breaking pipelines.

The "typical" usages are something like:

samtools calmd -b human_in.bam grch38.fa > human_out.bam
samtools calmd --output-fmt CRAM human_in.bam grch38.fa > human_out.cram
... | samtools calmd -u - grch38.fa  | ...

The latter one is optimal for when it is part of a pipeline, where we want to be inputting and outputting uncompressed BAM between pipeline stages for efficiency. (I mean "pipeline" in the traditional UNIX sense, not the inefficient batch-scheduling task-running modern bioinformatics sense.)

ADD COMMENT
0
Entering edit mode

Thanks for your answer, may I take it that samtools calmd [-b] [-u] are the typical usage to calculate MD/ND, in this application, most of the other parameters do not need too much relationship?

ADD REPLY

Login before adding your answer.

Traffic: 2795 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6