Question

How does Base Call casing affect BWA

1

Entering edit mode

3.7 years ago

oconnwald ▴ 20

I was wondering how the casing of base calls affected BWA MEM performance.

I ran three sets of data to see if there was any difference at all:

All base calls capitalized
All base calls lower-cased
Random casing for all base calls

I found that the first two sets (when casing was consistent) produced the same exact results whereas there were subtle differences in the last fastq. Here are the samtools flagstats for each:

All upper:

618526 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
1242 + 0 supplementary
0 + 0 duplicates
616822 + 0 mapped (99.72% : N/A)
617284 + 0 paired in sequencing
308642 + 0 read1
308642 + 0 read2
612104 + 0 properly paired (99.16% : N/A)
614584 + 0 with itself and mate mapped
996 + 0 singletons (0.16% : N/A)
2130 + 0 with mate mapped to a different chr
1310 + 0 with mate mapped to a different chr (mapQ>=5)

All lower:

618526 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
1242 + 0 supplementary
0 + 0 duplicates
616822 + 0 mapped (99.72% : N/A)
617284 + 0 paired in sequencing
308642 + 0 read1
308642 + 0 read2
612104 + 0 properly paired (99.16% : N/A)
614584 + 0 with itself and mate mapped
996 + 0 singletons (0.16% : N/A)
2130 + 0 with mate mapped to a different chr
1310 + 0 with mate mapped to a different chr (mapQ>=5)

Random-casing:

618538 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
1254 + 0 supplementary
0 + 0 duplicates
616835 + 0 mapped (99.72% : N/A)
617284 + 0 paired in sequencing
308642 + 0 read1
308642 + 0 read2
612096 + 0 properly paired (99.16% : N/A)
614586 + 0 with itself and mate mapped
995 + 0 singletons (0.16% : N/A)
2140 + 0 with mate mapped to a different chr
1319 + 0 with mate mapped to a different chr (mapQ>=5)

Does anyone know why the aligner is treating the random cased fastq differently?

Here is my BWA/Samtools sort call:

    ${PATH_TO_BWA} mem \
        -R "@RG\tID:${PREFIX}\tSM:${PREFIX}\tLB:${PREFIX}\tPL:ILLUMINA" \
        -t ${BWA_THREADS} \
        -Y \
        ${REF_PREFIX} ${R1_FASTQ} ${R2_FASTQ} \
    | \
    ${PATH_TO_SAMTOOLS} sort - \
        -@ ${SAMTOOLS_THREADS} \
        -o ${OUTPUT_BAM} \
        -T ${TMP_DIR}/${PREFIX}_samtools_tmp

bwa alignment samtools • 1.2k views

ADD COMMENT • link 3.7 years ago by oconnwald ▴ 20

1

Entering edit mode

See: A: BWA mem output inconsistent on same but re-ordered FASTQ input

I don't think bwa has a deterministic option to ensure that you get exactly the same result each time. If you are truly interested then you will need to dive into code to see if the case has any effect.

Case has a effect when present in reference (A: Which Aligners Recognize Soft-Masked Repeats In Reference Sequences? ) for some aligners. Heng Li does not recommend using a masked reference.

ADD REPLY • link 3.7 years ago by GenoMax 141k

0

Entering edit mode

That's interesting -- but the ordering of the reads isn't changing in each run (only casing), so I don't know if the first link applies. We are using a reference that has masked bases though.

ADD REPLY • link 3.7 years ago by oconnwald ▴ 20

0

Entering edit mode

As an update, I've run the random-cased fastqs alignment with our reference modified to not include masking (i.e. I capitalized all lower-case bases), and I'm getting equivalent results as I did when I ran alignment of the fastqs with the reference with masking. So, I think the casing is here making a difference?

ADD REPLY • link 3.7 years ago by oconnwald ▴ 20

0

Entering edit mode

How did you lower case and random case fastqs? My suspicion is the reads aren't identical - that is, in addition to the different casing.

ADD REPLY • link 3.7 years ago by h.mon 35k

0

Entering edit mode

The number of reads us different in your third sample. Find out why.

ADD REPLY • link 3.7 years ago by swbarnes2 14k

0

Entering edit mode

Total reads includes supplemental reads -- you can see the offset on both is +12. If there were variation in the number of raw reads in the same, you'd see that reflected in the read1 and read2 counts

ADD REPLY • link 3.7 years ago by oconnwald ▴ 20

0

Entering edit mode

Just a simple Python script (I'm happy to post it if you'd like). I've confirmed that the random casing reads are identical in content (other than casing) by converting the random-cased bases back to the original upper casing and running a diff.

ADD REPLY • link 3.7 years ago by oconnwald ▴ 20