Question: How does Base Call casing affect BWA
1
gravatar for oconnwald
11 days ago by
oconnwald20
Boulder, Colorado
oconnwald20 wrote:

I was wondering how the casing of base calls affected BWA MEM performance.

I ran three sets of data to see if there was any difference at all:

  • All base calls capitalized
  • All base calls lower-cased
  • Random casing for all base calls

I found that the first two sets (when casing was consistent) produced the same exact results whereas there were subtle differences in the last fastq. Here are the samtools flagstats for each:

All upper:

618526 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
1242 + 0 supplementary
0 + 0 duplicates
616822 + 0 mapped (99.72% : N/A)
617284 + 0 paired in sequencing
308642 + 0 read1
308642 + 0 read2
612104 + 0 properly paired (99.16% : N/A)
614584 + 0 with itself and mate mapped
996 + 0 singletons (0.16% : N/A)
2130 + 0 with mate mapped to a different chr
1310 + 0 with mate mapped to a different chr (mapQ>=5)

All lower:

618526 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
1242 + 0 supplementary
0 + 0 duplicates
616822 + 0 mapped (99.72% : N/A)
617284 + 0 paired in sequencing
308642 + 0 read1
308642 + 0 read2
612104 + 0 properly paired (99.16% : N/A)
614584 + 0 with itself and mate mapped
996 + 0 singletons (0.16% : N/A)
2130 + 0 with mate mapped to a different chr
1310 + 0 with mate mapped to a different chr (mapQ>=5)

Random-casing:

618538 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
1254 + 0 supplementary
0 + 0 duplicates
616835 + 0 mapped (99.72% : N/A)
617284 + 0 paired in sequencing
308642 + 0 read1
308642 + 0 read2
612096 + 0 properly paired (99.16% : N/A)
614586 + 0 with itself and mate mapped
995 + 0 singletons (0.16% : N/A)
2140 + 0 with mate mapped to a different chr
1319 + 0 with mate mapped to a different chr (mapQ>=5)

Does anyone know why the aligner is treating the random cased fastq differently?

Here is my BWA/Samtools sort call:

    ${PATH_TO_BWA} mem \
        -R "@RG\tID:${PREFIX}\tSM:${PREFIX}\tLB:${PREFIX}\tPL:ILLUMINA" \
        -t ${BWA_THREADS} \
        -Y \
        ${REF_PREFIX} ${R1_FASTQ} ${R2_FASTQ} \
    | \
    ${PATH_TO_SAMTOOLS} sort - \
        -@ ${SAMTOOLS_THREADS} \
        -o ${OUTPUT_BAM} \
        -T ${TMP_DIR}/${PREFIX}_samtools_tmp
bwa samtools alignment • 109 views
ADD COMMENTlink modified 11 days ago • written 11 days ago by oconnwald20
2

See: A: BWA mem output inconsistent on same but re-ordered FASTQ input

I don't think bwa has a deterministic option to ensure that you get exactly the same result each time. If you are truly interested then you will need to dive into code to see if the case has any effect.

Case has a effect when present in reference (A: Which Aligners Recognize Soft-Masked Repeats In Reference Sequences? ) for some aligners. Heng Li does not recommend using a masked reference.

ADD REPLYlink modified 11 days ago • written 11 days ago by genomax87k

That's interesting -- but the ordering of the reads isn't changing in each run (only casing), so I don't know if the first link applies. We are using a reference that has masked bases though.

ADD REPLYlink modified 11 days ago • written 11 days ago by oconnwald20

As an update, I've run the random-cased fastqs alignment with our reference modified to not include masking (i.e. I capitalized all lower-case bases), and I'm getting equivalent results as I did when I ran alignment of the fastqs with the reference with masking. So, I think the casing is here making a difference?

ADD REPLYlink written 10 days ago by oconnwald20

How did you lower case and random case fastqs? My suspicion is the reads aren't identical - that is, in addition to the different casing.

ADD REPLYlink written 11 days ago by h.mon30k

The number of reads us different in your third sample. Find out why.

ADD REPLYlink written 11 days ago by swbarnes28.2k

Total reads includes supplemental reads -- you can see the offset on both is +12. If there were variation in the number of raw reads in the same, you'd see that reflected in the read1 and read2 counts

ADD REPLYlink modified 10 days ago • written 11 days ago by oconnwald20

Just a simple Python script (I'm happy to post it if you'd like). I've confirmed that the random casing reads are identical in content (other than casing) by converting the random-cased bases back to the original upper casing and running a diff.

ADD REPLYlink modified 10 days ago • written 11 days ago by oconnwald20
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1661 users visited in the last hour