Question: why does qualimap2 shows different results for number of mismatches for bowtie2 and BWAMEM?
0
gravatar for maria2019
12 months ago by
maria201990
maria201990 wrote:

Hi, I am very new to NGS analysis. I want to do WES analysis on a normal human embryonic stem cell line. I have aligned the paired-end reads (2* 150) on the the hg38 with bowtie2 and bwamem. after the alignment and then realignment around the indel, I checked the quality with the qualimap2 and I got VERY different result for mismatches, insertion, and indels.

Mismatches and indels (inside of regions)

Bowtie2 results: General error rate 1.11% Mismatches 35,875,622 Insertions 6,729,119 Mapped reads with at least one insertion 6.51% Deletions 812,397 Mapped reads with at least one deletion1.26% Homopolymer indels 16.81%

BWAmem result: General error rate 0.27% Mismatches 20,648,892 Insertions 276,090 Mapped reads with at least one insertion 0.33% Deletions 332,003 Mapped reads with at least one deletion 0.4% Homopolymer indels 59.3%

My questions are: 1. This is a normal cell line (not cancerous)! Should I expect this many mismatches? 2. Should I expect such a difference between the bowtie2 and bwa mem results?

qualimap wes bowtie2 bwamem • 499 views
ADD COMMENTlink modified 12 months ago by h.mon29k • written 12 months ago by maria201990

Changed to Question rather than Page or Tool, please leave it like that.

ADD REPLYlink written 12 months ago by ATpoint31k

You didn't say what were the commands used to map the reads (Bowtie2 in particular allows a lot of customization), nor the overall mappings metrics (number of mapped reads, uniquely and multiply mapped reads, and so on). You also didn't say which sequencing technology did you use. Finally, if you are doing quality control of exome sequencing, it is very helpful to pass to Qualimap2 a bed file with the probes capture regions, so it can provide inside- and outside-probe mapping metrics.

ADD REPLYlink written 12 months ago by h.mon29k

Hi, thanks for your response. Below are the commands I used:

  1. Trimming with cutadapt reports: Encoding: Sanger . Illumina 1.9, Tot sequences: 56558516, GC%: 49%
  2. alignment:

2.1. bwamem:

$ bwa index genome.fasta genome

$ bwa mem -M -t 70 hg38.fa trimmed.R1.fastq trimmed.R2.fastq | samtools sort -@70 -o WES_bwamem_sorted.bam

$ samtools view WES_bwamem_sorted.bam |wc -l ---> result =113413327

2.2. bowtie2

$ bowtie2-build genome.fa hg38

$ ./bowtie2 -t -q -p 70 -x ~/indexes/h38 -1 trimmed.R1.fastq -S trimmed.R2.fastq -S WES_bowtie2.sam

$ samtools sort WES_bowtie2.sam -o WES_bowtie2_sorted.bam (samtools 1.9)

$ samtools view WES_bowti2_sorted.bam |wc -l --> result =113116752

The rest is the same for both:

  1. Add tags with picard: $ java -jar picard.jar AddOrReplaceReadGroups I=WES_bwamem_sorted.bam O=WES_bwamem_sorted_rep.bam RGID=HT52.6 RGLB=lib1 RGPL=illumina RGPU=WBBXX.6 RGSM=K00281 VALIDATION_STRINGENCY=LENIENT

  2. Remove duplicates with picard: $ java -jar picard.jar MarkDuplicates INPUT=WES_bwamem_sorted_rep.bam OUTPUT=dupmarked.bam METRICS_FILE=dupmarkmetrics.txt REMOVE_DUPLICATES=true VALIDATION_STRINGENCY=LENIENT

  3. Index bam file with samtools: $ samtools index dupmarked.bam dupmarked.bam.bai

  4. Locate indels (GATK 3.8): $ java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R hg38.fa -I dupmarked.bam -o out.intervals

  5. Realign around indels: ( GATK 3.8): $ java -jar GenomeAnalysisTK.jar --filter_bases_not_stored -T IndelRealigner -R hg38.fa -I dupmarked.bam -known Homo_sapiens_assembly38.known_indels.vcf -targetIntervals out.intervals -o realigned.bam

  6. Base quality score recalibration: $ java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R .hg38.fa -I realigned.bam -knownSites latest_dbsnp.vcf -o recal_data.grp

  7. Create recalibrated bam : $ java -jar GenomeAnalysisTK.jar -T PrintReads -R hg38.fa -I realigned.bam -BQSR recal_data.grp -o recal.bam

  8. Qualimap: Window #: 400, Size of a homopolymer: 3, Skip duplicate alignments: No, Analyze overlapping paired-end Reads: No, Draw chromosome limits: Yes

ADD REPLYlink written 12 months ago by maria201990

I used qualimap 2 times. Once I only used hg38 without a bed file. Then I used a bed file that I got from genome browser table: (http://genome.ucsc.edu/cgi-bin/hgTables?hgta_doMainPage=Back) and chose exons option for the output as a bed file. Is that what you mean? Or I will have to create my own bed file?

Results:

My Reference size was: 3,099,922,541

Without bed file (only used hg38.fa). Results:

BWA mem

reads 81,615,786

Mapped reads 81,480,705 / 99.83%

Unmapped reads 135,081 / 0.17%

Mapped paired reads 81,480,705 / 99.83%

Mapped reads, first in pair 40,751,388 / 49.93%

Mapped reads, second in pair 40,729,317 / 49.9%

Mapped reads, both in pair 81,434,763 / 99.78%

Mapped reads, singletons 45,942 / 0.06%

Read min/max/mean length 1 / 151 / 146.51

Duplicated reads (estimated) 33,016,104 / 40.45%

Duplication rate 38.84%

Clipped reads 12,393,533 / 15.19%

Coverage: Mean 3.7278 , Standard Deviation 23.8264

Mapping Quality: Mean Mapping Quality: 44.63

Insert size: Mean: 36,360.19 , Standard Deviation: 2,031,289.85 , P25/Median/P75: 146 / 188 / 241

Mismatches and indels : General error rate 0.25% Mismatches 27,060,341 Insertions 432,950 Mapped reads with at least one insertion 0.51% Deletions 486,236 Mapped reads with at least one deletion 0.58% Homopolymer indels 57.47%

Bowtie2

reads 80,986,143

Mapped reads 74,496,919 / 91.99%

Unmapped reads 6,489,224 / 8.01%

Mapped paired reads 74,496,789 / 91.99%

Mapped reads, first in pair 38,656,666 / 47.73%

Mapped reads, second in pair 35,840,123 / 44.25%

Mapped reads, both in pair 71,519,982 / 88.31%

Mapped reads, singletons 2,976,807 / 3.68%

Read min/max/mean length 1 / 151 / 147.89

Duplicated reads (estimated) 26,008,479 / 32.11%

Duplication rate 32.97%

Clipped reads 0 / 0%

Coverage: Mean 3.5345, Standard Deviation 21.7436

Mapping Quality: Mean Mapping Quality 27.72

Insert size: Mean 12,726.21, Standard Deviation 1,057,876.32, P25/Median/P75 161 / 199 / 249

Mismatches and indels General error rate 0.91% Mismatches 55,590,848 Insertions 9,694,304 Mapped reads with at least one insertion 7.49% Deletions 1,235,926 Mapped reads with at least one deletion 1.52% Homopolymer indels 17.53%

With bed file

(NUMEBR OF MISMATCHES IS IN THE QUESTION)

BWA mem

Warnings: Some regions are not loaded, 114573 regions were skipped because chromosome name was not found in the BAM file.

reads 81,615,786

Mapped reads 81,480,705 / 99.83%

Unmapped reads 135,081 / 0.17%

Mapped paired reads 81,480,705 / 99.83%

Mapped reads, first in pair 40,751,388 / 49.93%

Mapped reads, second in pair 40,729,317 / 49.9%

Mapped reads, both in pair 81,434,763 / 99.78%

Mapped reads, singletons 45,942 / 0.06%

Read min/max/mean length 1 / 151 / 146.51

Clipped reads 10,174,547 / 12.47%

2.3. Globals (inside of regions):

Regions size/percentage of reference 135,021,462 / 4.36%

Mapped reads 69,574,272 / 85.25%

Mapped reads, only first in pair 34,828,403 / 42.67%

Mapped reads, only second in pair 34,745,869 / 42.57%

Mapped reads, both in pair 69,553,031 / 85.22%

Mapped reads, singletons 21,241 / 0.03%

Correct strand reads 0 / 0%

Clipped reads 10,174,547 / 12.47%

Duplicated reads (estimated) 30,358,081 / 43.63%

Duplication rate 43.23%

Coverage (inside of regions), Mean 58.958, Standard Deviation 82.5376

Mapping Quality (inside of regions): Mean Mapping Quality: 58.09

Insert size (inside of regions): Mean 11,841.97, Standard Deviation 1,061,230.03, P25/Median/P75: 146 / 186 / 237

Bowtie2

Warnings Some regions are not loaded 114573 regions were skipped because chromosome name was not found in the BAM file.

reads 64,103,566 Mapped reads 59,192,134 / 92.34%

Unmapped reads 4,911,432 / 7.66%

Mapped paired reads 59,192,037 / 92.34%

Mapped reads, first in pair 30,715,233 / 47.92%

Mapped reads, second in pair 28,476,804 / 44.42%

Mapped reads, both in pair 56,830,127 / 88.65%

Mapped reads, singletons 2,361,910 / 3.68%

Read min/max/mean length 1 / 151 / 147.89

Clipped reads 0 / 0%

2.3. Globals (inside of regions)

Regions size/percentage of reference 135,021,462 / 4.36%

Mapped reads 50,853,962 / 79.33%

Mapped reads, only first in pair 26,307,095 / 41.04%

Mapped reads, only second in pair 24,546,782 / 38.29%

Mapped reads, both in pair 49,046,814 / 76.51%

Mapped reads, singletons 1,807,063 / 2.82%

Correct strand reads 0 / 0%

Clipped reads 0 / 0%

Duplicated reads (estimated) 19,437,562 / 38.22%

Duplication rate 37.57%

2.5. Coverage (inside of regions): Mean 44.4131, Standard Deviation 72.5277

2.6. Mapping Quality (inside of regions): Mean Mapping Quality: 37.84

2.7. Insert size (inside of regions)

Mean 8,897.63

Standard Deviation 858,716.96

P25/Median/P75 160 / 196 / 245

ADD REPLYlink modified 12 months ago • written 12 months ago by maria201990
0
gravatar for h.mon
12 months ago by
h.mon29k
Brazil
h.mon29k wrote:
  1. This is a normal cell line (not cancerous)! Should I expect this many mismatches?

You didn't perform "proper" variant calling, you are just looking at all variants, be them true variants or sequencing errors. For an expected range of true variants numbers, see the answers at What is the expected rate of SNPs within a human exome sequencing with HiSeq? .

  1. Should I expect such a difference between the bowtie2 and bwa mem results?

By default, bowtie2 performs end-to-end alignment, while bwa performs local alignment. This means bowtie2 aligns the whole read, even if some regions don't align well, while bwa will soft-clip the extremities of the read, if this increases the alignment score. You can see this clearly at the clipped reads metrics: bowtie2 has 0%, while bwa has a sizeable proportion of clipped reads. I believe the differences in mismatches, insertions and deletions are mostly due to this, and I think the difference would be a lot smaller if you were to perform variant calling and apply quality filters after variant calling.

The mapping rate and mapping rate inside regions are better for bwa, and for this reason I think you should stay with bwa alignment.

I used a bed file that I got from genome browser table: (http://genome.ucsc.edu/cgi-bin/hgTables?hgta_doMainPage=Back) and chose exons option for the output as a bed file. Is that what you mean?

No, the manufacturer of the exome kit probably has released a bed file with the coordinates of the capture probes, I was suggesting you use this bed.

ADD COMMENTlink modified 12 months ago • written 12 months ago by h.mon29k

Thank you very much. your comments are very helpful

ADD REPLYlink written 12 months ago by maria201990
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 979 users visited in the last hour