Question: bwa mem not aligning
2.1 years ago by
senowinski30 wrote:

I am running the following command:

bwa mem -t 4 /users/person/resources/reference/hg19/genome/ucsc.hg19.fasta 160095-T_S2_L003_R1_001.fastq.gz 160095-t_s2_l003_r2_001.fastq.gz > 160095-T_S2_L003_R1_001.fastq.gz.bwa.sam

bwa prints the following messages, in which all four orientations are skipped:

[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 546760 sequences (40000002 bp)...
[M::process] read 546534 sequences (40000140 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (1, 1, 0, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] skip orientation FR as there are not enough pairs
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[mem_sam_pe] paired reads have different names: "NS500784:187:HHYNGAFXX:4:11401:17815:1049", "NS500784:187:HHYNGAFXX:1:11101:13503:1043"

[mem_sam_pe] paired reads have different names: "NS500784:187:HHYNGAFXX:4:11401:8106:1064", "NS500784:187:HHYNGAFXX:1:11101:4077:1043"

When I open the sam file it looks like this:

@PG ID:bwa  PN:bwa  VN:0.7.15-r1140 CL:bwa/users/person/resources/reference/hg19/genome/ucsc.hg19.fasta 160095-T_S2_L004_R1_001.fastq.gz 160095-t_s2_l003_r2_001.fastq.gz

BUT when I use samtools to convert it to a bam file it's empty!

Can anyone advise?

My guess is that your paired files are not properly paired. The names of your reads don't seem to match between your pairs.

I get this error with read 1 and read 2 from lanes 3 and 4 of this sample; all my other samples were fine. I suspected something similar, so I tried every combination of read 1 and read 2 from the two lanes and got the same problem. I also tried concatenating the read 1 files from lanes 3 and 4 and aligning them against the concatenated read 2 files from lanes 3 and 4.

Is there a way to fix the different names of the reads?

How to fix it? Find the right pairs.

What does fastqc / multiqc look like? Any major flags? Is it possible that your forward / reverse reads are randomly sorted?

FastQC was fine, no flags. The problem is limited to read 1 and read 2 from lanes 3 and 4 of this sample. I have 29 samples, and this is the only one with this problem.

2.1 years ago by
Pierre Lindenbaum 118k wrote:

It's not a problem with bwa; it's a problem with your pair of FASTQ files. You're mapping two FASTQs that come from different lanes (L004 and L003).

  • 160095-T_S2_L004_R1_001.fastq.gz should be mapped with 160095-T_S2_L004_R2_001.fastq.gz (R1 and R2)

  • 160095-t_s2_l003_r1_001.fastq.gz should be mapped with 160095-t_s2_l003_r2_001.fastq.gz (R1 and R2)
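One way to avoid mixing lanes by hand is to derive the R2 file name from the R1 file name. This is only a minimal sketch, assuming the files follow the `_R1_001.fastq.gz` / `_R2_001.fastq.gz` naming shown above; `mate_for` is a hypothetical helper, not part of bwa, and `ref.fasta` is a placeholder for the indexed reference:

```shell
# Hypothetical helper: print the R2 file name that matches an R1 file name.
# Assumes Illumina-style names ending in _R1_<digits>.fastq.gz (case-sensitive).
mate_for() {
    printf '%s\n' "$1" | sed 's/_R1_\([0-9]*\)\.fastq\.gz$/_R2_\1.fastq.gz/'
}

# Example use (commented out; ref.fasta is a placeholder):
# for r1 in *_R1_001.fastq.gz; do
#     r2=$(mate_for "$r1")
#     [ -f "$r2" ] || { echo "missing mate for $r1" >&2; continue; }
#     bwa mem -t 4 ref.fasta "$r1" "$r2" > "${r1%_R1_001.fastq.gz}.sam"
# done
```

The substitution is case-sensitive, so a file named with a lowercase `_r1_` would pass through unchanged and should be renamed or handled separately.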

Hey, yeah, that was a typo - nothing was working, so I tried different combinations of lanes just in case.

The above is from read1 and read2 from Lane 3.

I also tried:

read1 lane3 with read2 lane4

I also concatenated R1 from lanes 3 and 4 and tried to align it with the concatenated R2 from lanes 3 and 4, with the same problem.

Thank you for noticing that I didn't properly check my question! I have tried all possible combinations of read 1 and read 2 from lanes 3 and 4 of this sample because it is not aligning, and all have the same outcome: all 4 orientations are skipped. I have 28 samples that aligned well, but not this one...

Do you have any other suggestions as to why this would be? Could the reads have come from different samples? Do you think there was a problem with the sequencing, the library, the labelling, or the pair names? And can you advise on what to do next to fix this problem?


There is something wrong with your FASTQs. You should have the same read names in the R1 and R2 FASTQs.

Check for the same read names:

paste <(gunzip -c file.R1.fastq.gz | paste - - - - | cut -f 1) <(gunzip -c file.R2.fastq.gz | paste - - - - | cut -f 1) | grep -m 1 -F "NS500784:187:HHYNGAFXX:4:11401:17815:1049"

You should get two columns with the same read name for R1 and R2.
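To scan the whole pair rather than grep for a single read, something like the following could report the first position where the names disagree. It is only a sketch: `read_names` and `first_name_mismatch` are hypothetical helper names, it assumes four-line FASTQ records, and it writes all names to temporary files, which may be slow for large inputs.

```shell
# Hypothetical helper: print one read name per record from a gzipped FASTQ,
# dropping the leading "@", any comment after the first space, and a
# trailing /1 or /2 suffix.
read_names() {
    gzip -dc "$1" | awk 'NR % 4 == 1 { sub(/^@/, "", $1); sub(/\/[12]$/, "", $1); print $1 }'
}

# Hypothetical helper: print "<record number>: <R1 name> vs <R2 name>" for the
# first record whose names differ; print nothing if every record matches.
first_name_mismatch() {
    tmp1=$(mktemp)
    tmp2=$(mktemp)
    read_names "$1" > "$tmp1"
    read_names "$2" > "$tmp2"
    paste "$tmp1" "$tmp2" | awk -F '\t' '$1 != $2 { print NR ": " $1 " vs " $2; exit }'
    rm -f "$tmp1" "$tmp2"
}
```

Usage would be `first_name_mismatch file.R1.fastq.gz file.R2.fastq.gz`; empty output means the names line up record by record.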

How to fix this? I wouldn't trust this kind of data anyway. There is something wrong in your data.

When you notice data not aligning as expected, the first thing to do is take a few reads and BLAST them at NCBI to make sure you are looking at the right genome/sample. It wouldn't be the first time a sample has been contaminated somewhere along the line, so that the data is not correct to begin with.
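To pull a few reads out for a BLAST check, one option is to convert the first records of a FASTQ to FASTA. A rough sketch, assuming four-line FASTQ records; `fastq_head_to_fasta` is a hypothetical helper name:

```shell
# Hypothetical helper: read FASTQ on stdin and print the first N records
# as FASTA (header line with ">" instead of "@", then the sequence).
fastq_head_to_fasta() {
    head -n "$(( $1 * 4 ))" | awk 'NR % 4 == 1 { print ">" substr($0, 2) } NR % 4 == 2 { print }'
}
```

For example, `gzip -dc file.R1.fastq.gz | fastq_head_to_fasta 5` prints five sequences that can be pasted into the NCBI BLAST web form.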

Have you scanned and trimmed your data for presence of adapters?

If all that checks out, then try the tool from the BBMap suite that can fix read-order issues in paired-end data files. You would pass it arguments like in1=r1.fq.gz in2=r2.fq.gz out1=fixed1.fq.gz out2=fixed2.fq.gz outsingle=singletons.fq.gz

It appears that the names of the files do not match the data. From your example:

"NS500784:187:HHYNGAFXX:4:11401:17815:1049", "NS500784:187:HHYNGAFXX:1:11101:13503:1043"

The first read is from lane 4 (the number after the flow cell identifier HHYNGAFXX), while the second is from lane 1.
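To see which lanes a file really contains, the lane field can be tallied across all read names. A minimal sketch, assuming Illumina-style names where the lane is the fourth colon-separated field (as in the reads quoted above); `lane_counts` is a hypothetical helper name:

```shell
# Hypothetical helper: read FASTQ on stdin and print "<lane><TAB><read count>"
# for each lane found in the read names, sorted by lane number.
lane_counts() {
    awk 'NR % 4 == 1 { split($1, f, ":"); n[f[4]]++ }
         END { for (lane in n) print lane "\t" n[lane] }' | sort -n
}
```

Running `gzip -dc 160095-T_S2_L003_R1_001.fastq.gz | lane_counts` on each file would show whether a file labelled L003 actually holds reads from lane 3 only.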

