Error using BWA to map environmental transcriptome against a genomic reference
1
0
Entering edit mode
2.3 years ago
John ▴ 20

I have quality controlled paired end environmental transcriptomic data that I want to map against a reference database of 8 cyanobacterial genomes. I made this reference by joining together the fasta files of each genome.

I performed this mapping with BWA v0.7.3a but noticed that when I tried to count the reads mapped to each gene in the reference with htseq-count, the SAM file output from BWA was corrupted.

Here is my BWA command:

bwa mem -k 10 -aM -t 128 Merged_reference Input_file1 Input_file2 > Out.sam

Here is my HTSeq command:

htseq-count -f sam -r pos -a 0 -m intersection-strict -s no Out.sam Merged_ref2.gtf2 > HTSeq_out

This is the error I see when running HTSeq:

[W::sam_parse1] mapped query cannot have zero coordinate; treated as unmapped
[E::sam_parse1] CIGAR and query sequence are of different length
[W::sam_read1_sam] Parse error at line 8839134
Error occured when processing input (record #8809785 in file Out.sam):
  truncated file
  [Exception type: OSError, raised in libcalignmentfile.pyx:1876]

Here is what the SAM file looks like at lines 8839134 and 8839135:

A00521:380:HFKMLDSX5:4:1163:32081:25332 65      lcl|NC_013771.1_cds_WP_012953431.1_1    0       0       268435455S89S   lcl|NC_013771.1_cds_WP_012954293.1_845  87      0       Sequence   Quality     NM:i:8388607    AS:i:73 XS:i:72

A00521:380:HFKMLDSX5:4:1163:32081:25332 129     lcl|NC_013771.1_cds_WP_012954293.1_845  87      0       2S15M72S        lcl|NC_013771.1_cds_WP_012953431.1_1    0       0      Sequence  Quality  NM:i:0  AS:i:15 XS:i:15

I know that if I change these lines to the following and run htseq on a subset of my data up to the changed lines the error disappears:

A00521:380:HFKMLDSX5:4:1163:32081:25332 65      *    0       0       *   *      0     0 Sequence   Quality     NM:i:8388607    AS:i:73 XS:i:72

A00521:380:HFKMLDSX5:4:1163:32081:25332 129     *    0       0       *   *      0     0      Sequence  Quality  NM:i:0  AS:i:15 XS:i:15

While this change makes sense to me because the weird sequence causing this should be treated as an unmapped read, unfortunately I have noticed that this error occurs approximately every 4 million lines in my SAM file. Since I have ~115 million lines in the SAM file and 60+ SAM files I would like to get to the bottom of this to efficiently solve this problem.

Has anyone else seen this issue? I can't figure out why BWA would be making this mistake!

A few other notes, every time there is this error it is triggered by the lcl|NC_013771.1_cds_WP_012954293.1_845 sequence. This sequence is the first one in my reference file.

BWA • 935 views
ADD COMMENT
1
Entering edit mode

First step is to run samtools quickcheck -qvvv x.bam on every bam to see whether it's anything obvious on all or just some files.

ADD REPLY
2
Entering edit mode
2.3 years ago
John ▴ 20

I updated my version of BWA to v0.7.17 and this issue is resolved.

ADD COMMENT

Login before adding your answer.

Traffic: 1712 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6