Length Of Contigs In Bam File Off By 1 Versus Reference Fasta
2
0
Entering edit mode
10.9 years ago
Tancata ▴ 210

I have been trying to call some variants with GATK. It complains, rightly, that my the length of the contigs in my BAM file is slightly different to the reference FASTA file that was used for the initial mapping:

##### ERROR MESSAGE: Input files reads and reference have incompatible contigs: Found contigs with the same name but different lengths:
##### ERROR   contig reads = Contig_10292 / 7379
##### ERROR   contig reference = Contig_10292 / 7378.

I did a comparison of the contig lengths in the BAM (using samtools idxstats), the actual length of the contigs in the FASTA file, and the lengths reported in the .fai index of the reference file. Here are some examples:

Contig_27603 576 577

Contig_27719 1471 1472

Contig_28013 577 578

Contig_28044_0 7967 7968

Contig_28044_1 1536 1537

Contig_28044_4 689 690

The first number is the real length as checked manually in the FASTA file, and is reported correctly in the .fai file. The second number is the contig length in the BAM file.

For every contig, the BAM file thinks contigs are 1 base longer than they actually are.

Oddly, samtools mpileup is perfectly happy to call variants from these bam files using the same reference, and it is definitely the same reference FASTA file that was used for the remapping originally. Also, the BAM files look fine in IGV - they are the right length and contigs start and end with the sequence they are supposed to. (Is contig length stored as a variable somewhere else in BAM files?)

Any idea why this happened, and if it can be fixed by editing the BAM files?

Thanks a million!

ngs samtools gatk • 4.5k views
ADD COMMENT
1
Entering edit mode
10.9 years ago

You might have some kind of white space in the last character of each line of your reference that is causing the problem.

Maybe you can fix the headers in the .bam file, and that will be good enough.

ADD COMMENT
0
Entering edit mode

Ah, could be --- there is a trailing space before the newline at the end of each FASTA sequence in the reference. Seems I might be able to use the techniques here: BAM header edit

ADD REPLY
0
Entering edit mode

This was the problem. samtools was getting the length for the header from the total length of the FASTA line. Fixed it up and works now!

ADD REPLY
0
Entering edit mode
10.9 years ago

Well the BAM file has a zero based coordinate system that starts at 0 and does not include the last coordinate. This really sounds like a type of problem you get when mixing different coordinate systems.

ADD COMMENT
0
Entering edit mode

"There are only two hard things in computer science: naming things, cache invalidation and off-by-one errors" from here A: What are the most common stupid mistakes in bioinformatics? :-D

ADD REPLY

Login before adding your answer.

Traffic: 1782 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6