I have been trying to call some variants with GATK. It complains, rightly, that my the length of the contigs in my BAM file is slightly different to the reference FASTA file that was used for the initial mapping:
##### ERROR MESSAGE: Input files reads and reference have incompatible contigs: Found contigs with the same name but different lengths:
##### ERROR contig reads = Contig_10292 / 7379
##### ERROR contig reference = Contig_10292 / 7378.
I did a comparison of the contig lengths in the BAM (using samtools idxstats), the actual length of the contigs in the FASTA file, and the lengths reported in the .fai index of the reference file. Here are some examples:
Contig_27603 576 577
Contig_27719 1471 1472
Contig_28013 577 578
Contig_28044_0 7967 7968
Contig_28044_1 1536 1537
Contig_28044_4 689 690
The first number is the real length as checked manually in the FASTA file, and is reported correctly in the .fai file. The second number is the contig length in the BAM file.
For every contig, the BAM file thinks contigs are 1 base longer than they actually are.
Oddly, samtools mpileup is perfectly happy to call variants from these bam files using the same reference, and it is definitely the same reference FASTA file that was used for the remapping originally. Also, the BAM files look fine in IGV - they are the right length and contigs start and end with the sequence they are supposed to. (Is contig length stored as a variable somewhere else in BAM files?)
Any idea why this happened, and if it can be fixed by editing the BAM files?
Thanks a million!
Ah, could be --- there is a trailing space before the newline at the end of each FASTA sequence in the reference. Seems I might be able to use the techniques here: BAM header edit
This was the problem. samtools was getting the length for the header from the total length of the FASTA line. Fixed it up and works now!