Question: Length Of Contigs In Bam File Off By 1 Versus Reference Fasta
0
gravatar for Tancata
5.9 years ago by
Tancata200
Newcastle, UK
Tancata200 wrote:

I have been trying to call some variants with GATK. It complains, rightly, that my the length of the contigs in my BAM file is slightly different to the reference FASTA file that was used for the initial mapping:

##### ERROR MESSAGE: Input files reads and reference have incompatible contigs: Found contigs with the same name but different lengths:
##### ERROR   contig reads = Contig_10292 / 7379
##### ERROR   contig reference = Contig_10292 / 7378.

I did a comparison of the contig lengths in the BAM (using samtools idxstats), the actual length of the contigs in the FASTA file, and the lengths reported in the .fai index of the reference file. Here are some examples:

Contig_27603 576 577

Contig_27719 1471 1472

Contig_28013 577 578

Contig_28044_0 7967 7968

Contig_28044_1 1536 1537

Contig_28044_4 689 690

The first number is the real length as checked manually in the FASTA file, and is reported correctly in the .fai file. The second number is the contig length in the BAM file.

For every contig, the BAM file thinks contigs are 1 base longer than they actually are.

Oddly, samtools mpileup is perfectly happy to call variants from these bam files using the same reference, and it is definitely the same reference FASTA file that was used for the remapping originally. Also, the BAM files look fine in IGV - they are the right length and contigs start and end with the sequence they are supposed to. (Is contig length stored as a variable somewhere else in BAM files?)

Any idea why this happened, and if it can be fixed by editing the BAM files?

Thanks a million!

gatk ngs samtools • 2.7k views
ADD COMMENTlink modified 5.9 years ago by swbarnes25.3k • written 5.9 years ago by Tancata200
1
gravatar for swbarnes2
5.9 years ago by
swbarnes25.3k
United States
swbarnes25.3k wrote:

You might have some kind of white space in the last character of each line of your reference that is causing the problem.

Maybe you can fix the headers in the .bam file, and that will be good enough.

ADD COMMENTlink written 5.9 years ago by swbarnes25.3k

Ah, could be --- there is a trailing space before the newline at the end of each FASTA sequence in the reference. Seems I might be able to use the techniques here: BAM header edit

ADD REPLYlink modified 5.9 years ago • written 5.9 years ago by Tancata200

This was the problem. samtools was getting the length for the header from the total length of the FASTA line. Fixed it up and works now!

ADD REPLYlink written 5.9 years ago by Tancata200
0
gravatar for Istvan Albert
5.9 years ago by
Istvan Albert ♦♦ 80k
University Park, USA
Istvan Albert ♦♦ 80k wrote:

Well the BAM file has a zero based coordinate system that starts at 0 and does not include the last coordinate. This really sounds like a type of problem you get when mixing different coordinate systems.

ADD COMMENTlink modified 5.9 years ago • written 5.9 years ago by Istvan Albert ♦♦ 80k

"There are only two hard things in computer science: naming things, cache invalidation and off-by-one errors" from here A: What are the most common stupid mistakes in bioinformatics? :-D

ADD REPLYlink modified 5.9 years ago • written 5.9 years ago by Manu Prestat3.9k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1778 users visited in the last hour