SAM Validation Error with CleanSam (Alignment start must be <= reference seq length)
7 weeks ago
shpak.max • 0

I ran several bam files through a pipeline with CleanSam, SortSam, and MarkDuplicates without a problem.

However, one of the input files gave me the following error with CleanSam:

ERROR: Record 2106053, Read name A00187:414:HMYCYDSXY:3:1426:13367:11083, Alignment start   (21157039) must be <= reference sequence length (21154825) on reference 7


Because all of the bam files were generated from libraries in the same dataset, using the same pipeline, and aligned/mapped to the same reference genome, I'm having difficulty knowing where to begin troubleshooting this error. The Picard command that I used is:

"java -Xmx" . $mem . "g -Djava.io.tmpdir=pwd/tmp -jar " .$picard . "CleanSam.jar INPUT=" . $BFile[$i] . ".bam OUTPUT= " . $BFile[$i] . "clean.bam";


where $BFile[$i] is just the prefix taken from a glob list of input bam file names, S1.bam ... S8.bam.

Any suggestions on where to start? Since I'm using the same reference genome here as for the alignment, I don't understand how it's possible to get coordinates beyond the length of a reference sequence.

Samtools Picard • 666 views
1

Try using VALIDATION_STRINGENCY=LENIENT.

0

Could you please explain why I'm getting this error message to begin with?

Additionally, I assume that I will have to use this for every stage of the pipeline, i.e. CleanSam, SortSam, MarkDuplicates, etc.?

0

Could you please explain why I'm getting this error message to begin with?

For example, if a read is mapped at the very end of chr1 but contains some clipped bases, then its unclipped 3' end will lie beyond the end of chromosome 1.

The best approach is to look at the read A00187:414:HMYCYDSXY:3:1426:13367:11083 itself.
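One way to check this directly is sketched below in Python, assuming the BAM has first been dumped to SAM text (for example with `samtools view -h`). The helper names `parse_reference_lengths` and `find_out_of_range` are hypothetical and not part of Picard or samtools. The sketch reads the reference lengths from the @SQ header lines and lists every alignment whose start position lies past the end of its reference, which is the condition CleanSam reports.

```python
def parse_reference_lengths(sam_lines):
    """Collect {reference name: length} from the @SQ header lines."""
    lengths = {}
    for line in sam_lines:
        if line.startswith("@SQ"):
            # Header fields after the record type look like TAG:VALUE.
            fields = dict(
                f.split(":", 1) for f in line.rstrip("\n").split("\t")[1:]
            )
            lengths[fields["SN"]] = int(fields["LN"])
    return lengths


def find_out_of_range(sam_lines):
    """Return (read name, reference, POS, reference length) for each
    alignment whose 1-based start lies beyond the end of its reference."""
    sam_lines = list(sam_lines)
    lengths = parse_reference_lengths(sam_lines)
    hits = []
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue  # skip header and empty lines
        cols = line.rstrip("\n").split("\t")
        qname, rname, pos = cols[0], cols[2], int(cols[3])
        # Unmapped reads (RNAME "*") are not in the header and are skipped.
        if rname in lengths and pos > lengths[rname]:
            hits.append((qname, rname, pos, lengths[rname]))
    return hits
```

Passing it the lines of the SAM dump for the failing library should list the read named in the error, together with any others that have the same problem, which shows whether this is an isolated record or a systematic mismatch with the header.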

0

I assume that I will have to use this for every stage of the pipeline

Yes, but...

Use samtools sort instead of SortSam.

Use sambamba markdup instead of MarkDuplicates.

0

Running the bam files through the pipeline with VALIDATION_STRINGENCY=LENIENT does generate the desired cleaned/sorted files. However, when I attempt to run the resulting bam file through GATK, I get the error:

MESSAGE: Input files reads and reference have incompatible contigs: Found contigs with the same name but different lengths:
##### ERROR   contig reads = 11 / 368909
##### ERROR   contig reference = 11 / 368976


If, as you suggest, the reads were "overhanging" the reference sequence, would I expect to see this error because of mismatches between the mapped read coordinates and the reference sequence? If so, are there any arguments I can pass to GATK to correct it?

Note: this error is not due to using inconsistent reference genomes. I used the same Drosophila melanogaster reference genome for this alignment as for all other libraries. The only difference is that I had to use lenient validation stringency during the CleanSam/SortSam/MarkDuplicates phase of the pipeline for this particular library.

0

The reference used to map the reads is not the same as the one you're using for GATK.
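A quick way to see which side disagrees is to compare the contig lengths recorded in the BAM header (the @SQ lines, e.g. from `samtools view -H`) with those in the FASTA index (`.fai`) of the reference passed to GATK. What follows is a minimal Python sketch of that comparison. The function names are hypothetical, and the inputs are assumed to be plain text lines from those two sources.

```python
def header_lengths(header_lines):
    """Map contig name -> length from the @SQ lines of a SAM/BAM header."""
    lengths = {}
    for line in header_lines:
        if line.startswith("@SQ"):
            fields = dict(
                f.split(":", 1) for f in line.rstrip("\n").split("\t")[1:]
            )
            lengths[fields["SN"]] = int(fields["LN"])
    return lengths


def fai_lengths(fai_lines):
    """Map contig name -> length from a FASTA index (.fai); the first two
    tab-separated columns are the sequence name and its length."""
    lengths = {}
    for line in fai_lines:
        if line.strip():
            cols = line.rstrip("\n").split("\t")
            lengths[cols[0]] = int(cols[1])
    return lengths


def contig_length_mismatches(header_lines, fai_lines):
    """Return {contig: (length in BAM header, length in reference)} for
    contigs present in both whose lengths differ."""
    bam = header_lengths(header_lines)
    ref = fai_lengths(fai_lines)
    return {
        name: (bam[name], ref[name])
        for name in bam
        if name in ref and bam[name] != ref[name]
    }
```

An empty result would mean the header and the reference agree on every shared contig. A non-empty result, such as contig 11 with lengths 368909 and 368976, would mean the BAM was produced against a FASTA that differs from the one given to GATK, even if both are nominally the same genome release.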

0

I'm using the same Drosophila reference genome for mapping as for GATK, which is why I'm not sure why I'm getting this error message from the latter.

Could you please clarify what you mean, as the phrasing is ambiguous? Are you suggesting that I'm getting this error message because I'm failing to use the same reference genome for both mapping and GATK, or because I'm using the same reference genome but shouldn't be? Presumably you mean the former, but as I said, I'm using the same reference genome throughout the pipeline.

That is why I think this error is somehow a consequence of using VALIDATION_STRINGENCY=LENIENT in the CleanSam/SortSam steps, as that is the only thing that has changed in the pipeline.