Picard ValidateSamFile Error: "ValidateSamFile Value was put into PairInfoMap more than once"
9 weeks ago

I'm having a problem calling SNPs from a sorted BAM file with GATK HaplotypeCaller. The file is a merge of two technical replicates that were each sequenced on a different lane and aligned to the reference genome. When I ran Picard ValidateSamFile on the BAM file, I got the following error: "ValidateSamFile Value was put into PairInfoMap more than once." I also ran ValidateSamFile on the intermediate SAM file for this sample (before it was converted to BAM and sorted), and it did not report that error. So it appears that something about the read group information was disrupted during the step where the SAM file was converted to BAM and sorted with samtools sort.

Has anyone encountered this before, or does anyone have ideas as to what would cause this issue? I tried aligning this sample to the reference genome again using the -M option in bwa mem (initially I didn't use -M), but that didn't resolve the issue. I haven't been able to find much information about this error, other than suggestions to run Picard AddOrReplaceReadGroups to rename the read group information. I will likely try this next, but I'm going to need to call SNPs on a handful of files like this, so if possible I'd like to figure out and fix the underlying problem before I resort to replacing the read group information for each sample.

samtools bam sam picard alignment • 365 views

"ValidateSamFile Value was put into PairInfoMap more than once."

This is the first part of the message. What is the second part? ( https://github.com/samtools/htsjdk/blob/master/src/main/java/htsjdk/samtools/CoordinateSortedPairInfoMap.java#L132 )

 throw new SAMException("Value was put into PairInfoMap more than once.  " +
sequenceIndex + ": " + keyAndRecord.getKey());

Thanks for your reply. Unfortunately, there is no other part to the error message. After "ValidateSamFile Value was put into PairInfoMap more than once" there is a "1:" followed by a 31 digit sequence, and then things like the date/time and a finished message are output.

followed by a 31 digit sequence

samtools view in.bam | awk '$1=="the-31-digit-sequence"'


what is the output ?

It found 6 matches:

129 A01 947 0   73M78S  A02 31442037    0   AAACCCTAAACCCTAAACCCTAAACCCTAAACACTAAACCCTAAACCCGAGACCCTAAACCCTAATACCTAATGCGTATAGCCTAGAGGGTGCACCCTAAAGGCTATACCGGAGGGGGGGGGGCGGGGGGGGGGGGGGGGGGGGGGGGGGG FFFFFFFFF:FFFFFFFFFFFFF:FFFFF::F,:,,,,,:F,F,,,,F,F,,F,,,:FF::,:,,,,,FF,F,,,,FF,,,:,,,,,,,,F,,,,,::,,,,,F,:,:,,,F,,,F,:,,,FF,,F,F::F,FF,::F::,:FFFFF:FF, NM:i:4  MD:Z:32C15T1A14A7   MC:Z:45M106S    AS:i:53 XS:i:52SA:Z:A04,11090195,+,115S36M,0,0;

129 A01 947 0   73M78S  A02 31442037    0   AAACCCTAAACCCTAAACCCTAAACCCTAAACACTAAACCCTAAACCCGAGACCCTAAACCCTAATACCTAATGCGTATAGCCTAGAGGGTGCACCCTAAAGGCTATACCGGAGGGGGGGGGGCGGGGGGGGGGGGGGGGGGGGGGGGGGG FFFFFFFFF:FFFFFFFFFFFFF:FFFFF::F,:,,,,,:F,F,,,,F,F,,F,,,:FF::,:,,,,,FF,F,,,,FF,,,:,,,,,,,,F,,,,,::,,,,,F,:,:,,,F,,,F,:,,,FF,,F,F::F,FF,::F::,:FFFFF:FF, NM:i:4  MD:Z:32C15T1A14A7   MC:Z:45M106S    AS:i:53 XS:i:52SA:Z:A07,18144323,-,5S36M110S,0,0;

65  A02 31442037    0   45M106S A01 947 0   GTTTAGGGTTTAGGGTTTAGGGTTTAGGGTTTAGGTTTGAGGGGTAAGGGTATATCGTGTAGGCTGTAGGTTTTATGGTGTAGGGTGTATGGTGTAGGGTTTAGGGTGTAGGGGGTAGGGGGTAGGGGGTCGGGGGTGGGGTGGTGGGTGG FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,,,,,F,F,,,,:,,FF,F,,,,F,,FF:,,,FF,,,,,,F:,:,,FFFF,,,:,,F:,,,:FF,F:FFFFFF:,,FFF,:,,FFF,F,,FFF,F,,:FF,F,,FFF,:,,FFF,F: NM:i:1  MD:Z:38T6   MC:Z:73M78S AS:i:40 XS:i:38

65  A02 31442037    0   45M106S A01 947 0   GTTTAGGGTTTAGGGTTTAGGGTTTAGGGTTTAGGTTTGAGGGGTAAGGGTATATCGTGTAGGCTGTAGGTTTTATGGTGTAGGGTGTATGGTGTAGGGTTTAGGGTGTAGGGGGTAGGGGGTAGGGGGTCGGGGGTGGGGTGGTGGGTGG FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,,,,,F,F,,,,:,,FF,F,,,,F,,FF:,,,FF,,,,,,F:,:,,FFFF,,,:,,F:,,,:FF,F:FFFFFF:,,FFF,:,,FFF,F,,FFF,F,,:FF,F,,FFF,:,,FFF,F: NM:i:1  MD:Z:38T6   MC:Z:73M78S AS:i:40 XS:i:38

385 A04 11090195    0   115H36M A02 31442037    0   GGGGGGGGCGGGGGGGGGGGGGGGGGGGGGGGGGGG    F,:,,,FF,,F,F::F,FF,::F::,:FFFFF:FF,    NM:i:0  MD:Z:36 MC:Z:45M106H    AS:i:36 XS:i:36 SA:Z:A01,947,+,73M78S,0,4;

401 A07 18144323    0   5H36M110H   A02 31442037    CCCCCCCCCCCCCCCCCCCCCCGCCCCCCCCCCTCC    FFFF:,::F::,FF,F::F,F,,FF,,,:,F,,,F,    NM:i:0  MD:Z:36 MC:Z:45M106H    AS:i:36 XS:i:36 SA:Z:A01,947,+,73M78S,0,4;

Where is "the-31-digit-sequence" in the first column?

I didn't copy the first column with that information. I didn't know that sequence was important, but it was there in the first column. Would that help to figure out what the issue is?

It was A00975:57:HHGM7DRXX:2:2122:1922:36949 for each of the 6 rows

This doesn't look like a SAM file. For example, the CIGAR string 73M78S should be in the 6th column, while in your example it's in the 5th column.
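
For reference, the column positions can be checked against the 11 mandatory SAM fields. Below is a minimal Python sketch; the helper name `parse_sam_line` and the shortened example record are illustrative assumptions, not part of the original output. It shows that once the read name is present as column 1, the CIGAR string falls in column 6.

```python
# Names of the 11 mandatory SAM columns, in order (per the SAM specification).
SAM_COLUMNS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
               "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]


def parse_sam_line(line):
    """Split one tab-separated alignment line into named mandatory fields.

    Hypothetical helper for illustration; optional tags are kept as a list.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(SAM_COLUMNS):
        raise ValueError(f"expected at least 11 columns, got {len(fields)}")
    record = dict(zip(SAM_COLUMNS, fields))
    record["TAGS"] = fields[len(SAM_COLUMNS):]
    return record


# Shortened, made-up record: SEQ and QUAL are truncated placeholders.
example = "\t".join([
    "A00975:57:HHGM7DRXX:2:2122:1922:36949", "129", "A01", "947", "0",
    "73M78S", "A02", "31442037", "0", "AAACCC", "FFFFFF", "NM:i:4",
])
rec = parse_sam_line(example)
print(rec["CIGAR"], "is column", SAM_COLUMNS.index("CIGAR") + 1)
```

A record pasted without its first column would fail the "at least 11 columns" check only if it also had no optional tags, so counting columns by eye, as done above, is still worthwhile.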

8 weeks ago

So you have the same read name, A00975:57:HHGM7DRXX:2:2122:1922:36949, multiple times.

A00975:57:HHGM7DRXX:2:2122:1922:36949 129 A01 947 0   73M78S  A02 31442037    0   AAACCCTAAACCCTAAACCCTAAACCCTAAACACTAAACCCTAAACCCGAGACCCTAAACCCTAATACCTAATGCGTATAGCCTAGAGGGTGCACCCTAAAGGCTATACCGGAGGGGGGGGGGCGGGGGGGGGGGGGGGGGGGGGGGGGGG FFFFFFFFF:FFFFFFFFFFFFF:FFFFF::F,:,,,,,:F,F,,,,F,F,,F,,,:FF::,:,,,,,FF,F,,,,FF,,,:,,,,,,,,F,,,,,::,,,,,F,:,:,,,F,,,F,:,,,FF,,F,F::F,FF,::F::,:FFFFF:FF, NM:i:4  MD:Z:32C15T1A14A7   MC:Z:45M106S    AS:i:53 XS:i:52SA:Z:A04,11090195,+,115S36M,0,0;

A00975:57:HHGM7DRXX:2:2122:1922:36949 129 A01 947 0   73M78S  A02 31442037    0   AAACCCTAAACCCTAAACCCTAAACCCTAAACACTAAACCCTAAACCCGAGACCCTAAACCCTAATACCTAATGCGTATAGCCTAGAGGGTGCACCCTAAAGGCTATACCGGAGGGGGGGGGGCGGGGGGGGGGGGGGGGGGGGGGGGGGG FFFFFFFFF:FFFFFFFFFFFFF:FFFFF::F,:,,,,,:F,F,,,,F,F,,F,,,:FF::,:,,,,,FF,F,,,,FF,,,:,,,,,,,,F,,,,,::,,,,,F,:,:,,,F,,,F,:,,,FF,,F,F::F,FF,::F::,:FFFFF:FF, NM:i:4  MD:Z:32C15T1A14A7   MC:Z:45M106S    AS:i:53 XS:i:52SA:Z:A07,18144323,-,5S36M110S,0,0;


The first and the second records have the very same sequence and the very same SAM flag, 129 = read paired (0x1) + second in pair (0x80). It should be impossible to find the very same read twice in a BAM file, so there is something broken in your upstream workflow.

Funnily, the first two records have the very same sequence (AAACCCTAAACCCTAAACCC(...)GG) and are mapped at the very same place, but they don't have the same SA attributes: 'SA:Z:A04,11090195,+,115S36M,0,0;' vs 'SA:Z:A07,18144323,-,5S36M110S,0,0;'
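
To make the flag arithmetic explicit, here is a small Python sketch that decodes the four flag values seen in the six rows (129, 65, 385, 401). The bit meanings come from the SAM specification; the function name `decode_flag` is just an illustrative choice.

```python
# Bit meanings of the SAM FLAG field (from the SAM specification).
FLAG_BITS = {
    0x1: "read paired",
    0x2: "proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary alignment",
    0x200: "fails quality checks",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def decode_flag(flag):
    """Return the names of all bits set in a SAM flag, lowest bit first."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


for flag in (129, 65, 385, 401):
    print(flag, decode_flag(flag))
```

Flags 129 and 65 are the primary records for the second and first read of the pair. Flags 385 and 401 additionally carry the secondary bit (0x100), which is how `bwa mem -M` marks shorter split hits, so those two rows are probably extra alignments of the same read rather than separate reads.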

Thanks! This is very confusing. I did not get this error when I ran ValidateSamFile on the very same sample's SAM file. Does this mean that the error arises specifically when I convert from SAM to BAM with samtools, and that reads are somehow being handled incorrectly?

How did you run bwa mem?

Try to find the read "A00975:57:HHGM7DRXX:2:2122:1922:36949" in your fastqs. Check the sequence and the number of occurrences.
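
One way to run that check across a whole file, rather than for a single read, is to count every read ID. The following is a minimal Python sketch under stated assumptions: the FASTQ uses strict 4-line records (no wrapped sequences), and the function names `count_read_ids` and `duplicated_ids` are hypothetical.

```python
import gzip
from collections import Counter


def count_read_ids(fastq_path):
    """Count how often each read ID occurs in a FASTQ file (plain or .gz).

    Assumes strict 4-line records; the ID is the header text after '@'
    up to the first whitespace.
    """
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    counts = Counter()
    with opener(fastq_path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 != 0:
                continue  # only the first line of each record is a header
            parts = line[1:].split()
            if parts:
                counts[parts[0]] += 1
    return counts


def duplicated_ids(counts):
    """Keep only the read IDs that occur more than once."""
    return {read_id: n for read_id, n in counts.items() if n > 1}
```

In a well-formed paired-end run, each ID should appear exactly once in the R1 file and once in the R2 file, so `duplicated_ids(count_read_ids(path))` would be expected to return an empty dict for each file.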

Okay, so I found 4 matches for the read ID "A00975:57:HHGM7DRXX:2:2122:1922:36949" in my raw fastqs: 2 were the sequence starting with GTTT and 2 were the sequence starting with AAAC. The other two sequences were not found.

The basic steps of my pipeline were:

1. aligned to the reference genome with bwa mem: bwa mem -t8 -M Brapa_v3.0.fasta Library-17_S17_L001_R1_001.fastq.gz Library-17_S17_L001_R2_001.fastq.gz > Library-17_S17_L001.paired.aligned.sam

Here Brapa_v3.0.fasta is the reference genome, Library-17_S17_L001_R1_001.fastq.gz is the forward read file and Library-17_S17_L001_R2_001.fastq.gz is the reverse read file from paired end sequencing. I also repeated this step for a second pair of files for Library-17, L002, which was a technical replicate run on another flow cell to get enough coverage of the library.

2. sorted each sam file with samtools sort
3. merged the two sam files (technical replicates) with samtools merge
4. sorted the resulting, merged sam file and converted to bam with samtools sort
5. indexed final bam file with samtools index
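
The steps above can be sketched as a list of shell commands. This is a hedged Python sketch, not the exact commands that were run: the output file names, the L002 FASTQ names (assumed to follow the L001 pattern), and the read-group fields passed to `bwa mem -R` are all assumptions added for illustration. A separate read group per lane is included because the question involves read group information and merged lanes.

```python
def pipeline_commands(reference, sample, lanes):
    """Build shell commands for the align / sort / merge / index steps.

    `lanes` maps a lane label to its (R1, R2) FASTQ paths. Output file
    names and read-group fields are illustrative assumptions.
    """
    commands = []
    sorted_bams = []
    for lane, (r1, r2) in lanes.items():
        sam = f"{sample}_{lane}.paired.aligned.sam"
        bam = f"{sample}_{lane}.sorted.bam"
        # One read group per lane, so lanes stay distinguishable after merging.
        read_group = f"@RG\\tID:{sample}.{lane}\\tSM:{sample}\\tPL:ILLUMINA"
        commands.append(
            f"bwa mem -t8 -M -R '{read_group}' {reference} {r1} {r2} > {sam}"
        )
        commands.append(f"samtools sort -o {bam} {sam}")
        sorted_bams.append(bam)
    merged = f"{sample}.merged.bam"
    commands.append(f"samtools merge {merged} {' '.join(sorted_bams)}")
    commands.append(f"samtools index {merged}")
    return commands


lanes = {
    "L001": ("Library-17_S17_L001_R1_001.fastq.gz",
             "Library-17_S17_L001_R2_001.fastq.gz"),
    # Assumed names for the second lane, following the L001 pattern.
    "L002": ("Library-17_S17_L002_R1_001.fastq.gz",
             "Library-17_S17_L002_R2_001.fastq.gz"),
}
for command in pipeline_commands("Brapa_v3.0.fasta", "Library-17_S17", lanes):
    print(command)
```

The sketch leaves out the second sort of the merged file, since `samtools merge` expects sorted inputs and writes a sorted output. None of these steps removes duplicated read names, so duplicates present in the FASTQ files would carry through to the final BAM.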

So I have 6 reads with the same read ID in my final BAM file. Interestingly, the 4 reads with the same ID that I found in my raw fastqs all came from the Lane 2 fastq files, so the duplicate reads don't appear to be due to the technical replicate. And again, the 5th and 6th records with that read ID were not found in the raw fastq files.

Do you still believe this to be an issue with the pipeline? I'm not sure what to make of 4 duplicate read IDs in the raw fastq files AND 2 additional duplicates that show up in the resulting BAM file.

Okay, so I found 4 matches for the read ID "A00975:57:HHGM7DRXX:2:2122:1922:36949" in my raw fastqs: 2 were the sequence starting with GTTT and 2 were the sequence starting with AAAC. The other two sequences were not found.

This is your problem. You're asking ValidateSamFile to do its job, and its job is to check that two reads don't share the same ID. There is a problem with the upstream process that generated the fastq files.
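
The kind of check involved can be approximated in a few lines. The sketch below is not Picard's or htsjdk's implementation; it is a simplified Python illustration (with a hypothetical function name) that counts primary records per read name and mate, using only the flag values reported earlier in this thread.

```python
from collections import Counter


def duplicate_primary_reads(sam_lines):
    """Return (read name, mate) keys seen more than once among primary records.

    Simplified illustration only; header lines and secondary or
    supplementary alignments are skipped.
    """
    seen = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        if flag & 0x100 or flag & 0x800:
            continue  # secondary or supplementary alignment
        if flag & 0x40:
            mate = "R1"
        elif flag & 0x80:
            mate = "R2"
        else:
            mate = "unpaired"
        seen[(name, mate)] += 1
    return {key: n for key, n in seen.items() if n > 1}


# Only the read name and flag of the six rows are used here; the
# remaining SAM columns are omitted for brevity.
name = "A00975:57:HHGM7DRXX:2:2122:1922:36949"
rows = [f"{name}\t{flag}" for flag in (129, 129, 65, 65, 385, 401)]
print(duplicate_primary_reads(rows))
```

For real data, the same function could be fed the text output of `samtools view in.bam` line by line.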

Okay, so this issue is something that was already present when I got these raw fastq files from the sequencing facility and not with my pipeline. That is what you think based on the duplicate reads in the raw fastq files, correct?