Question: Duplicate @Sq Lines In Sam File Header
1
gravatar for RamRS
6.0 years ago by
RamRS24k
Houston, TX
RamRS24k wrote:

Hi,

I'm working on RNA-seq data and have aligned my reads to the assembled transcript using bwa-mem. When I tried to run Picard Tools (SamFormatConverter.jar), the SAM dictionary threw an exception "Cannot add sequence that already exists in SAMSequenceDictionary".

I then checked for duplication in the SAM file header and sure enough, two entries were duplicated. How do I correct this? Commands used are given below:

bwa index contigs.fa bwa mem contigs.fa reads_1.fq reads_2.fq >samFile.sam
java -Xmx4g -jar SamFormatConverter.jar INPUT=samFile.sam OUTPUT=bamFile.bam

This is where I faced the error.

I checked for duplicates using:

samtools view -HS samFile.sam | sort | uniq -c

2 @SQ entries were present >1 times.

Any help would be appreciated, thanks!

samtools duplicates • 4.5k views
ADD COMMENTlink modified 2.8 years ago • written 6.0 years ago by RamRS24k
1

I do not know whether it's right nor not, but when using bwa mem, you do not use -M parameter, Mark short split hits as secondary (which for Picard compatibility), may be you should try this before go to picard.

ADD REPLYlink modified 6.0 years ago • written 6.0 years ago by liupfskygre190
1

Hi, A fresh run with the -M option turned on still gives the error when I try Picard Tools! :-(

ADD REPLYlink written 6.0 years ago by RamRS24k

I think that's the problem! I'll run bwa mem again with the -M option. Thank you!

ADD REPLYlink written 6.0 years ago by RamRS24k
6
gravatar for John Marshall
6.0 years ago by
John Marshall1.7k
Glasgow, Scotland
John Marshall1.7k wrote:

The @SQ headers that bwa emits come directly from the reference sequences in the index you're mapping against.

So the likely explanation for your problem is that the same entries are duplicated in your _contigs.fa_ file. You can check by looking at the > headers in your fasta file:

grep '>' contigs.fa | sort | uniq -c

(If there are header lines like >foo blah and >foo blah blah then these will lead to duplicate @SQ SN:foo SAM headers while not being shown as exact duplicate lines by that uniq command. So you may have to be a little careful in interpreting the results of that command.)

ADD COMMENTlink written 6.0 years ago by John Marshall1.7k
1

that was my first thought, something up with the contigs file. (full disclosure, I am involved in this project and had no problem running picard post BWA-mem when I was using the trinity.fasta file) These contigs are really ORFs that were generated from a subset of the Trinity contigs, using EMBOSS)

ADD REPLYlink modified 6.0 years ago • written 6.0 years ago by lzwright150
1

Right on target! I was working with a manually created contigs.fa file and I guess I extracted a couple of sequences twice from the source super-set. Silly me! Thank you so much, @John Marshall!

ADD REPLYlink written 6.0 years ago by RamRS24k

the "source super-set" I like that :)

ADD REPLYlink written 6.0 years ago by lzwright150
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1920 users visited in the last hour