Question: pysam duplicated sequence error
3.2 years ago, xcalle91 (European Union) wrote:

Hi,

I did an alignment with STAR and got a SAM file, which I can open easily from the terminal:

samtools view  -S STAR2Aligned.out.sam 

Now I have to work with it in Python, so I import the pysam module and try to open the file with:

file = pysam.AlignmentFile("/home/lpp/Desktop/Star_Results/STAR2Aligned.out.sam", "r")

It prints the following error:

[W::sam_hdr_parse] duplicated sequence 'NODE_4_length_21_cov_1.000000'
[W::sam_hdr_parse] duplicated sequence 'NODE_18_length_23_cov_1.000000'

 

I looked for this error in different forums, but the thread most similar to my problem has no answer:

http://seqanswers.com/forums/showthread.php?t=58219

Does anybody know where this error is coming from?

Thanks in advance!

 


If you run samtools view -H /home/lpp/Desktop/Star_Results/STAR2Aligned.out.sam | grep NODE_4_length_21_cov_1.000000, do you get more than one line?

written 3.2 years ago by Devon Ryan
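
The same check can be done from Python without going through samtools. The following is only a minimal sketch: it assumes the SAM file is plain text with tab-separated header fields, and the function name duplicated_sq_names and the second contig name in the toy header are made up for illustration.

```python
from collections import Counter


def duplicated_sq_names(sam_lines):
    """Return {name: count} for @SQ names that occur more than once."""
    names = []
    for line in sam_lines:
        if not line.startswith("@"):
            break  # header lines come first; stop at the first alignment record
        if line.startswith("@SQ"):
            for field in line.rstrip("\n").split("\t")[1:]:
                if field.startswith("SN:"):
                    names.append(field[3:])
    counts = Counter(names)
    return {name: n for name, n in counts.items() if n > 1}


# Toy header: the first contig name is from this thread, the second is hypothetical.
header = [
    "@HD\tVN:1.4\n",
    "@SQ\tSN:NODE_4_length_21_cov_1.000000\tLN:41\n",
    "@SQ\tSN:NODE_4_length_21_cov_1.000000\tLN:41\n",
    "@SQ\tSN:hypothetical_contig\tLN:50\n",
]
print(duplicated_sq_names(header))
```

On a real file, pass the open file handle instead of the list, e.g. with open(path) as fh: duplicated_sq_names(fh).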

It says:

[bam_header_read] EOF marker is absent. The input is probably truncated.
[bam_header_read] invalid BAM binary header (this is not a BAM file).

written 3.2 years ago by xcalle91

Oh, there should be an -S flag given to samtools as well, mea culpa.

written 3.2 years ago by Devon Ryan

Thanks,

Now it says:

[samopen] SAM header is present: 1457 sequences.
@SQ    SN:NODE_4_length_21_cov_1.000000    LN:41
@SQ    SN:NODE_4_length_21_cov_1.000000    LN:41

written 3.2 years ago by xcalle91

I should add that you likely have multiple contigs with the same name in your reference genome. This will simply not work, though STAR won't complain. Its output, however, will be broken unless the duplicately-named entries also have duplicate sequence, and even then the MAPQ values and such will be wrong.

modified 3.2 years ago • written 3.2 years ago by Devon Ryan
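
To check the reference itself, something like the sketch below could list contig names that occur more than once in the FASTA file and show whether the repeated entries carry the same sequence. It is an illustration under assumptions: the contig name is taken to be the header text up to the first whitespace, and the function name and the toy records are hypothetical.

```python
from collections import defaultdict


def duplicate_fasta_names(fasta_lines):
    """Map each repeated contig name to the list of sequences stored under it."""
    seqs = defaultdict(list)
    name = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # assumption: name ends at first whitespace
            seqs[name].append("")
        elif name is not None:
            seqs[name][-1] += line  # join multi-line sequences
    return {n: s for n, s in seqs.items() if len(s) > 1}


# Toy reference with made-up sequences, not data from this thread.
fasta = [
    ">contig_A\n", "ACGT\n", "ACGT\n",
    ">contig_B\n", "GGGG\n",
    ">contig_A\n", "ACGTACGT\n",
]
for name, sequences in duplicate_fasta_names(fasta).items():
    same = len(set(sequences)) == 1
    print(name, len(sequences), "identical" if same else "different")
```

Entries reported as different are the case where the alignments cannot be trusted at all; entries reported as identical are the milder case described above.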
3.2 years ago, Devon Ryan (Freiburg, Germany) wrote:

Samtools is correct: you have a duplicate entry in the header. I suspect that you actually have the same sequence twice, but you might want to double-check that. You can use the -t option with samtools view and give it a two-column file with the deduplicated contig names and lengths, and then you'll be able to get a BAM file. Alternatively, remove the duplicate FASTA sequences and remap. I suspect that the former method will be both faster and easier.

written 3.2 years ago by Devon Ryan
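
The two-column file mentioned in this answer could be generated from the existing SAM header. This is a rough sketch, not a tested recipe: it assumes every @SQ line has SN and LN fields, keeps the first occurrence of each name, and refuses to continue if a repeated name has conflicting lengths. The function names and the second contig in the toy header are hypothetical, and how the resulting file is passed to samtools follows the answer above and is not shown here.

```python
def contig_table(sam_lines):
    """Collect {name: length} from @SQ header lines, dropping repeated names."""
    table = {}
    for line in sam_lines:
        if not line.startswith("@"):
            break  # end of header
        if not line.startswith("@SQ"):
            continue
        fields = dict(
            f.split(":", 1) for f in line.rstrip("\n").split("\t")[1:] if ":" in f
        )
        name, length = fields["SN"], int(fields["LN"])
        if name in table and table[name] != length:
            raise ValueError(f"{name} appears with lengths {table[name]} and {length}")
        table.setdefault(name, length)  # keep the first occurrence only
    return table


def format_table(table):
    """Render the table as tab-separated 'name<TAB>length' lines."""
    return "".join(f"{name}\t{length}\n" for name, length in table.items())


# Toy header: the first contig name is from this thread, the second is hypothetical.
header = [
    "@SQ\tSN:NODE_4_length_21_cov_1.000000\tLN:41\n",
    "@SQ\tSN:NODE_4_length_21_cov_1.000000\tLN:41\n",
    "@SQ\tSN:hypothetical_contig\tLN:50\n",
]
print(format_table(contig_table(header)), end="")
```

Writing the string returned by format_table to a file gives one line per unique contig name. The ValueError is a deliberate stop: if two entries share a name but not a length, they are different sequences and remapping against a fixed reference is the safer route.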

Thanks Devon Ryan, that was the problem. I checked the contigs and there were two duplicates.

written 3.2 years ago by xcalle91