Edit and re-head BAM file
1
1
Entering edit mode
11 months ago
Matteo Ungaro ▴ 100

Hi there I have a BAM file which needs to be edited and re-headed. Now, I'm aware of how to do so the problem is that for some reason the sed command I'm using does not catch the sequence I have to remove... Below, an example of the BAM header

@HD     VN:1.5  SO:coordinate
@SQ     SN:chr12:0-133265294#0  LN:133265294
@SQ     SN:chr12:0-17769809#0   LN:17769809
@SQ     SN:chr12:0-34719407#0   LN:34719407
@SQ     SN:chr12:0-86846879#0   LN:86846879
@SQ     SN:chr12:17859256-133275309#0   LN:115416053
@SQ     SN:chr12:37246182-133275309#0   LN:96029127
@SQ     SN:chr12:86858727-133275309#0   LN:46416582

As you can see, the @SQ lines are in the format of SN:<chr_number>:<region>#0 what I really need is to get rid of the :<region>#0 to run an evaluation against the GIAB benchmark set, which uses the chr_number format. What I've been running is the following

samtools view -H HG002_chr.bam | \
   sed -E 's/:[[:digit:]]+-[[:digit:]]+#0//' | \
   samtools reheader - HG002_chr.bam > test.bam

Specifically, for some reason, I'm getting an error related to the header containing duplicate sequences. Does anyone knows a way around it, granted the command I'm running is the correct one? Thanks in advance!

samtools BAM • 1.1k views
ADD COMMENT
0
Entering edit mode

if you are running against GIAB then perhaps what you need to change is the VCF file chromosome naming, and you can do that in multiple ways

use the current BAM file to make the VCF then rename the chromosomes in the VCF

ADD REPLY
0
Entering edit mode

Thanks, I appreciate the suggestion. Indeed, this is what I did at first; however, hap.py then complains about regions in the VCF pointing to contigs in the reference which have the same header because they are fragments of the original input reference. This is the result of extracting the reference from a pangenome graph and, unfortunately, there is not really a way around it; it doesn't preserve contigs names and identity and when I rename them to match the format chr_number this is the result.

Alternatively, I could edit the GIAB truth set VCF but this would require specific knowledge of what-is-what in the two VCF file for making calls against the same exact regions to benchmark the effectiveness of the alignment tools used.

ADD REPLY
1
Entering edit mode
11 months ago

I'm getting an error related to the header containing duplicate sequences

SN:chr12:0-133265294#0 SN:chr12:0-17769809#0 etc... will be all replaced by chr12 ..

ADD COMMENT
0
Entering edit mode

Yes that should be the intended results, but it not possible to get a meaningful file. In fact, starting from a ~60GB BAM after the process fails I get a 28KByte file. Although this might be something to prevent duplicated headers in a BAM file, I wonder whether there is any way to force that/around. Thanks again.

ADD REPLY
1
Entering edit mode

No, the specification says:

The SN tags and all individual AN names in all @SQ lines must be distinct.

So multiple chr12 names are not going to work. Also, the alignments will all have the old reference names so it is not going to be a valid bam file anyway.

ADD REPLY
0
Entering edit mode

I see, thanks both aw7 and @Pierre Lindenbaum. I wanted confirmation whether there was any remote chance it would work just to make things a bit faster... I will need to regenerate the BAM so that each aligned sequence is unique.

ADD REPLY

Login before adding your answer.

Traffic: 1978 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6