Change chromosome notation in bam file
4
0
Entering edit mode
8 months ago
Sara • 0

Hello,

I need to re-format a bam file, specifically to change the chromosome notation from gene__chr[1-9XYM] to chr[1-9XYM]. What I have is the following:

@HD VN:1.6  SO:coordinate
@SQ SN:A1BG-AS1__chr19  LN:2134

and

NB500901:267:HH2TMBGXG:1:21311:5296:17191:CAAGGTGT:TGCGCC:146   16  A1BG-AS1__chr19 3   9   22S31M7S    *   0   0   TTTTGTTTTGTTTTGTTTTGTTTTTTTAGTAGAGACGGGGTTTCGTCATGTTGCTCAGGC    EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEAAEEEEEEEEEEEEAAAAAA    NM:i:0  MD:Z:31 AS:i:31 XS:i:28 NH:i:1  CB:Z:CAAGGTGT   UB:Z:TGCGCC

My desired output is something like:

@HD VN:1.6  SO:coordinate
@SQ SN:chr19    LN:2134

and

NB500901:267:HH2TMBGXG:1:21311:5296:17191:CAAGGTGT:TGCGCC:146   16  chr19   3   9   22S31M7S    *   0   0   TTTTGTTTTGTTTTGTTTTGTTTTTTTAGTAGAGACGGGGTTTCGTCATGTTGCTCAGGC    EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEAAEEEEEEEEEEEEAAAAAA    NM:i:0  MD:Z:31 AS:i:31 XS:i:28 NH:i:1  CB:Z:CAAGGTGT   UB:Z:TGCGCC

I've tried to do it with a combination of samtools and awk, like this:

samtools view -h input.bam | awk '{split($3,tmp,"__");if($0 ~ /^@/){$2="\tSN:"tmp[2]; print $0}else{$3="\t"tmp[2]}; print $0}}' | samtools view -Shb - > output.bam

However, I continue to get errors. I would highly appreciate any comment/suggestion!

Thanks a lot :)

file bam notation • 691 views
ADD COMMENT
1
Entering edit mode
8 months ago

You should be able to remove it with sed samtools view -h input.bam | sed -r 's/[A-Za-z0-9\-]+__//g'. Someone might have to correct my sed regex though, it has a few quirks I can never remember off the top of my head.

ADD COMMENT
1
Entering edit mode
8 months ago

I think the trick is in the bam, the chromosomes name isn't actually represented in the read data, it just points to the header. So try changing only the header and convert to and from bam format. That might work.

Otherwise, you'll need to say what errors you actually got. Saying "I got errors" isn't helpful.

ADD COMMENT
0
Entering edit mode

Thank you for your comments. I'm sorry for that, the error I get is:

[main_samview] fail to read the header from "input.bam".
[main_samview] fail to read the header from "-".

I've tried rpolicastro 's suggestion and it actually works when I run the command and only print head, like:

samtools view -h input.bam | sed -r 's/[A-Za-z0-9\-]+__//g' | head

I get

@HD VN:1.6  SO:coordinate
@SQ SN:chr19    LN:2134

But then, I tried to save it as a bam: samtools view -h input.bam | sed -r 's/[A-Za-z0-9\-]+__//g' | samtools view -Shb - > output.bam, and it says there're duplicates sequences "chr[1-9XYM]" and "samtools view: failed to add PG line to the header"

ADD REPLY
1
Entering edit mode
8 months ago
cmdcolin ★ 2.5k

This is what samtools reheader is made to do http://www.htslib.org/doc/samtools-reheader.html

See also other threads on the biostars forum about this

The basic idea is:

samtools view -H file.bam > header.txt
cat header.txt | yourawkcommandhere > newheader.txt
samtools reheader newheader.txt file.bam > newfile.bam

You can also manually edit header.txt with a text editor instead of fancy awk commands too

Using samtools reheader will be much faster than doing sed on the entire SAM file also :)

ADD COMMENT
0

Login before adding your answer.

Traffic: 2923 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6