Dear All,
I am using BAM files for ChIP-seq analysis. The chromosome notation in a usual BAM file looks like: chr1. In my file the chromosome notation is 1. Is there a way to change that in the BAM file? For my further analysis it is very important to change the notation.
Many thanks!
Greetz Lisanne
I know it works, but I would like to understand it a bit better, if possible. I tried the same without the awk and tr steps and it also works fine. The sed command adds the 'chr' in the right place. What exactly does awk do afterwards?
The 7th column (RNEXT) contains the name of the chromosome the mate is aligned to. You have to add the chr prefix there too, unless the value is "=" or "*".
OK, I get it. Somehow my BAM file has the chromosomes in the 3rd column.
In the 7th column I only have "*" or "=".
My problem was that it also added a 'chr' in the 7th column of my header lines.
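To make the column 3 / column 7 distinction concrete, here is a minimal sketch of an awk filter that works on SAM text (as printed by samtools view -h). It assumes tab-separated input, renames only the SN: field on @SQ lines, and leaves every other header line alone. The function name add_chr_prefix and the file names are hypothetical, and special cases such as MT versus chrM are deliberately not handled.

```shell
# Sketch: add "chr" to the chromosome fields of SAM text read from stdin.
# Column 3 is RNAME (chromosome of the read), column 7 is RNEXT (chromosome of the mate).
add_chr_prefix() {
  awk 'BEGIN { FS = OFS = "\t" }
       /^@SQ/ {                       # header: rename only the SN: field
         for (i = 2; i <= NF; i++)
           if ($i ~ /^SN:/) sub(/^SN:/, "SN:chr", $i)
         print; next
       }
       /^@/ { print; next }           # other header lines pass through untouched
       {
         if ($3 != "*") $3 = "chr" $3                # "*" means unmapped
         if ($7 != "*" && $7 != "=") $7 = "chr" $7   # "=" means same as column 3
         print
       }'
}

# Small demonstration on two hand-written lines (no BAM file needed).
printf '@SQ\tSN:1\tLN:249250621\nread1\t99\t1\t100\t60\t50M\t=\t300\t250\t*\t*\n' | add_chr_prefix

# Hypothetical file names; this part only runs if samtools and the input exist.
in_bam="input.bam"
out_bam="output.chr.bam"
if command -v samtools >/dev/null 2>&1 && [ -f "$in_bam" ]; then
  samtools view -h "$in_bam" | add_chr_prefix | samtools view -b -o "$out_bam" -
fi
```

Because header lines are matched first and skipped, nothing is appended to their 7th field.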
@Pierre This command works fine, but what if you want to save the changes, i.e. create a new .bam file with the new chr notation? I just put " > newfile.bam" at the end (after the final " in the tr command), but that doesn't seem to work. Am I overlooking something?
There's a command to replace the header in samtools ('samtools reheader'). See the docs.
@Pierre, could you please give me an example of how to replace the chromosome names with the sed or awk commands? I cannot find such an option in samtools view. Thank you!
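As a possible answer to the request above, here is a rough sketch of the header-only route with samtools reheader. It relies on the fact that BAM records refer to chromosomes by their position in the @SQ list, so renaming the @SQ entries is enough and the alignment lines do not have to be rewritten. The function name rename_sq_names and all file names are hypothetical.

```shell
# Sketch: prefix the SN: field of every @SQ header line with "chr".
rename_sq_names() {
  awk 'BEGIN { FS = OFS = "\t" }
       /^@SQ/ {
         for (i = 2; i <= NF; i++)
           if ($i ~ /^SN:/) sub(/^SN:/, "SN:chr", $i)
       }
       { print }'
}

# Hypothetical file names; this part only runs if samtools and the input exist.
in_bam="input.bam"
out_bam="output.chr.bam"
if command -v samtools >/dev/null 2>&1 && [ -f "$in_bam" ]; then
  samtools view -H "$in_bam" | rename_sq_names > new_header.sam   # edited header only
  samtools reheader new_header.sam "$in_bam" > "$out_bam"         # swap the header in
  samtools index "$out_bam"
fi
```

Since only the header is touched, this avoids converting the whole file to text and back.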
@Pierre, is there a concern about sequences that may not be the same in the aligned BAM and in the new reference FASTA she wants to use? Perhaps the alignment was made against a patched version or something else. In short, are we sure that after doing this a particular position is exactly the same in both references?
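One cheap way to probe this concern is to compare the @SQ lengths in the (renamed) BAM header with the .fai index of the new reference. The sketch below assumes a header dumped with samtools view -H and a FASTA index as written by samtools faidx (name in column 1, length in column 2); the function name compare_sq_lengths is hypothetical. Equal lengths do not prove identical sequence, so where M5 checksums are present in the header they are the stronger check.

```shell
# Sketch: report @SQ entries whose name or length disagrees with a FASTA index.
# usage: compare_sq_lengths header.sam reference.fa.fai
compare_sq_lengths() {
  awk 'BEGIN { FS = "\t"; bad = 0 }
       NR == FNR { len[$1] = $2; next }     # first file: the .fai index
       /^@SQ/ {                             # second file: the SAM header
         name = ""; ln = ""
         for (i = 2; i <= NF; i++) {
           if ($i ~ /^SN:/) name = substr($i, 4)
           if ($i ~ /^LN:/) ln = substr($i, 4)
         }
         if (!(name in len)) { print name ": missing from reference"; bad = 1 }
         else if (len[name] != ln) { print name ": length " ln " vs " len[name]; bad = 1 }
       }
       END { exit bad }' "$2" "$1"
}

# Hypothetical usage:
#   samtools view -H output.chr.bam > header.sam
#   compare_sq_lengths header.sam new_reference.fa.fai && echo "names and lengths agree"
```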
Quick note: not all sed versions can match a tab with \t ... pretty annoying, I agree ... in those cases one should rewrite the sed line in awk with gsub.
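To illustrate the tab issue, here is a small sketch with two portable ways to match a tab: passing a literal tab character to sed through a shell variable, and using gsub in awk, where \t is understood inside the regular expression. The pattern (a tab followed by SN: on @SQ lines) is only an illustrative stand-in, not the sed line from the original answer, and both function names are hypothetical.

```shell
# A literal tab character, for sed versions that do not understand \t.
tab=$(printf '\t')

# Variant 1: sed with the literal tab spliced into the expression.
sn_with_sed() {
  sed "/^@SQ/ s/${tab}SN:/${tab}SN:chr/"
}

# Variant 2: awk, where \t works in both the regex and the replacement string.
sn_with_awk() {
  awk '/^@SQ/ { gsub(/\tSN:/, "\tSN:chr") } { print }'
}

# Both variants should print the same line.
printf '@SQ\tSN:1\tLN:249250621\n' | sn_with_sed
printf '@SQ\tSN:1\tLN:249250621\n' | sn_with_awk
```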
@Pierre I tried the code, but something was not right. It did replace '1' with 'chr1', but afterwards I could not build the index for the BAM file. The error is "samtools index: file.bam is in a format that cannot be usefully indexed". I used samtools view -H to check the header:
@HD VN:1.3 SO:coordinate chr
@SQ SN:chr1 LN:249250621 AS:NCBI37 M5:1b22b98cdeb4a9304cb5d48026a85128 UR:file:/net/1000g/mktrost/seqshop/gotcloud/gotcloud.ref/human.g1k.v37.fa chr
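A stray field like the trailing "chr" above can be spotted mechanically. The sketch below assumes a tab-separated header and flags every header field that is not of the TAG:VALUE form the SAM format expects (free-text @CO lines are skipped). The function name check_sam_header is hypothetical, and this is only a quick sanity check, not a full validator.

```shell
# Sketch: flag header fields that do not look like TAG:VALUE.
check_sam_header() {
  awk 'BEGIN { FS = "\t"; bad = 0 }
       !/^@/  { exit }          # stop at the first alignment line
       /^@CO/ { next }          # comment lines are free text
       {
         for (i = 2; i <= NF; i++)
           if ($i !~ /^[A-Za-z][A-Za-z0-9]:./) {
             printf "line %d: unexpected field \"%s\"\n", NR, $i
             bad = 1
           }
       }
       END { exit bad }'
}

# Demonstration on a header line with a stray trailing field.
printf '@HD\tVN:1.3\tSO:coordinate\tchr\n' | check_sam_header || echo "header needs fixing"

# Hypothetical usage on a real file:
#   samtools view -H file.bam | check_sam_header
```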
I guess the issue comes from the awk line of the code: it checks whether the 7th column is "=" or "*" and adds "chr" if it is neither. As a result, the header lines also get a "chr" added in their 7th column. I don't know about the speed, but modifying the awk and tr lines as below could work (with OFS set to a tab, awk already writes tab-separated output):
awk -F '\t' 'BEGIN { OFS = "\t" } { if ($1 !~ /^@/ && $7 != "=" && $7 != "*") $7 = "chr" $7; print $0 }'
What is the output of the following command
?