Question: Bam File: Change Chromosome Notation
6
gravatar for Lisanne
6.7 years ago by
Lisanne100
Lisanne100 wrote:

Dear All,

I am using BAM files for chip-seq analysis. The chromosome notation in a usual BAM file is like: chr1. In my file the chromosomme notation is 1. Is there a way to change that in the BAM file? For further analysis it is very important to change the notation.

Many thanks!

Greetz Lisanne

bam samtools • 13k views
ADD COMMENTlink modified 3.4 years ago by petervangalen20 • written 6.7 years ago by Lisanne100
6
gravatar for Pierre Lindenbaum
6.6 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum108k wrote:

You can use samtools view to dump your data with the header : http://samtools.sourceforge.net/samtools.shtml and replace the chromosomes with sed or awk.

(not tested but it could be something like this:)

samtools view -h file.bam |\
sed -e '/^@SQ/s/SN\:/SN\:chr/' -e '/^[^@]/s/\t/\tchr/2'|\
awk -F ' ' '$7=($7=="=" || $7=="*"?$7:sprintf("chr%s",$7))' |\
tr " " "\t"

You could also use PICARD ReplaceSamHeader : http://picard.sourceforge.net/command-line-overview.shtml#ReplaceSamHeader

if you need to work with samtools or the gatk, another hack is to create a 'mock' faidx index to the reference genome. See my blog: http://plindenbaum.blogspot.com/2011/10/reference-genome-with-or-without-chr.html

ADD COMMENTlink modified 6.6 years ago • written 6.6 years ago by Pierre Lindenbaum108k

@Pierre, Could you please give me an example of how to replace the chromosome names with the sed or awk commands? Because in the samtools view i can not find the commands. Thanks you!

ADD REPLYlink written 6.6 years ago by Lisanne100

@Pierre, is there a concern about sequences that may not be the same in the aligned BAM and in the new ref fasta she wants to set? I mean, perhaps, alignment has been made with a patched version or something else... in a word are we sure by doing this that a particular position is exactly the same in both references ?

ADD REPLYlink written 6.6 years ago by toni2.1k

Thanks at @Pierre, that solution works for me! :)

ADD REPLYlink written 6.6 years ago by Lisanne100

quick note: not all sed versions can match a tab with t ... pretty annoying I agree ... in those case one should rewrite the sed line to awk with gsub

ADD REPLYlink written 6.6 years ago by Istvan Albert ♦♦ 77k

I know it works, but i would like to understand it a bit better, if possible. I tried the same without the awk and tr and it also works fine. the sed command add the 'chr' into the right place. What exactly does awk do afterwards?

ADD REPLYlink written 6.6 years ago by Assa Yeroslaviz1.1k

the 7th column contains the name of the chromosome. You have to add the chr prefix if this value is different from "=" or "*"

ADD REPLYlink written 6.6 years ago by Pierre Lindenbaum108k

ok. I get it. somehow my bam file has the chromosomes on the 3rd. column: HWI-EAS225_0031_FC:2:1:19139:1546#0 99 2R...
HWI-EAS225_0031_FC:2:1:19139:1546#0 147 2R... On the 7th column I have only [*|=] My problem was that it added in my header on the 7th column also a 'chr'.

ADD REPLYlink written 6.6 years ago by Assa Yeroslaviz1.1k

@Pierre This command works fine, but what if you want to save the changes so that you create a new .bam file with the new chr notations? I just putted " > newfile.bam" at the end (after the final " in the tr command) but that doesn't seem to work. Am I overlooking something?

ADD REPLYlink written 5.8 years ago by User6891200

there's a command 'replace header' for samtools. See the doc.

ADD REPLYlink written 5.8 years ago by Pierre Lindenbaum108k

Could you give some more information please about what to do? Because I'm not quite understanding what you mean.

I do understand what sed, awk and tr are doing, but i just don't know how to save this output into a new .bam file with which I can go further.

ADD REPLYlink written 5.8 years ago by User6891200

you should ask this as another question

ADD REPLYlink written 5.8 years ago by Istvan Albert ♦♦ 77k
2
gravatar for petervangalen
3.4 years ago by
United States
petervangalen20 wrote:

The following line works well for me. It will replace the names of chromosome 1 to 22, X, Y and MT (mitochondrial DNA). It will create a new file with the name test_chr.bam instead of test.bam

use Samtools
samtools view -H test.bam | sed -e 's/SN:1/SN:chr1/' | sed -e 's/SN:2/SN:chr2/' | sed -e 's/SN:3/SN:chr3/' | sed -e 's/SN:4/SN:chr4/' | sed -e 's/SN:5/SN:chr5/' | sed -e 's/SN:6/SN:chr6/' | sed -e 's/SN:7/SN:chr7/' | sed -e 's/SN:8/SN:chr8/' | sed -e 's/SN:9/SN:chr9/' | sed -e 's/SN:10/SN:chr10/' | sed -e 's/SN:11/SN:chr11/' | sed -e 's/SN:12/SN:chr12/' | sed -e 's/SN:13/SN:chr13/' | sed -e 's/SN:14/SN:chr14/' | sed -e 's/SN:15/SN:chr15/' | sed -e 's/SN:16/SN:chr16/' | sed -e 's/SN:17/SN:chr17/' | sed -e 's/SN:18/SN:chr18/' | sed -e 's/SN:19/SN:chr19/' | sed -e 's/SN:20/SN:chr20/' | sed -e 's/SN:21/SN:chr21/' | sed -e 's/SN:22/SN:chr22/' | sed -e 's/SN:X/SN:chrX/' | sed -e 's/SN:Y/SN:chrY/' | sed -e 's/SN:MT/SN:chrM/' | samtools reheader - test.bam > test_chr.bam

If you want to do this for a bunch of bam files at once, use a loop. The variable $filename is your file without the extension (assuming there is no period except the one in ".bam").

for file in *.bam; do filename=`echo $file | cut -d "." -f 1`; samtools view -H $file | sed -e 's/SN:1/SN:chr1/' | sed -e 's/SN:2/SN:chr2/' | sed -e 's/SN:3/SN:chr3/' | sed -e 's/SN:4/SN:chr4/' | sed -e 's/SN:5/SN:chr5/' | sed -e 's/SN:6/SN:chr6/' | sed -e 's/SN:7/SN:chr7/' | sed -e 's/SN:8/SN:chr8/' | sed -e 's/SN:9/SN:chr9/' | sed -e 's/SN:10/SN:chr10/' | sed -e 's/SN:11/SN:chr11/' | sed -e 's/SN:12/SN:chr12/' | sed -e 's/SN:13/SN:chr13/' | sed -e 's/SN:14/SN:chr14/' | sed -e 's/SN:15/SN:chr15/' | sed -e 's/SN:16/SN:chr16/' | sed -e 's/SN:17/SN:chr17/' | sed -e 's/SN:18/SN:chr18/' | sed -e 's/SN:19/SN:chr19/' | sed -e 's/SN:20/SN:chr20/' | sed -e 's/SN:21/SN:chr21/' | sed -e 's/SN:22/SN:chr22/' | sed -e 's/SN:X/SN:chrX/' | sed -e 's/SN:Y/SN:chrY/' | sed -e 's/SN:MT/SN:chrM/' | samtools reheader - $file > ${filename}_chr.bam; done

 

 

ADD COMMENTlink modified 3.4 years ago • written 3.4 years ago by petervangalen20
2
or just
 sed  -e 's/SN:\([0-9XY]*\)/SN:chr\1/' -e 's/SN:MT/SN:chrM/'
ADD REPLYlink written 3.4 years ago by Pierre Lindenbaum108k

Thanks that is much more elegant. With an additional slight improvement my line is now:

for file in *.bam; do filename=`echo $file | cut -d "." -f 1`; samtools view -H $file | sed -e 's/SN:\([0-9XY]\)/SN:chr\1/' -e 's/SN:MT/SN:chrM/' | samtools reheader - $file > ${filename}_chr.bam; done
ADD REPLYlink modified 3.4 years ago • written 3.4 years ago by petervangalen20

Hi petervangalen and Pierre,

I also downloaded BAM files, where prefix "chr" is missing.

Is it necessary to add prefix "chr" only in header or also in each line of BAM/SAM ?

For eg:

samtools view -H HDG04N.exome.bam | sed -e 's/SN:\([0-9XY]*\)/SN:chr\1/' -e 's/SN:MT/SN:chrM/'  | head -5

@SQ    SN:chr1    LN:249250621
@SQ    SN:chr2    LN:243199373
@SQ    SN:chr3    LN:198022430
@SQ    SN:chr4    LN:191154276
@SQ    SN:chr5    LN:180915260

 

 

samtools view HDG04N.exome.bam  | awk 'BEGIN {OFS="\t"} { print $1,$2,"chr"$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15 }' | head -4
HWI-ST1033:90:C0K0PACXX:8:2201:8091:30260    163    chr1    10004    8    100M    =    10048    144    CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT    CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJJJJJJIJJJJJGHIIII@FHIIIGIIHGCCFDDFFAECCDB@C?ABDBDCA?BDDC?BDDDDDDDD    NM:i:0    AS:i:100    XS:i:100    RG:Z:1
HWI-ST1033:90:C0K0PACXX:8:2201:8091:30260    83    chr1    10048    8    100M    =    10004    -144    CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC    8883DDB<<9DCA<8+BA<2<5DCA?=5DCA=>.CEB@;)HHHA=)IHGC@.IHCCB6IHDD?JJHGD:JIHFEAJJIHGEJJIGHGHHHGHDFFFFBBB    NM:i:0    AS:i:100    XS:i:93    RG:Z:1
HWI-ST1033:90:C0K0PACXX:8:2302:17841:157944    163    chr1    10332    40    21M1I74M    =    10365    135    CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCCTAACCCTAACCCTAACCCTAACCCTA    CCCFFFFFHHHGHJJJJJJJJJJJJJJJJJJJJJJJJIJJJJGGIJJIIJEGJCGGIJIIGHGEDDBDCDD@ABBAB<?BBB???BD@BCC?AABC    NM:i:1    AS:i:88    XS:i:88    RG:Z:1
HWI-ST1033:90:C0K0PACXX:8:1304:9484:179936    163    chr1    10343    40    10M1I90M    =    10358    117    CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCCTAACCCTAACCCTAACCCTAACACTAACCCTAACCCTAACCC    CCCFFFFFHHHHHIIGJJJJIGJJIJIIJJIICHHFHFFGHIJIHCGHIICGGGGG;==ECEFDBDEEAAC@CCD2=<ACB(:>A3@8A?CC8<AACC@?<    NM:i:2    AS:i:88    XS:i:88    RG:Z:1

 

Concatenate the output from both files as SAM and convert to BAM. Or simply converting it only in header is fine.

Thanks !

 

 

ADD REPLYlink written 2.8 years ago by Chirag Nepal2.1k

Perfectly worked for me. Thanks @Pierre Lindenbaum and @petervangalen

ADD REPLYlink written 13 months ago by Rashedul Islam220

@Petervangalen if I had to use the above command for a vcf file then is the below command correct | sed -e 's/SN:1/SN:chr1/' | sed -e 's/SN:2/SN:chr2/' | sed -e 's/SN:3/SN:chr3/' | sed -e 's/SN:4/SN:chr4/' | sed -e 's/SN:5/SN:chr5/' | sed -e 's/SN:6/SN:chr6/' | sed -e 's/SN:7/SN:chr7/' | sed -e 's/SN:8/SN:chr8/' | sed -e 's/SN:9/SN:chr9/' | sed -e 's/SN:10/SN:chr10/' | sed -e 's/SN:11/SN:chr11/' | sed -e 's/SN:12/SN:chr12/' | sed -e 's/SN:13/SN:chr13/' | sed -e 's/SN:14/SN:chr14/' | sed -e 's/SN:15/SN:chr15/' | sed -e 's/SN:16/SN:chr16/' | sed -e 's/SN:17/SN:chr17/' | sed -e 's/SN:18/SN:chr18/' | sed -e 's/SN:19/SN:chr19/' | sed -e 's/SN:20/SN:chr20/' | sed -e 's/SN:21/SN:chr21/' | sed -e 's/SN:22/SN:chr22/' | sed -e 's/SN:X/SN:chrX/' | sed -e 's/SN:Y/SN:chrY/' | sed -e 's/SN:MT/SN:chrM/' | input.vcf > output.vcf

ADD REPLYlink written 17 months ago by rutujagtap0810

Running this command is taking a longer time than expected more than an hour. Is this true or there must be some mistake in the command?

ADD REPLYlink written 17 months ago by rutujagtap0810
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1675 users visited in the last hour