Edit every instance of a chromosome name in BAM file?
1
0
Entering edit mode
7.8 years ago
edsouza ▴ 30

Hi.

I performed an RNAseq analysis using RSEM on a computer cluster. It made an output BAM file, which I downloaded and tried to view in the Genome Viewer so I could verify whether the expression that it mapped really maps to the established genes. However, when I imported my BAM file, it gave an error, since my organism's default genome expects chromosomes in the form "chr1, chr2, chr3" whereas my BAM file has "I, II, III".

I tried editing the header file with Samtools using this command:

samtools view -H file.bam | sed -e 's/SN:III/SN:chr3/' -e 's/SN:II/SN:chr2/' -e 's/SN:I/SN:chr1/' | samtools reheader - file.bam > newfile.bam

However, it still didn't work, giving me the same error. I converted it to SAM and opened it up and basically found that while the header files contain my new naming convention, the rest of the file still uses Roman numerals. Is there a simple way that I can rename all instances of the old chromosome naming convention?

Thanks.

sequence bam samtools RNA-Seq genome • 6.1k views
ADD COMMENT
0
Entering edit mode

Because you used -H option for samtools view only headers got converted.

ADD REPLY
0
Entering edit mode

Mmhh I think what the OP has done should work. S/he used -H to edit only the header of the original bam file and the edited header sent to samtools reheader.

ADD REPLY
0
Entering edit mode

But that left the identifiers in the rest of the file as they were. @edsouza was trying to change them in the entire file.

ADD REPLY
0
Entering edit mode

Right. Is there an option to sed all the instances of Roman Numerals throughout the body of the file too?

ADD REPLY
0
Entering edit mode

There are no identifiers in the rest of the file. The only identifiers are in the header.

ADD REPLY
0
Entering edit mode

Hi Devon,

That's what I first thought. However, after doing a head for the first 500 lines of the original file, I got data in this form:

@HD     VN:1.4  SO:unknown
@PG     ID:RSEM
@CO     This BAM file is processed by rsem-tbam2gam to convert from transcript coordinates into genomic coordinates.
@SQ     SN:I    LN:5579133
@SQ     SN:II   LN:4539804
@SQ     SN:III  LN:2452883
@SQ     SN:MT   LN:19431
[Some omitted info over here]
NS500730:210:HC2NGAFXX:1:11101:2888:1102        163     III     89997   100     75M     =       90072   150     CCATCATCAATAACATGATCATTGGCGTTACCCAAGGTTTCCGCTACAAGATGCGTCTTGTCTATGCTCACTTTC     AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE<EEEEA     XA:i:0  MD:Z:75 NM:i:0  ZW:f:1
NS500730:210:HC2NGAFXX:1:11101:18169:1103       83      II      2924015 100     64M154N11M      =   2923909     -181
GGTAAATTTGAATATAAAAGCAATGAGATATCTTTCTGCCGAGGATTTTCGCACTCTGACAGCTGTAGAAATGGG     AEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAAA     XA:i:0  MD:Z:75 NM:i:0  ZW:f:1  XS:A:+

The column that says "II" or "III" is a reference to the chromosome, and appears in more than just the header.

ADD REPLY
0
Entering edit mode

You didn't modify the original file, you made a new file with the modified header. The SAM output is distinct from the underlying BAM file. In the BAM file, the alignments each have an index into the header saying which chromosome they're on. When that's printed out as a SAM file, the associated chromosome name is then used (the same goes for = in the 7th column).

ADD REPLY
0
Entering edit mode

Ah! It seems that I'd been confusing the BAM file with the unpacked SAM file. Meaning, when you unpack BAM to SAM, all that matters is the BAM header; is this correct? I'm going to re-download the edited bam file and see if it works in IGV.

ADD REPLY
0
Entering edit mode

Correct, all that matters in BAM is the header. That's one of the reasons that samtools reheader can be as quick as it is (and also why BAM files aren't larger).

ADD REPLY
0
Entering edit mode

BTW, there's apparently now an -i option that can be used to tell samtools reheader to modify a file in place (it's a new option to me, though perhaps I'd just never noticed it before).

ADD REPLY
0
Entering edit mode
ADD REPLY
0
Entering edit mode

Hi, that's actually the link I based my script off of. However, like @genomax2 said, the identifiers in the rest of the file have stayed the same.

ADD REPLY
0
Entering edit mode

You used -H instead of -h with samtools view (@Pierre's example in other thread is the correct usage). Small difference but a very different result.

ADD REPLY
0
Entering edit mode

My guess is that samtools reheader isn't dealing nicely with the pipe. Write the modified header to a file and then try reheader. This will also allow you to double check that you got everything with sed.

ADD REPLY
0
Entering edit mode
7.8 years ago
igor 13k

You can use CrossMap to convert between builds: http://crossmap.sourceforge.net/

You'll have to make a custom chain file, but that should be fairly easy if you look at the examples like GRCh37ToHg19.over.chain.gz where only chromosome names get changed.

ADD COMMENT

Login before adding your answer.

Traffic: 2742 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6