Question: Edit every instance of a chromosome name in BAM file?
0
gravatar for edsouza
3.2 years ago by
edsouza30
edsouza30 wrote:

Hi.

I performed an RNAseq analysis using RSEM on a computer cluster. It made an output BAM file, which I downloaded and tried to view in the Genome Viewer so I could verify whether the expression that it mapped really maps to the established genes. However, when I imported my BAM file, it gave an error, since my organism's default genome expects chromosomes in the form "chr1, chr2, chr3" whereas my BAM file has "I, II, III".

I tried editing the header file with Samtools using this command:

samtools view -H file.bam | sed -e 's/SN:III/SN:chr3/' -e 's/SN:II/SN:chr2/' -e 's/SN:I/SN:chr1/' | samtools reheader - file.bam > newfile.bam

However, it still didn't work, giving me the same error. I converted it to SAM and opened it up and basically found that while the header files contain my new naming convention, the rest of the file still uses Roman numerals. Is there a simple way that I can rename all instances of the old chromosome naming convention?

Thanks.

ADD COMMENTlink modified 3.2 years ago by igor8.1k • written 3.2 years ago by edsouza30

Because you used -H option for samtools view only headers got converted.

ADD REPLYlink written 3.2 years ago by genomax70k

Mmhh I think what the OP has done should work. S/he used -H to edit only the header of the original bam file and the edited header sent to samtools reheader.

ADD REPLYlink written 3.2 years ago by dariober10k

But that left the identifiers in the rest of the file as they were. @edsouza was trying to change them in the entire file.

ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by genomax70k

Right. Is there an option to sed all the instances of Roman Numerals throughout the body of the file too?

ADD REPLYlink written 3.2 years ago by edsouza30

There are no identifiers in the rest of the file. The only identifiers are in the header.

ADD REPLYlink written 3.2 years ago by Devon Ryan91k

Hi Devon,

That's what I first thought. However, after doing a head for the first 500 lines of the original file, I got data in this form:

@HD     VN:1.4  SO:unknown
@PG     ID:RSEM
@CO     This BAM file is processed by rsem-tbam2gam to convert from transcript coordinates into genomic coordinates.
@SQ     SN:I    LN:5579133
@SQ     SN:II   LN:4539804
@SQ     SN:III  LN:2452883
@SQ     SN:MT   LN:19431
[Some omitted info over here]
NS500730:210:HC2NGAFXX:1:11101:2888:1102        163     III     89997   100     75M     =       90072   150     CCATCATCAATAACATGATCATTGGCGTTACCCAAGGTTTCCGCTACAAGATGCGTCTTGTCTATGCTCACTTTC     AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE<EEEEA     XA:i:0  MD:Z:75 NM:i:0  ZW:f:1
NS500730:210:HC2NGAFXX:1:11101:18169:1103       83      II      2924015 100     64M154N11M      =   2923909     -181
GGTAAATTTGAATATAAAAGCAATGAGATATCTTTCTGCCGAGGATTTTCGCACTCTGACAGCTGTAGAAATGGG     AEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAAA     XA:i:0  MD:Z:75 NM:i:0  ZW:f:1  XS:A:+

The column that says "II" or "III" is a reference to the chromosome, and appears in more than just the header.

ADD REPLYlink written 3.2 years ago by edsouza30

You didn't modify the original file, you made a new file with the modified header. The SAM output is distinct from the underlying BAM file. In the BAM file, the alignments each have an index into the header saying which chromosome they're on. When that's printed out as a SAM file, the associated chromosome name is then used (the same goes for = in the 7th column).

ADD REPLYlink written 3.2 years ago by Devon Ryan91k

Ah! It seems that I'd been confusing the BAM file with the unpacked SAM file. Meaning, when you unpack BAM to SAM, all that matters is the BAM header; is this correct? I'm going to re-download the edited bam file and see if it works in IGV.

ADD REPLYlink written 3.2 years ago by edsouza30

Correct, all that matters in BAM is the header. That's one of the reasons that samtools reheader can be as quick as it is (and also why BAM files aren't larger).

ADD REPLYlink written 3.2 years ago by Devon Ryan91k

BTW, there's apparently now an -i option that can be used to tell samtools reheader to modify a file in place (it's a new option to me, though perhaps I'd just never noticed it before).

ADD REPLYlink written 3.2 years ago by Devon Ryan91k

duplicate of Bam File: Change Chromosome Notation

ADD REPLYlink written 3.2 years ago by Pierre Lindenbaum122k

Hi, that's actually the link I based my script off of. However, like @genomax2 said, the identifiers in the rest of the file have stayed the same.

ADD REPLYlink written 3.2 years ago by edsouza30

You used -H instead of -h with samtools view (@Pierre's example in other thread is the correct usage). Small difference but a very different result.

ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by genomax70k

My guess is that samtools reheader isn't dealing nicely with the pipe. Write the modified header to a file and then try reheader. This will also allow you to double check that you got everything with sed.

ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by Devon Ryan91k
0
gravatar for igor
3.2 years ago by
igor8.1k
United States
igor8.1k wrote:

You can use CrossMap to convert between builds: http://crossmap.sourceforge.net/

You'll have to make a custom chain file, but that should be fairly easy if you look at the examples like GRCh37ToHg19.over.chain.gz where only chromosome names get changed.

ADD COMMENTlink written 3.2 years ago by igor8.1k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1079 users visited in the last hour