Reheadering a subsetted bam gives truncated file
2
0
Entering edit mode
4.4 years ago
Luciferine • 0

Hi all, I have a problem with reheadering a bam file which I subsetted before. I mapped against a merged genome of human and mouse (since it is a patient derived xenograft sample) and want to remove residual mouse reads. I did this with samtools (version 1.8) by simply subsetting for the human chromosomes.

samtools view -b mybam.bam chrM chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chrX chrY > no-mouse.bam


I want to remove the mouse chromosomes from the header by removing the respective lines from the header with sed (there are also some ERCCs etc. in the reference, which is why I have to remove so many lines) and reheadering the bam afterwards.

samtools view -H  no-mouse.bam | sed '27,253d' | samtools reheader - no-mouse.bam > no-mouse.rehead.bam


I end up with a truncated bam file, if I want to look at e.g. the end of the file with

samtools view no-mouse.rehead.bam | tail


I get [main_samview] truncated file.Following steps (e.g. AddOrReplaceReadGroups) give errors as well.

I've seen some posts on similar topics but din't find any helpful solution unfortunately.

Can somone help me with that? Thanks a lot!

samtools reheader header bam • 1.8k views
0
Entering edit mode

I mapped against a merged genome of human and mouse

Do you have a reference that this is good practice and reliable? I could imagine that because of homologies between the species, you'll get plenty of multimappers.

3
Entering edit mode

You can probably find a reference in the allele-specific literature, though we also use this strategy in the WGBS world.

In general, if you know your sample is a composite of a couple organisms then concatenating them minimizes the bias.

0
Entering edit mode

Ok, never heard of this strategy. Thanks!

0
Entering edit mode

Here some other things that didn't do the job either:

• Converting to sam, changing header, reconverting to bam
• Sorting the input bam again after subsetting (samtools sort)
• ReorderSam to the human-only reference, this only works if chromosomes in header are the same as in reference.

I get the following error when I start any Picard tools (e.g. ReorderSam, AddOrReplaceReadGroups), maybe this might help to find the problem:

Exception in thread "main" java.lang.IllegalArgumentException: Reference name for '195' not found in sequence dictionary.


The numbers vary in different files, I also got '185' and others.

Hope somebody can help me! Thanks in advance!

0
Entering edit mode

Are you sure that your header order of chromosomes really is the same order as no-mouse.bam?

0
Entering edit mode

That's a good hint, thanks! If I look at the bam file the first reads I get are from chr1, whereas the header starts with chrM. But I don't really know how to fix this. I'll try ReorderSam on the subsetted file with the old header, maybe that'll work (reordering the subsetted file with the new header doesn't).

1
Entering edit mode

See my cat solution in the comment below, it should work without needing to use ReorderSam.

1
Entering edit mode
4.4 years ago

The thing with reheadering files is that you have to be absolutely sure that you're only removing chromosomes from the end of the header and never from the beginning, otherwise you'll end up borking the file. My guess is that your sed command is wrong, so write the header to a file and manually look at it.

0
Entering edit mode

I tried saving the header separately, removing the lines I don't want (which are at the end of the header) and reheadering the file. It still gives me a truncated file.

0
Entering edit mode

Is the original file truncated before reheadering?

0
Entering edit mode

Nope, the original file is perfectly okay...

1
Entering edit mode

What happens if you cat fixed_header <(samtools view no-mouse.bam) | samtools view -bo no-mouse.rehead.bam -? That's a less efficient reheadering, but avoids issues due to swapping chromosome order.

0
Entering edit mode
4.4 years ago
Luciferine • 0

Thank you Devon, that indeed worked to find out the actual problem, which seems to be, that I have chimeric reads, where one read of a pair maps to a human chromosome and the other one to a mouse chromosome. These read pairs of course lack one read after filtering out all non-human mapped reads, leaving a human read with a reference to a non-existent mouse read. This causes problems in the following steps. I'm trying to get rid of those now.