Question: Merging SAM/BAM files
gravatar for Sammy
15 months ago by
Sammy10 wrote:

Hello. I am fairly new to bioinformatics. I want to merge quite a few BAM files resulted from TopHat alignments. So I can play with them easier (as I don't have a lot of storage space) i cut them by the region of interest. So now I only have around 10kb per BAM file. I am not sure how to correctly merge BAM files with samtools (in Linux). Do I have to change the headers? More specifically to remove the headers for which I don't have the alignment lines anymore? I noticed that I have the same headers before and after I cut them. Also, do I need to introduce an @RQ header before merging them, assuming that I don't have one?

rna-seq • 731 views
ADD COMMENTlink written 15 months ago by Sammy10

How did you cut the BAM files? Can you post the command that you used?

ADD REPLYlink written 15 months ago by Friederike5.9k

I used Galaxy for that so I avoid downloading them in my PC. I left the default options. I just introduced the region that I wanted.

ADD REPLYlink written 15 months ago by Sammy10

Which function did you use in Galaxy and with which settings?

ADD REPLYlink written 15 months ago by Friederike5.9k
samtools view -b -h my.bam RNAME chr11:1200000-1300000
ADD REPLYlink written 15 months ago by Sammy10

Just so you get consistent help you may want to post this over at which is the official support site for PSU Galaxy.

ADD REPLYlink modified 15 months ago • written 15 months ago by genomax87k

out of curiosity: why didn't you the "Merge BAM Files" function in Galaxy?

ADD REPLYlink written 15 months ago by Friederike5.9k

I have merged them in both galaxy and samtools. It worked in both cases. But this is only for two BAMs. I am going to have to work soon with hundreds of samples. Right now, I am doing investigations about merging to make sure that this is the appropriate next step. I want to calculate the coverage per locus from the resulting BAM, in the end, after some more downstream processing.

I still do not understand what is the exact link between merging and the headers in SAM and why some people choose to add/replace headers (like @RQ and @CO). I mean, I know that @RQ has to be unique but besides this?

ADD REPLYlink modified 15 months ago • written 15 months ago by Sammy10
gravatar for Friederike
15 months ago by
United States
Friederike5.9k wrote:

Alright, so what you're actually asking is background info about the SAM header.

Just to cover all the basics (you may be aware of them already, but bear with me), here's an image I often use when I teach, which shows a schema of a typical SAM/BAM file.

The header section and the alignment section are very different in terms of their content and their format as you can see.

  • the header contains information about how the alignment was generated and stored
  • every line belonging to the header section begins with @, followed by a "record type", such as SQ, followed by tag:value pairs where tag is a two-letter string (such as LN). Every record type has well defined tags that belong to it and every tag has a specific way in which its values are denoted. Take, for example, the record type SQ, which stands for "reference sequence dictionary" in SAM spec speak or "reference genome" in bioinfo terms
    • if you look up the SAM file specs, pages 3-5, you can see that for SQ the following tags are allowed: SN, LN, AH, AN, AS, DS, M5, SP, UR
      • all of this is optional

A typical entry for a hypothetical organism with 3 chromosomes of length 1000, 1500, and 3000, could be represented as follows in the header section:

@SQ SN:chr1 LN:1000
@SQ SN:chr2 LN:1500
@SQ SN:chr3 LN:3000

So, in summary:

  • the header is theoretically optional, but often the very basic information such as the lengths of the chromosomes of the reference genomes are required by downstream tools
  • EDIT following a comment by Genomax: if you decide to include a header with certain entries, such as SQ, there are tags that may be required for a properly formatted SAM/BAM file (those are marked by asterisks in the SAM specs)
  • the CO line is handy to keep track of the specific alignment command that was used to generate a BAM file -- if you're merging multiple BAM files, you either want to have multiple CO lines to indicate the differences between the commands that may have been used or you may just want to retain a single one if you used the same command for all the individual files or something else entirely - the choice is yours as to how much meta-data you want to keep in the header.

If you're just starting out you probably don't want to add your own custom-brewed entries to the header, I would recommend to use the one that contains the info that are relevant and correct for all the BAM files you're merging.

One more comment: I don't think you meant RQ, I assume you're referring to @RG. To find out more about the significance of that particular entry, you may find this biostars post helpful.

And one last question: Why do you want to merge the files in the first place?

ADD COMMENTlink modified 15 months ago • written 15 months ago by Friederike5.9k

Yes, I meant @RG. Thanks for the answers. Sounds like probably I would want the CO record type.

Right, so to answer your question: I want to merge them so I can have a single BAM file with all the reads there. After that, I am planning to use samtools to create another tab-delimited file (by extracting certain fields with samtools view from the BAM) with the number of reads per position (that comes out from the all 100 samples). Seems pretty straight forward but I actually haven't talked with anyone about this. Am I talking Sci-Fi?

I mean creating the tab-delimited file seems pretty simple. I just have to copy the fields in a .txt file. At this point I just have to learn how to create columns and rows and add them to the same .txt file from my BAM file.

ADD REPLYlink written 15 months ago by Sammy10

Why do you have separate BAM files to begin with?

Going from SAM to text file may not be problematic in a toy example, but if you have a typical NGS experiment with millions to hundred millions of reads, it can become computationally daunting. What is the final question that you'd like to address?

ADD REPLYlink written 15 months ago by Friederike5.9k

Because multiple alignments, multiple BAMs? No more questions. Thanks for your help.

ADD REPLYlink written 15 months ago by Sammy10

sure, but why don't you just do one alignment for all the fastq files, creating one bam file from the get-go? (not saying it's wrong the way you do it, just curious)

ADD REPLYlink written 15 months ago by Friederike5.9k

Huh. Probably because I don't think I know what you mean by that. Can you please give me more details? I am intrigued.

ADD REPLYlink written 15 months ago by Sammy10

I guess I don't know enough about your set-up and overall goal. Typically, there is little need to merge BAM files if you have access to the FASTQ files. For example, if you have multiple FASTQ files because your samples were multiplexed (sequenced across different lanes) and very deeply, then you could supply all FASTQ files belonging to the same sample to the alignment tool, generating one BAM file. So, how did you end up having multiple BAM files to begin with?

EDIT: I guess, I've done too much RNA-seq these days. I may very well be that only STAR accepts multiple FASTQ files, but BWA, for example, might not.

ADD REPLYlink modified 15 months ago • written 15 months ago by Friederike5.9k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 723 users visited in the last hour