Question: How Can I Get This Information About BAM Files?
Biomed (Bethesda, MD, USA) wrote, 3.4 years ago:
  1. I have three bam files, each bam file contains data from a sequencing lane. These three lanes represent the whole exome sequence of a single patient. If I had not known that these three bam files belong to the same sequencing run, is there a way to figure out that these files are from the same study and from different lanes?

Another way of asking the same question: let's assume I were given only two of these files. How would I figure out that the third one is missing?

  2. How can I understand if the bam files contained aligned reads or unaligned reads?

  3. Do I need to merge these before I do any analysis like aligning, variant calling etc?

Thank you

modified 23 months ago by Keith James • written 3.4 years ago by Biomed

For 3 - BAM files are (generally) post-alignment

written 3.4 years ago by Aaron Statham

Thanks, but to make sure that they are aligned do I need to convert to SAM and check the flags?

written 3.4 years ago by Biomed

Keith James (UK) wrote, 3.4 years ago:

"These three lanes represent the whole exome sequence of a single patient. If I had not known that these three bam files belong to the same sequencing run, is there a way to figure out that these files are from the same study and from different lanes?"

The short answer to your first question is that nothing intrinsic to BAM format data lets you reliably derive this information. Your rephrasing actually asks a slightly different, but related and equally important, question. The answer to both is that you can only hope that the data providers followed a good scientific record-keeping regime, such as MINSEQE, outside of the BAM files.

BAM file headers are not sufficiently structured to represent an experimental design. The headers may contain "read group" (@RG) records which, if present, must contain a "sample name" (SM); note that later revisions of the SAM specification require only the read group ID. What constitutes a valid "sample name" is not specified. If your 3 lanes are a single sample split into 3 lanes at the point of loading onto the flowcell(s), then they will probably have the same "sample name". There are also optional "library" (LB) and "description" (DS) fields that may be present in a "read group" record, which may tell you something. The sequencing platform (PL, e.g. Illumina) and platform unit (PU, e.g. lane) fields may also tell you something, as might the date of sequencing (DT).

Unfortunately, most BAM header records are optional and IMO their fields are too vaguely defined to be very useful. They are particularly difficult to use computationally, effectively being free text.
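To illustrate what can (and cannot) be recovered from the headers, here is a minimal Python sketch that parses @RG lines from header text and groups files by their SM value. It assumes you have already dumped each header to text (for example with `samtools view -H`); the file names, header contents and the function names `parse_read_groups` and `group_files_by_sample` are hypothetical, made up for this example.

```python
from collections import defaultdict


def parse_read_groups(header_text):
    """Return one dict per @RG line, mapping tag (e.g. 'SM') to its value."""
    read_groups = []
    for line in header_text.splitlines():
        if not line.startswith("@RG"):
            continue
        fields = {}
        for item in line.split("\t")[1:]:
            # Split on the first colon only, so values such as dates survive.
            tag, _, value = item.partition(":")
            fields[tag] = value
        read_groups.append(fields)
    return read_groups


def group_files_by_sample(headers):
    """headers: dict of file name -> header text.

    Returns a dict of sample name -> sorted list of (file name, platform unit).
    Missing SM or PU tags are reported with a placeholder, not guessed.
    """
    by_sample = defaultdict(list)
    for file_name, text in headers.items():
        for rg in parse_read_groups(text):
            sample = rg.get("SM", "<no SM>")
            unit = rg.get("PU", "<no PU>")
            by_sample[sample].append((file_name, unit))
    return {sample: sorted(files) for sample, files in by_sample.items()}


if __name__ == "__main__":
    # Hypothetical header text for three lanes of one sample.
    example_headers = {
        "lane1.bam": "@HD\tVN:1.0\n@RG\tID:rg1\tSM:patientA\tLB:exome1\tPU:fc1.1",
        "lane2.bam": "@HD\tVN:1.0\n@RG\tID:rg2\tSM:patientA\tLB:exome1\tPU:fc1.2",
        "lane3.bam": "@HD\tVN:1.0\n@RG\tID:rg3\tSM:patientA\tLB:exome1\tPU:fc1.3",
    }
    for sample, files in group_files_by_sample(example_headers).items():
        print(sample, files)
```

Files that share an SM value and differ in PU are plausibly lanes of the same sample, but only if whoever produced them filled these fields in consistently. Nothing here can tell you that a third lane exists when you only hold two files.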

"How can I understand if the bam files contained aligned reads or unaligned reads?"

They may contain both. Each alignment record contains a flag field, an integer that is interpreted in its binary representation, with each bit having a different meaning. There is a bit (0x4) to indicate that the query read is unmapped and a bit (0x8) to indicate that its mate is unmapped (PacBio reads will cause problems!). You will need to scan the file to count the different flags.
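As a rough sketch of that scan, the Python below counts mapped and unmapped records from SAM text lines, such as those printed by `samtools view`. The SAM specification defines bit 0x4 as "read unmapped" and 0x8 as "mate unmapped"; the function names and the example records are made up for illustration.

```python
from collections import Counter

# Flag bits as defined in the SAM specification.
READ_UNMAPPED = 0x4
MATE_UNMAPPED = 0x8


def classify_flag(flag):
    """Return 'unmapped' if the read-unmapped bit (0x4) is set, else 'mapped'."""
    return "unmapped" if flag & READ_UNMAPPED else "mapped"


def count_mapping_states(sam_lines):
    """Count mapped and unmapped records in an iterable of SAM text lines.

    Header lines (starting with '@') and blank lines are skipped.
    The flag is the second tab-separated column of each record.
    """
    counts = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        flag = int(line.split("\t")[1])
        counts[classify_flag(flag)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical records: one mapped pair member, two unmapped reads.
    example = [
        "@HD\tVN:1.0",
        "r1\t99\tchr1\t100\t60\t4M\t=\t200\t104\tACGT\tIIII",
        "r2\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
        "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    ]
    print(count_mapping_states(example))
```

If you only need the totals, `samtools view -c -f 4` should give the number of unmapped records and `samtools view -c -F 4` the number of mapped ones, without any conversion to SAM on disk.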

Some sequencing centres use BAM files for all unaligned reads because they contain a superset of the data found in FASTQ files.

"Do I need to merge these before I do any analysis like aligning, variant calling etc?"

Not necessarily. In fact, we often split BAM files into many parts to speed up alignment by mapping them in parallel where appropriate, e.g. when using BWA. Then we might merge them afterwards. It depends on the software you are using.
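The split-then-merge idea can be sketched in a few lines of Python. This is a toy illustration on in-memory (read name, payload) tuples, not what BWA or any particular pipeline does; the function names are hypothetical. It assigns records to parts by a hash of the read name so that both mates of a pair land in the same part.

```python
import zlib


def split_by_read_name(records, n_parts):
    """Distribute (read_name, payload) records over n_parts lists.

    Records sharing a read name (e.g. the two mates of a pair) always
    end up in the same part, because the part index depends only on
    a CRC32 of the name.
    """
    parts = [[] for _ in range(n_parts)]
    for name, payload in records:
        index = zlib.crc32(name.encode("utf-8")) % n_parts
        parts[index].append((name, payload))
    return parts


def merge_parts(parts):
    """Concatenate the parts back into one list (order is by part, not input)."""
    merged = []
    for part in parts:
        merged.extend(part)
    return merged


if __name__ == "__main__":
    # Hypothetical read pairs.
    reads = [("r1", "ACGT"), ("r1", "TTGA"), ("r2", "GGCA"), ("r2", "CATG")]
    parts = split_by_read_name(reads, 2)
    print([len(part) for part in parts])
    print(len(merge_parts(parts)))
```

With real BAM files you would let a tool do the merging, for example samtools merge, and re-sort afterwards if the downstream software expects a particular order.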

modified 3.4 years ago • written 3.4 years ago by Keith James

Thank you for your helpful answer.

written 3.4 years ago by Biomed

Pierre Lindenbaum (France) wrote, 3.4 years ago:

1) ?

2) export BAM to SAM using samtools view and check the flag (see http://picard.sourceforge.net/explain-flags.html)

3) use samtools merge

written 3.4 years ago by Pierre Lindenbaum

I edited the question to the best of my ability. I hope it is clearer now.

written 3.4 years ago by Biomed

Pierre, thanks for your answer. I understand that it is not possible to do this without converting to SAM and looking at the flags. Also, I assume I have to merge the files into a single BAM for all downstream analysis.

written 3.4 years ago by Biomed

No, you can work directly on the BAM file e.g. with Picard (see http://picard.sourceforge.net)

written 3.4 years ago by Keith James