"These three lanes represent the whole exome sequence of a single patient. If I had not known that these three bam files belong to the same sequencing run, is there a way to figure out that these files are from the same study and from different lanes?"
The short answer to your first question is that nothing intrinsic to the BAM format lets you derive this information with certainty. Your rephrasing actually asks a slightly different, but related and equally important, question. The answer to both is that you can only hope that the data providers followed a good scientific record-keeping regime, such as MINSEQE, outside of the BAM files.
BAM file headers are not sufficiently structured to represent an experimental design. The headers may contain "read group" (@RG) records. In the current SAM specification only the read group ID is mandatory; the "sample name" (SM) is optional, although many downstream tools expect it. What counts as a valid "sample name" is not specified. If your 3 lanes are a single sample split into 3 lanes at the point of loading onto the flowcell(s), then they will probably have the same "sample name". There are also optional "library" (LB) and "description" (DS) fields that may be present in a "read group" record, which may tell you something. The sequencing platform (PL, e.g. Illumina) and platform unit (PU, e.g. lane) fields may also tell you something, as might the date of sequencing (DT).
Unfortunately, most BAM header fields are optional and IMO they are too vaguely defined to be very useful. They are particularly difficult to use computationally because their values are effectively free text.
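If you want to see what the read group records in your three files actually say, a minimal sketch along these lines may help. It assumes you have already dumped each header to text (for example with `samtools view -H file.bam`) and it only parses the tab-separated @RG lines. The header below, including the sample, library and flowcell names, is invented for illustration, and the function names are my own.

```python
from collections import defaultdict


def parse_read_groups(header_text):
    """Return one dict of tag -> value for each @RG line in a SAM header."""
    read_groups = []
    for line in header_text.splitlines():
        if not line.startswith("@RG"):
            continue
        fields = {}
        # Each field after the record type looks like TAG:value.
        for field in line.split("\t")[1:]:
            tag, _, value = field.partition(":")
            fields[tag] = value
        read_groups.append(fields)
    return read_groups


def group_by_sample(read_groups):
    """Map each sample name (SM) to the platform units (PU) seen for it.

    Falls back to the read group ID when PU is absent, and to a
    placeholder when SM is absent, since both tags are optional.
    """
    by_sample = defaultdict(list)
    for rg in read_groups:
        sample = rg.get("SM", "<no SM>")
        by_sample[sample].append(rg.get("PU", rg.get("ID")))
    return dict(by_sample)


# Hypothetical header: one sample, one library, three lanes.
EXAMPLE_HEADER = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@RG\tID:run1.1\tSM:patient1\tLB:exome_lib1\tPL:ILLUMINA\tPU:FC1.1\n"
    "@RG\tID:run1.2\tSM:patient1\tLB:exome_lib1\tPL:ILLUMINA\tPU:FC1.2\n"
    "@RG\tID:run1.3\tSM:patient1\tLB:exome_lib1\tPL:ILLUMINA\tPU:FC1.3\n"
)

if __name__ == "__main__":
    print(group_by_sample(parse_read_groups(EXAMPLE_HEADER)))
```

Matching SM and LB values with different PU values would be consistent with one library run on several lanes, but because the values are free text this is only a hint, not proof.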
"How can I understand if the bam files contained aligned reads or unaligned reads?"
They may contain both. Each alignment record contains a flag field, which is an integer. This is interpreted in its binary representation, with each bit having a different meaning. There is a bit (0x4) to indicate that the query read is unmapped and a bit (0x8) to indicate that its mate is unmapped (PacBio reads will cause problems!). The mate bit is only meaningful when the paired bit (0x1) is also set. You will need to scan the file to count the different flags.
Some sequencing centres use BAM files for all unaligned reads because they contain a superset of the data found in Fastq files.
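The flag test described above can be sketched in a few lines. The bit values 0x1 (paired), 0x4 (read unmapped) and 0x8 (mate unmapped) come from the SAM specification; the function names and the list of example flags are made up for illustration. In practice the flags would come from the second column of `samtools view` output or from whatever BAM library you use.

```python
from collections import Counter

# Flag bits as defined in the SAM specification.
PAIRED = 0x1         # read is one of a pair
UNMAPPED = 0x4       # the read itself is unmapped
MATE_UNMAPPED = 0x8  # the read's mate is unmapped


def classify(flag):
    """Classify one alignment record by its integer flag."""
    if flag & UNMAPPED:
        return "unmapped"
    # The mate bit only means something for paired reads.
    if (flag & PAIRED) and (flag & MATE_UNMAPPED):
        return "mapped, mate unmapped"
    return "mapped"


def count_flags(flags):
    """Tally the classification of every flag in an iterable."""
    return Counter(classify(flag) for flag in flags)


if __name__ == "__main__":
    # Hypothetical flags, as they might appear in a mixed file.
    example_flags = [99, 147, 77, 141, 73, 133, 0, 4]
    for category, n in sorted(count_flags(example_flags).items()):
        print(category, n)
```

If every record falls in the "unmapped" category, the file is almost certainly an unaligned BAM. `samtools flagstat` reports similar totals without any code.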
"Do I need to merge these before I do any analysis like aligning, variant calling etc?"
Not necessarily. In fact, we often split BAM files into many parts to speed up alignment by mapping them in parallel where appropriate, e.g. when using BWA. We might then merge the results afterwards. It depends on the software you are using.
Answered 4.2 years ago by Keith James (modified 4.2 years ago)