I have been given .BAM files that were generated a while back (4 years ago) and were tasked with unmapping & remapping the reads.
I attempted my usual approach of extracting paired-end reads from .BAM using bedtools' bamToFastq which generated two empty fastq files. I went ahead and attempted to use other available tools (samtools, Picard). With samtools' bam2fq, I was able to extract all of the reads into a single fastq file. However, I noticed that header information of the reads were missing labels that differentiates forward from reverse reads (e.g. /1, /2).
Below is what one of the lines in .BAM file looks like before I attempt to unmap the reads:
@RG ID:ZFXSC.IonXpress_008.1 PL:IONTORRENT PU:PGM/P1.1.17/IonXpress_008 LB:hg19/IonXpress_008 FO:TACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATG TACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTAC GTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACG DT:2014-01-29T15:52:12-0500 SM:978R PG:tmap KS:TCAGTTCCGATAACGAT CN:TorrentServer/GTproton
And below is one of the reads from unmapped reads using samtools:
@ZFXSC:08727:06569 CGTTTAAACTACATGCAGGAACAGCGAA + ::=:5:?5:555:<::990/)//4,,,)
From what I was able to gather, the current unmapped reads contain following information in the header:
Run ID: Every TS analysis gets a run ID, a 5-character string consisting of upper case letters and numbers, assigned. A reanalysis of a specific run will get a different run ID assigned. Example: 0JU8V.
SAM record (read) names: Read names are a combination of the run ID and the chip coordinates of the well that produced the read. The coordinate values are 5-digit numbers and are given in the order row and the column, separated by a colon. Example: 0JU8V:01308:00107.
However, they seem to be missing read group ID, defined as:
Read Group ID: For non-barcoded runs the read group ID is equal to the run ID. For barcoded runs it is a combination of the run ID and the barcode name, separated by a dot. Example: 0JU8V.IonXpress_001
I assumed that I should be able to determine which read belongs to one of the two paired-end read fastq files by looking at the read group ID, but the .BAM file has more than 2 read groups (95 if I counted correctly).
ZFXSC.IonXpress_008.[1,..., 29, 1A,..., 1Z, 2A,..., 2M, A,..., X ].
I attempted using samtools view -c -f 1 on the .BAM file and noticed that it returned 0, which I believe means that the .BAM file was generated from single-end reads.
However, I was told by the person that have given me the .BAM files that he/she believe .BAM files to be generated using paired-end reads. Is there anything else I could do to figure out what exactly is going on here?