Question

Confusion regarding manual inclusion of read group information from fastq files

2.4 years ago

Gene_MMP8 ▴ 240

I recently received a set of paired-end FASTQ files (WES) from our collaborators and am following the GATK Best Practices workflow. I have completed alignment, sorting, and indexing, and now have a set of BAM files. On inspection, however, I found that the BAM files lack the RG tag that assigns each read to a read group. I have found several online resources that discuss this issue and explain how to add the information manually. All I have are the FASTQ files, so I want to use the header information to assign the read groups myself. This is what the headers look like:

Sample 1 - Read 1 - First 3 headers :

@NB501115:23:H3MJFBGX2:1:11101:3645:1046 1:N:0:CCGTGAGA  
@NB501115:23:H3MJFBGX2:1:11101:16971:1046 1:N:0:CCGTGAGA
@NB501115:23:H3MJFBGX2:1:11101:7432:1048 1:N:0:CCGTGAGA

Sample 1 - Read 2 - First 3 headers :

@NB501115:23:H3MJFBGX2:1:11101:3645:1046 2:N:0:CCGTGAGA
@NB501115:23:H3MJFBGX2:1:11101:16971:1046 2:N:0:CCGTGAGA
@NB501115:23:H3MJFBGX2:1:11101:7432:1048 2:N:0:CCGTGAGA

I did some digging and found that this header is typical output from Illumina's CASAVA 1.8. This is the breakdown of its components:

NB501115 - the unique instrument name  
23 - run id  
H3MJFBGX2 - flowcell id  
1 - flowcell lane  
11101 - tile number within the flowcell lane  
3645 - x-coordinate of the cluster within the tile  
1046 - y-coordinate of the cluster within the tile  
1 - the member of a pair, 1 or 2 (paired-end or mate-pair reads only)  
N - Y if the read is filtered (did not pass), N otherwise  
0 - 0 when none of the control bits are on, otherwise it is an even number  
CCGTGAGA - index sequence
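
The breakdown above can be sketched as a small parser. This is a minimal sketch, assuming the header always has the seven colon-separated name fields and four colon-separated comment fields shown here; the `FastqHeader` type and `parse_header` function are hypothetical names used only for illustration.

```python
from typing import NamedTuple


class FastqHeader(NamedTuple):
    """Fields of a CASAVA 1.8-style FASTQ header (hypothetical container)."""
    instrument: str
    run_id: str
    flowcell: str
    lane: int
    tile: int
    x: int
    y: int
    read: int        # 1 or 2, the member of the pair
    filtered: bool   # True if the read did not pass the filter ("Y")
    control: int
    index: str


def parse_header(line: str) -> FastqHeader:
    """Split one header line into its named components."""
    line = line.strip()
    if not line.startswith("@"):
        raise ValueError("not a FASTQ header line: " + line)
    # The read name and the comment are separated by a single space.
    name, comment = line[1:].split(" ", 1)
    instrument, run_id, flowcell, lane, tile, x, y = name.split(":")
    read, filtered, control, index = comment.split(":")
    return FastqHeader(
        instrument, run_id, flowcell, int(lane), int(tile), int(x), int(y),
        int(read), filtered == "Y", int(control), index,
    )


header = parse_header("@NB501115:23:H3MJFBGX2:1:11101:3645:1046 1:N:0:CCGTGAGA")
print(header.flowcell, header.lane, header.index)
```

Only the instrument, run id, flowcell, lane, and index stay constant across the reads shown above; the tile and the x/y coordinates change from read to read.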

I am now following this solution to extract read group information from the FASTQ headers. The problem is that I cannot work out what the SM, ID, and PU tags should be. The portions of the read name that differ between reads come after the flowcell lane, separated by colons: the tile number and the x- and y-coordinates of the cluster. Should I use those to construct the ID tag? The SM information can be extracted from the file names. I am not sure whether PU is mandatory.
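
For illustration, here is a hedged sketch of one widely used convention, not necessarily what the linked solution does: build ID from flowcell and lane, and PU from flowcell, lane, and index sequence. The tile and x/y coordinates differ for every read, so they do not describe a group of reads. The function name `read_group_line`, the sample name, and the library name are hypothetical; the library (LB) cannot be recovered from the header and has to be supplied by hand.

```python
def read_group_line(header: str, sample: str, library: str,
                    platform: str = "ILLUMINA") -> str:
    """Build an @RG line from the first header of a FASTQ file.

    Assumes a CASAVA 1.8-style header; sample and library are
    supplied by the caller (for example, taken from the file name).
    """
    name, comment = header.strip().split(" ", 1)
    fields = name.lstrip("@").split(":")
    flowcell, lane = fields[2], fields[3]
    barcode = comment.split(":")[3]

    rg_id = f"{flowcell}.{lane}"             # one read group per flowcell lane
    unit = f"{flowcell}.{lane}.{barcode}"    # platform unit
    tags = [
        f"ID:{rg_id}",
        f"SM:{sample}",
        f"LB:{library}",
        f"PL:{platform}",
        f"PU:{unit}",
    ]
    # Fields are joined with a literal backslash-t, the escaped form
    # that can be passed on a command line (e.g. to bwa mem -R).
    return "@RG\\t" + "\\t".join(tags)


print(read_group_line(
    "@NB501115:23:H3MJFBGX2:1:11101:3645:1046 1:N:0:CCGTGAGA",
    sample="sample1",    # hypothetical sample name
    library="lib1",      # hypothetical library name
))
```

If several samples were sequenced in the same lane and their BAM files are later merged, this ID would collide, so in that case the sample name or index would need to be part of the ID as well.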

readgroup sequencing bwa fatsq • 580 views

updated 2.4 years ago by Pierre Lindenbaum 161k • written 2.4 years ago by Gene_MMP8 ▴ 240