Question

bwa mem -R (RGline) format for DNBseq fastq

4.7 years ago

noodle ▴ 580

I have a WGS dataset I'd like to align with bwa, but the fastq files are in DNBseq format (shown below) and I'm stuck on the syntax. Does anyone know how to change the read group line "@RG\tID:**\tLB:**\tPL:ILLUMINA\tSM:**\tPU:BARCODE" in bwa mem -aMp -t #ofCPUs ref.fa -R "@RG\tID:**\tLB:**\tPL:ILLUMINA\tSM:**\tPU:BARCODE" > output.sam to make these work? Thanks.

Here's the top line of a fastq file in DNBseq format

@V300020594L1C001R0010000024/1

*Edited for clarity.

DNBseq bwa bwa mem WGS fastq • 4.5k views

ADD COMMENT • link updated 3.7 years ago by xxa • 0 • written 4.7 years ago by noodle ▴ 580

You are referring to the -R option, which sets the read group. From the bwa manual:

-R STR Complete read group header line. '\t' can be used in STR and will be converted to a TAB in the output SAM. The read group ID will be attached to every read in the output. An example is '@RG\tID:foo\tSM:bar'.

If you don't have multiple samples (or don't plan to use a program that requires read groups), you should be able to omit that option. Note: if your downstream analysis requires read groups, you will need to supply them.

Can you show us the original fastq headers in your data?

ADD REPLY • link 4.7 years ago by GenoMax 142k

I have multiple paired-end samples, which is why I was hoping to use -R. This is the header I was provided from the NGS company (BGI), which used DNBseq machines. The only documentation I could find from them shows the below.

FASTQ file sequenced by DNBseq.

@CL100072652L2C001R001_12/1
GCGACCCCAGGTCAGTCGGGACTACCCGCTGAAGTCGGAGGCCAAGCGGT
+
FFFCFFFFFFFFFDFEFFFFEFEF0FFFFEFFFFFFFEFFFFFECGFFFF

I'll try inserting some colon delimiters to see if that helps (like this).

@V300020594:L1:C001:R001:0000024:1
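For anyone who wants to script that split, here is a minimal Python sketch. The field layout (flowcell, lane, column, row, read number, mate) is only inferred from the two read names shown in this thread, not taken from BGI documentation, and the function names are made up for illustration.

```python
import re

# Assumed layout, inferred from the read names in this thread only:
#   <flowcell>L<lane>C<column>R<row>[_]<read number>[/<mate>]
DNBSEQ_NAME = re.compile(
    r"@?(?P<flowcell>[A-Za-z0-9]+?)"  # e.g. V300020594 or CL100072652
    r"L(?P<lane>\d+)"                 # lane
    r"C(?P<column>\d{3})"             # column on the flowcell
    r"R(?P<row>\d{3})"                # row on the flowcell
    r"_?(?P<read>\d+)"                # read number, optionally after "_"
    r"(?:/(?P<mate>[12]))?"           # optional /1 or /2 mate suffix
)


def parse_dnbseq_name(name):
    """Split a DNBseq-style read name into its assumed fields."""
    match = DNBSEQ_NAME.fullmatch(name.strip())
    if match is None:
        raise ValueError(f"not a recognised DNBseq-style read name: {name!r}")
    return match.groupdict()


def to_colon_delimited(name):
    """Rewrite a read name with colon delimiters between the fields."""
    f = parse_dnbseq_name(name)
    parts = [f["flowcell"], "L" + f["lane"], "C" + f["column"],
             "R" + f["row"], f["read"]]
    if f["mate"]:
        parts.append(f["mate"])
    return "@" + ":".join(parts)


print(to_colon_delimited("@V300020594L1C001R0010000024/1"))
print(parse_dnbseq_name("@CL100072652L2C001R001_12/1"))
```

The flowcell and lane fields are probably the useful part here, since they are what you would typically put into a read group. The rewritten names themselves are likely not needed for alignment.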

Thanks.

ADD REPLY • link updated 4.7 years ago by GenoMax 142k • written 4.7 years ago by noodle ▴ 580

Take a look at this page to see if you can use some of the examples to construct an appropriate -R line for your samples.

Something like

@RG\tID:CL1000\tLB:Library_ID\tPL:HELICOS\tSM:SAMPLE_ID\tPU:Index_seq
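As a rough sketch of how such a line could be assembled per sample, the Python below builds the -R string and a bwa mem argument list. The sample, library, and file names are placeholders, and using flowcell.lane for both ID and PU is just one common convention, not something bwa requires. PL:DNBSEQ is an assumption on my part: I believe recent versions of the SAM specification list it, but some downstream tools may only accept older values such as ILLUMINA, so check what your pipeline expects.

```python
import shlex


def read_group(flowcell, lane, sample, library, platform="DNBSEQ"):
    """Build a bwa -R string; the two-character '\\t' is converted to a TAB by bwa."""
    rg_id = f"{flowcell}.{lane}"  # assumed convention: flowcell.lane
    fields = [
        ("ID", rg_id),       # unique per flowcell/lane (and per sample if multiplexed)
        ("SM", sample),      # sample name; callers group reads by this
        ("LB", library),     # library prep identifier
        ("PL", platform),    # sequencing platform (see the caveat above)
        ("PU", rg_id),       # platform unit
    ]
    return "@RG" + "".join(f"\\t{key}:{value}" for key, value in fields)


def bwa_mem_command(ref, fq1, fq2, rg, threads=4):
    """Argument list for paired-end bwa mem; SAM is written to stdout."""
    return ["bwa", "mem", "-M", "-t", str(threads), "-R", rg, ref, fq1, fq2]


# Placeholder names throughout; flowcell and lane come from the read name
# @V300020594L1C001R0010000024/1 posted in the question.
rg = read_group("V300020594", "1", sample="sample1", library="lib1")
cmd = bwa_mem_command("ref.fa", "sample1_1.fq.gz", "sample1_2.fq.gz", rg)
print(shlex.join(cmd))  # shell-quoted command, ready to redirect to a .sam file
```

With several samples you would call read_group once per sample (and per lane, if a sample was split across lanes) so that each alignment carries its own ID and SM.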

This bioRxiv paper says that an accompanying repo has code for converting BGI-formatted fastq headers to Illumina format, but I am not able to find a link to the GitHub repo. Take a look around to see if you can find it.

ADD REPLY • link 4.7 years ago by GenoMax 142k

@joe, I got Nebula (BGI/DNBseq) WGS results and I'm new to this. Can you please share the BWA MEM command parameters you used to align your DNBseq data?

ADD REPLY • link 3.7 years ago by xxa • 0

DNBseq data should be no different from any other sequencing data: it is Sanger-encoded (Phred+33) fastq. Use standard options to start the analysis.
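If you want to sanity-check the encoding yourself, a small Python sketch along these lines should do. It uses the quality string from the BGI example record posted earlier in this thread; the function names are made up, and the cutoff relies on the usual heuristic that Phred+64 (and old Solexa+64) quality strings never contain characters below ';' (ASCII 59).

```python
def phred33_scores(quality):
    """Decode a quality string as Sanger / Phred+33 scores."""
    return [ord(ch) - 33 for ch in quality]


def must_be_phred33(quality):
    """True if the string contains a character that +64 encodings cannot produce."""
    return min(ord(ch) for ch in quality) < 59


# Quality line from the DNBseq example record shown earlier in the thread.
quality = "FFFCFFFFFFFFFDFEFFFFEFEF0FFFFEFFFFFFFEFFFFFECGFFFF"
scores = phred33_scores(quality)
print(min(scores), max(scores), must_be_phred33(quality))
```

A single read is only weak evidence, so in practice you would run the same check over a few thousand records before trusting the result.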

ADD REPLY • link 3.7 years ago by GenoMax 142k