How to preserve full FASTQ header through alignment and CRAM recovery
1
1
Entering edit mode
3 months ago
curious ▴ 890

I'm working with a pipeline where I delete FASTQ files after alignment to save space, so it's important that I can recover them exactly from the resulting CRAM files if needed.

The issue: my current workflow loses part of the original FASTQ header, specifically this section:

<read>:<is filtered>:<control number>:<sample number>

Here’s an example of a typical FASTQ field 1 line:

@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG

I align reads per lane like so:

bwa mem \
  -K 100000000 \
  -Y \
  -R "@RG\\tID:FC706VJ:2\\tSM:ATCACG\\tPL:ILLUMINA\\tPU:FC706VJ.2.ATCACG" \
  {input.ref} \
  {input.fastq1} \
  {input.fastq2}

After alignment and CRAM generation, I recover FASTQs using:

samtools sort -n \
  -@ 24 \
  --reference $ref \
  -T $tmp \
  -o $queryname_sorted_cram \
  $cram_path


samtools fastq -@ 4 -n \
  -1 >(gzip > $r1_fastq) \
    --reference $ref \
  -2 >(gzip > $r2_fastq) \
  -0 /dev/null -s /dev/null \
  $queryname_sorted_cram

But the recovered reads look like this:

@EAS139:136:FC706VJ:2:2104:15343:197393

The trailing read info (1:Y:18:ATCACG) is missing from field 1. I believe this information isn't preserved in the same format in the alignment output.

A workaround I'm considering is to collapse the two fields of the header before alignment using sed:

zcat {input.fastq1} | sed 's/ /|/'  # turns the header into one field

So it becomes:

@EAS139:136:FC706VJ:2:2104:15343:197393|1:Y:18:ATCACG

This seems hacky though and I am afraid of running that sed command across all my fastq files and not sure if this will break downstream tools

bam fastq cram • 1.0k views
ADD COMMENT
0
Entering edit mode

This seems hacky though

Then you can use reformat.sh from BBMap suite with option underscore=t to convert white space in read names to an underscore.

reformat.sh -Xmx4g in1=R1.fq.gz in2=R2.fq.gz out1=Mod_R1.fq.gz out2=Mod_R2.fq.gz underscore=t

Most aligners drop characters after the first white space when reporting alignments. bbmap.sh is one of the few (perhaps only) aligner that will preserve the entire read name in SAM/BAM output.

That said information in 1:Y:18:ATCACG is rarely going to be needed ... unless you MUST recreate the fastq header that existed in the original file (or the index association with the sample name some how got lost, which would be a big problem).

ADD REPLY
0
Entering edit mode

I delete FASTQ files after alignment to save space

Are you never going to need to submit this data to a resource like SRA? Or do you have a copy available and this deletion is just within the scope of your pipeline. Deleting primary data source is inviting trouble (if you don't have a backup).

ADD REPLY
3
Entering edit mode
3 months ago

: samtools sort -n | samtools fastq should be replaced with samtools collate| samtools fastq . Furthermore, samtools fastq is able to gzip the output fastq .

and

gunzip -c ${fastq} | tr " " "|" 

and I don't know anyway to keep the original names but you can find the information about your fields https://en.wikipedia.org/wiki/FASTQ_format

    1   the member of a pair, 1 or 2 (paired-end or mate-pair reads only)  => found in the SAM flag
    Y   Y if the read is filtered (did not pass), N otherwise => found in the sam flag
    18  0 when none of the control bits are on, otherwise it is an even number => in the sam flag
    ATCACG  index sequence => can be set using the READ GROUPS
ADD COMMENT
0
Entering edit mode

Thanks. Do people just not worry about keeping individual read barcodes with the alignment? ~97% of my barcodes in the fastq file are identical but ~3% have some minor base mismatches so there is some information there.

I am also worried about having more that 8 ":" delimited fields in the QNAME of the cram following concatenation with "|", eg:

@EAS139:136:FC706VJ:2:2104:15343:197393|1:Y:18:ATCACG

wouldn't this break tools that expect only 8 fields?

ADD REPLY
0
Entering edit mode

the only tool I know which uses this information is Markduplicate but you can always set a regex to set the fields: https://gatk.broadinstitute.org/hc/en-us/articles/360037052812-MarkDuplicates-Picard#--READ_NAME_REGEX

Regular expression that can be used to parse read names in the incoming SAM file. Read names are parsed to extract three variables: tile/region, x coordinate and y coordinate. These values are used to estimate the rate of optical duplication in order to give a more accurate estimated library size. ...

ADD REPLY

Login before adding your answer.

Traffic: 4594 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6