I'm working with a pipeline where I delete FASTQ files after alignment to save space, so it's important that I can recover them exactly from the resulting CRAM files if needed.
The issue: my current workflow loses part of the original FASTQ header, specifically this section:
<read>:<is filtered>:<control number>:<sample number>
Here’s an example of a typical FASTQ header line:
@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
I align reads per lane like so:
bwa mem \
-K 100000000 \
-Y \
-R "@RG\\tID:FC706VJ:2\\tSM:ATCACG\\tPL:ILLUMINA\\tPU:FC706VJ.2.ATCACG" \
{input.ref} \
{input.fastq1} \
{input.fastq2}
After alignment and CRAM generation, I recover FASTQs using:
samtools sort -n \
-@ 24 \
--reference $ref \
-T $tmp \
-o $queryname_sorted_cram \
$cram_path
samtools fastq -@ 4 -n \
--reference $ref \
-1 >(gzip > $r1_fastq) \
-2 >(gzip > $r2_fastq) \
-0 /dev/null -s /dev/null \
$queryname_sorted_cram
But the recovered reads look like this:
@EAS139:136:FC706VJ:2:2104:15343:197393
The trailing read description (1:Y:18:ATCACG) is missing. As far as I can tell, this information isn't preserved in the alignment output, at least not in its original form.
A workaround I'm considering is to collapse the two fields of the header before alignment using sed:
zcat {input.fastq1} | sed 's/ /|/' # turns the header into one field
So it becomes:
@EAS139:136:FC706VJ:2:2104:15343:197393|1:Y:18:ATCACG
This seems hacky, though: I'm wary of running that sed command across all of my FASTQ files, and I'm not sure whether the modified headers will break downstream tools.
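To sanity-check the idea, here is a minimal sketch of the round trip on a single header line: collapse the two fields with `|` before alignment, then reverse the substitution after `samtools fastq`. (The in-pipeline version would run the same sed commands over the FASTQ streams; this only demonstrates that the substitution is lossless as long as the header contains exactly one space.)

```shell
# Example header from the question; headers are assumed to contain
# exactly one space, separating the read name from its description.
orig='@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG'

# Before alignment: replace the first space with '|' so the read
# description travels with the query name through the aligner.
collapsed=$(printf '%s\n' "$orig" | sed 's/ /|/')
echo "$collapsed"   # @EAS139:136:FC706VJ:2:2104:15343:197393|1:Y:18:ATCACG

# After recovering FASTQs from the CRAM: restore the space.
restored=$(printf '%s\n' "$collapsed" | sed 's/|/ /')
[ "$restored" = "$orig" ] && echo "round trip OK"
```

Note that `|` must not otherwise occur in your read names for the reverse substitution to be safe; any character absent from the names would do.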
You can use reformat.sh from the BBMap suite with the option underscore=t to convert whitespace in read names to underscores. Most aligners drop everything after the first whitespace when reporting alignments; bbmap.sh is one of the few (perhaps the only) aligners that preserves the entire read name in SAM/BAM output.

That said, the information in 1:Y:18:ATCACG is rarely going to be needed, unless you MUST recreate the FASTQ header exactly as it existed in the original file (or the index association with the sample name somehow got lost, which would be a big problem).

Are you never going to need to submit this data to a resource like SRA? Or do you have a copy available, and this deletion is just within the scope of your pipeline? Deleting a primary data source is inviting trouble if you don't have a backup.
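A minimal reformat.sh invocation along these lines might look like the following. Filenames are hypothetical; in/in2/out/out2 are BBMap's standard paired-input parameters, and underscore=t is the option mentioned above.

```shell
# Rewrite paired FASTQs so that whitespace in read names becomes '_',
# keeping the read description attached to the query name.
reformat.sh \
    in=sample_R1.fastq.gz \
    in2=sample_R2.fastq.gz \
    out=renamed_R1.fastq.gz \
    out2=renamed_R2.fastq.gz \
    underscore=t
```

You would then align the renamed files and, after recovery, reverse the substitution (e.g. with sed) if you need the original two-field headers back.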