Question

FASTQ to BAM to CRAM to FASTQ

0

Entering edit mode

6 months ago

ManuelDB ▴ 80

My NGS bioinformatics analysis starts with an amplicon FASTQ file (only the R1). In my workflow, I finally created a BAM file. Then, I convert this BAM in CRAM for backup

 apptainer exec --bind "$ref_folder":"$ref_folder" "$samtools" samtools view \
                      -C -T $bwarefgenomepath \
                      -o ART03_FINAL.cram \
                      ART03_FINAL.bam

We will backup the CRAM file but I want to create a temporary FASTQ file from the CRAM to check if the original fastq and the new fastq are identical. In the conversion from CRAM to FASTQ, I have some issues.

I have this code:

 samtools fastq -T $bwarefgenomepath ART03_FINAL.cram > BACKUP.fastq
 gzip BACKUP.fastq

The BACKUP.fastq.gz is created but it is much smaller than the original

 103M Oct 17 20:21 ART03_S8_L001_R1_001.fastq.gz
3.7M Oct 19 17:58 BACKUP.fastq.gz
# or before compression 
440M Oct 19 17:58 BACKUP.fastq

To do this I use Slurn and I get this in the .log file

 Homo_sapiens_assembly38.fasta
[W::fastq_state_set] Bad tag format '/ma'; skipping option
[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/cc044cc2256a1141212660fb07b6171e": Network is unreachable
[M::bam2fq_mainloop] discarded 0 singletons
[M::bam2fq_mainloop] processed 1603312 reads

Even if my BACKUP.fast is ok, I am not going to be able to compare the original and the new one via MD5, right??

Both fastq.gz have the same number of read

Samtools • 682 views

ADD COMMENT • link 6 months ago by ManuelDB ▴ 80

3

Entering edit mode

unless the original cram is sorted on query, samtools collate should be used before samtools fastq

ADD REPLY • link 6 months ago by Pierre Lindenbaum 161k

1

Entering edit mode

Beside sorting fastq files' gzip compression ratio might differ too. You can check it with gzip -l

Also fastq files id lines might differ too. You probably want -n to not write the "/1 /2" as well as some way to get the barcodes.

samtools fastq -T $bwarefgenomepath ART03_FINAL.cram > BACKUP.fastq

This would not work since -T is for TAGLIST parameter, you want to give reference with --reference.

ADD REPLY • link 6 months ago by barslmn ★ 2.1k

1

Entering edit mode

You could try samtools import to ingest the FASTQ and output to CRAM in a single step. It also adds a header comment to indicate the appropriate samtools fastq command to reverse the process. Eg:

$ samtools import -1 SRR554369_1.fastq -2 SRR554369_2.fastq|head -3
@HD VN:1.6  SO:unsorted GO:query
@CO Reverse with: samtools fastq -1 R1.fastq -2 R2.fastq 
SRR554369.1 77  *   0   0   *   *   0   0   TAACGAAACCGGCATGGAACCTCCTGTTNTTGTGGCGGGGGGGGGTAGGCTAGAGAGCTGGCGCCCGCAACCCGGGGGGGGACTTTNNNGGGGGGGGCGG    IIHHIIIIIIIIIHIDEGHIIIIIIFEE#CEEA###################################################################

Note however this isn't always completely round-trippable. That's a good idea and something we should consider, but it's complicated as FASTQ is such a bad format! It has a myriad of different vaguarities and brokenness to contend with, that it'd need a plethora of extra aux tags to document how to reverse the process. However there is still merit in doing this as a means for removing the FASTQ from local storage and reducing the number of copies of data on disk.

ADD REPLY • link 6 months ago by jkbonfield ★ 1.2k

0

Entering edit mode

Thanks all for your help! I finally found SamFormatConverter and SamToFastq (both Picard) the right tools to do this. See code here Different number of reads when converting data from FASTQ to BAM and CRAM to FASTQ

ADD REPLY • link 6 months ago by ManuelDB ▴ 80