How to convert fastq to cram, and from cram back to fastq
1
4
Entering edit mode
3.7 years ago
Niek De Klein ★ 2.6k

I want to save my fastq files as CRAM files to save space. However, I do want to be able to get the reads back that I have in my fastq, in case I have to reallign in a different way in the future. Below I have the code that I use to do the conversions to BAM and back again. However, when I then check the number of lines in the original fastq with the converted fastq, the original is much smaller:

original: 17460203 (lines)
converted: 147610328 (lines)

The flagstat of the BAM file gives 71038660 + 2766504 in total (QC-passed reads + QC-failed reads) reads. I don't understand why the newly made has more reads, unless the conversion from BAM to fastq also adds reads that map on multiple locations as multiple reads. Should I remove all duplicated reads and then convert?

# align with STAR
--genomeDir Gencode_human/ \
--outFileNamePrefix example.

# fix mates with picard
java -jar EBROOTPICARD/FixMateInformation.jar \ INPUT=example.bam \ OUTPUT=example.fixmates.bam \ VALIDATION_STRINGENCY=LENIENT \ CREATE_INDEX=true \ SORT_ORDER=coordinate # convert to cram with scramble from io_lib scramble \ -I bam \ -O cram \ -r Gencode_human/GRCh38.p5.genome.fa \ example.fixmates.bam \ example.cram # convert back to bam scramble \ -I cram \ -O bam \ -m \ -r Gencode_human/GRCh38.p5.genome.fa \ example.cram \ example.bam #sort by name samtools sort \ -n \ -o example.sorted.bam \ example.bam #convert to fastq samtools fastq \ -@ 4 \ -1 converted.fastq1.fq \ -2 converted.fastq2.fq \ example.sorted.bam  cram fastq • 6.3k views ADD COMMENT 5 Entering edit mode 3.7 years ago Hello, I guess the problem is that you align your reads. I don't know how samtools fastq handles reads that map to multiple position. It could be that they will be also multiple times in the output. Instead I would suggest this workflow: 1. Create a uBam file using FastqToSam  picard FastqToSam F1=fastq1.fastq.gz F2=fastq2.fastq.gz O=Sample1_unaligned.bam SM=Sample1


2. Convert to CRAM

samtools view -T genome.fa -C -o Sample1.cram Sample1_unaligned.bam  To get your fastq files back do:  samtools fastq -@4 -1 sample1_R1.fastq -2 sample1_R2.fastq Sample1.cram


fin swimmer

0
Entering edit mode

I like my solution above because I can be (quite) sure that my reads aren't processed in an unexpected way.

Nevertheless AFAIK the compression algorithm for CRAM depends on the given reference and the alignment information. As all of our reads are unmapped the compression isn't optimal.

What you could try is to filter out secondary aligned reads when converting to cram:

\$ samtools view -F 256 -T genome.fa -C -o Sample1.cram Sample1_unaligned.bam


I haven't test it. So I'm not sure whether there is an unexpected behavior.

fin swimmer

0
Entering edit mode

Thanks, I'm testing this now and seeing how it compares to removing secondary aligned reads

0
Entering edit mode

0
Entering edit mode

I made a stupid mistake, was counting lines of the gzipped fastq file and comparing to an unzipped fastq file. After fixing that there was still a small difference, as STAR by default doesn't put unmapped reads in the output file, so had to --outSAMunmapped Within option.

After that I got the correct number of lines both with the method I posted above as well as your method. The CRAM file after alignment is 1/2 the size of the CRAM file from the unalinged BAM.

0
Entering edit mode

Hello,

Apologies but what does the -@4 means in your last line please?

Many thanks

0
Entering edit mode

The -@ 4 option means to use 4 additional processing threads when decoding the CRAM file. It won't change the result, just speed it up on multi-core systems.

0
Entering edit mode

Thank you very much !

0
Entering edit mode

You could add SORT_ORDER=unsorted to the picard command since fastq files (if they come right from the sequencer) are already name-sorted, and by default picard would sort it again. Picard can also write to stdout if you use O=/dev/stdout so one could pipe it directly to samtools.