Question

How to convert fastq to cram, and from cram back to fastq

4

Entering edit mode

5.6 years ago

Niek De Klein ★ 2.6k

I want to save my fastq files as CRAM files to save space. However, I do want to be able to get the reads back that I have in my fastq, in case I have to reallign in a different way in the future. Below I have the code that I use to do the conversions to BAM and back again. However, when I then check the number of lines in the original fastq with the converted fastq, the original is much smaller:

original: 17460203 (lines)
converted: 147610328 (lines)

The flagstat of the BAM file gives 71038660 + 2766504 in total (QC-passed reads + QC-failed reads) reads. I don't understand why the newly made has more reads, unless the conversion from BAM to fastq also adds reads that map on multiple locations as multiple reads. Should I remove all duplicated reads and then convert?

# align with STAR
STAR --readFilesIn fastq1.fq fastq2.fq \
          --genomeDir Gencode_human/ \
          --outFileNamePrefix example.

# fix mates with picard
java -jar $EBROOTPICARD/FixMateInformation.jar \
             INPUT=example.bam \
             OUTPUT=example.fixmates.bam \
             VALIDATION_STRINGENCY=LENIENT \
             CREATE_INDEX=true \
             SORT_ORDER=coordinate


# convert to cram with scramble from io_lib
scramble \
  -I bam \
  -O cram \
  -r Gencode_human/GRCh38.p5.genome.fa \
  example.fixmates.bam \
  example.cram 

# convert back to bam 
scramble \
   -I cram \
  -O bam \
  -m \
  -r Gencode_human/GRCh38.p5.genome.fa \
  example.cram \
  example.bam 

#sort by name
  samtools sort \
    -n \
    -o example.sorted.bam \
     example.bam 

#convert to fastq
samtools fastq \
  -@ 4 \
  -1 converted.fastq1.fq \
  -2 converted.fastq2.fq \
  example.sorted.bam

cram fastq • 8.5k views

ADD COMMENT • link updated 5.6 years ago by finswimmer 16k • written 5.6 years ago by Niek De Klein ★ 2.6k

score 6 · Accepted Answer · 2018-08-29

6

Entering edit mode

5.6 years ago

finswimmer 16k

Hello,

I guess the problem is that you align your reads. I don't know how samtools fastq handles reads that map to multiple position. It could be that they will be also multiple times in the output.

Instead I would suggest this workflow:

1. Create a uBam file using FastqToSam

$ picard FastqToSam F1=fastq1.fastq.gz F2=fastq2.fastq.gz O=Sample1_unaligned.bam SM=Sample1

2. Convert to CRAM

$ samtools view -T genome.fa -C -o Sample1.cram Sample1_unaligned.bam

To get your fastq files back do:

$ samtools fastq -@4 -1 sample1_R1.fastq -2 sample1_R2.fastq Sample1.cram

fin swimmer

ADD COMMENT • link 5.6 years ago by finswimmer 16k

0

Entering edit mode

I like my solution above because I can be (quite) sure that my reads aren't processed in an unexpected way.

Nevertheless AFAIK the compression algorithm for CRAM depends on the given reference and the alignment information. As all of our reads are unmapped the compression isn't optimal.

What you could try is to filter out secondary aligned reads when converting to cram:

$ samtools view -F 256 -T genome.fa -C -o Sample1.cram Sample1_unaligned.bam

I haven't test it. So I'm not sure whether there is an unexpected behavior.

fin swimmer

ADD REPLY • link 5.6 years ago by finswimmer 16k

0

Entering edit mode

Thanks, I'm testing this now and seeing how it compares to removing secondary aligned reads

ADD REPLY • link 5.6 years ago by Niek De Klein ★ 2.6k

0

Entering edit mode

Fine, please report your result then.

ADD REPLY • link 5.6 years ago by finswimmer 16k

0

Entering edit mode

I made a stupid mistake, was counting lines of the gzipped fastq file and comparing to an unzipped fastq file. After fixing that there was still a small difference, as STAR by default doesn't put unmapped reads in the output file, so had to --outSAMunmapped Within option.

After that I got the correct number of lines both with the method I posted above as well as your method. The CRAM file after alignment is 1/2 the size of the CRAM file from the unalinged BAM.

ADD REPLY • link 5.6 years ago by Niek De Klein ★ 2.6k

0

Entering edit mode

Hello,

Apologies but what does the -@4 means in your last line please?

Many thanks

ADD REPLY • link 5.2 years ago by Aurelie MLB ▴ 360

0

Entering edit mode

The -@ 4 option means to use 4 additional processing threads when decoding the CRAM file. It won't change the result, just speed it up on multi-core systems.

ADD REPLY • link 5.1 years ago by Fwip ▴ 500

0

Entering edit mode

Thank you very much !

ADD REPLY • link 5.1 years ago by Aurelie MLB ▴ 360

0

Entering edit mode

You could add SORT_ORDER=unsorted to the picard command since fastq files (if they come right from the sequencer) are already name-sorted, and by default picard would sort it again. Picard can also write to stdout if you use O=/dev/stdout so one could pipe it directly to samtools.

ADD REPLY • link 4.0 years ago by ATpoint 81k