Question: How to convert fastq to cram, and from cram back to fastq
1
gravatar for Niek De Klein
9 months ago by
Niek De Klein2.5k
Netherlands
Niek De Klein2.5k wrote:

I want to save my fastq files as CRAM files to save space. However, I do want to be able to get the reads back that I have in my fastq, in case I have to reallign in a different way in the future. Below I have the code that I use to do the conversions to BAM and back again. However, when I then check the number of lines in the original fastq with the converted fastq, the original is much smaller:

original: 17460203 (lines)
converted: 147610328 (lines)

The flagstat of the BAM file gives 71038660 + 2766504 in total (QC-passed reads + QC-failed reads) reads. I don't understand why the newly made has more reads, unless the conversion from BAM to fastq also adds reads that map on multiple locations as multiple reads. Should I remove all duplicated reads and then convert?

# align with STAR
STAR --readFilesIn fastq1.fq fastq2.fq \
          --genomeDir Gencode_human/ \
          --outFileNamePrefix example.

# fix mates with picard
java -jar $EBROOTPICARD/FixMateInformation.jar \
             INPUT=example.bam \
             OUTPUT=example.fixmates.bam \
             VALIDATION_STRINGENCY=LENIENT \
             CREATE_INDEX=true \
             SORT_ORDER=coordinate


# convert to cram with scramble from io_lib
scramble \
  -I bam \
  -O cram \
  -r Gencode_human/GRCh38.p5.genome.fa \
  example.fixmates.bam \
  example.cram 

# convert back to bam 
scramble \
   -I cram \
  -O bam \
  -m \
  -r Gencode_human/GRCh38.p5.genome.fa \
  example.cram \
  example.bam 

#sort by name
  samtools sort \
    -n \
    -o example.sorted.bam \
     example.bam 

#convert to fastq
samtools fastq \
  -@ 4 \
  -1 converted.fastq1.fq \
  -2 converted.fastq2.fq \
  example.sorted.bam
cram fastq • 1.5k views
ADD COMMENTlink modified 9 months ago by finswimmer11k • written 9 months ago by Niek De Klein2.5k
2
gravatar for finswimmer
9 months ago by
finswimmer11k
Germany
finswimmer11k wrote:

Hello,

I guess the problem is that you align your reads. I don't know how samtools fastq handles reads that map to multiple position. It could be that they will be also multiple times in the output.

Instead I would suggest this workflow:

1. Create a uBam file using FastqToSam

$ picard FastqToSam F1=fastq1.fastq.gz F2=fastq2.fastq.gz O=Sample1_unaligned.bam SM=Sample1

2. Convert to CRAM

$ samtools view -T genome.fa -C -o Sample1.cram Sample1_unaligned.bam

To get your fastq files back do:

$ samtools fastq -@4 -1 sample1_R1.fastq -2 sample1_R2.fastq Sample1.cram

fin swimmer

ADD COMMENTlink modified 9 months ago • written 9 months ago by finswimmer11k

I like my solution above because I can be (quite) sure that my reads aren't processed in an unexpected way.

Nevertheless AFAIK the compression algorithm for CRAM depends on the given reference and the alignment information. As all of our reads are unmapped the compression isn't optimal.

What you could try is to filter out secondary aligned reads when converting to cram:

$ samtools view -F 256 -T genome.fa -C -o Sample1.cram Sample1_unaligned.bam

I haven't test it. So I'm not sure whether there is an unexpected behavior.

fin swimmer

ADD REPLYlink modified 9 months ago • written 9 months ago by finswimmer11k

Thanks, I'm testing this now and seeing how it compares to removing secondary aligned reads

ADD REPLYlink written 9 months ago by Niek De Klein2.5k

Fine, please report your result then.

ADD REPLYlink written 9 months ago by finswimmer11k

I made a stupid mistake, was counting lines of the gzipped fastq file and comparing to an unzipped fastq file. After fixing that there was still a small difference, as STAR by default doesn't put unmapped reads in the output file, so had to --outSAMunmapped Within option.

After that I got the correct number of lines both with the method I posted above as well as your method. The CRAM file after alignment is 1/2 the size of the CRAM file from the unalinged BAM.

ADD REPLYlink written 9 months ago by Niek De Klein2.5k

Hello,

Apologies but what does the -@4 means in your last line please?

Many thanks

ADD REPLYlink written 4 months ago by Aurelie MLB320

The -@ 4 option means to use 4 additional processing threads when decoding the CRAM file. It won't change the result, just speed it up on multi-core systems.

ADD REPLYlink written 3 months ago by Fwip490

Thank you very much !

ADD REPLYlink written 3 months ago by Aurelie MLB320
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 882 users visited in the last hour