I want to save my fastq files as CRAM files to save space. However, I do want to be able to get the reads back that I have in my fastq, in case I have to reallign in a different way in the future. Below I have the code that I use to do the conversions to BAM and back again. However, when I then check the number of lines in the original fastq with the converted fastq, the original is much smaller:
original: 17460203 (lines)
converted: 147610328 (lines)
The flagstat of the BAM file gives 71038660 + 2766504 in total (QC-passed reads + QC-failed reads) reads. I don't understand why the newly made has more reads, unless the conversion from BAM to fastq also adds reads that map on multiple locations as multiple reads. Should I remove all duplicated reads and then convert?
# align with STAR STAR --readFilesIn fastq1.fq fastq2.fq \ --genomeDir Gencode_human/ \ --outFileNamePrefix example. # fix mates with picard java -jar $EBROOTPICARD/FixMateInformation.jar \ INPUT=example.bam \ OUTPUT=example.fixmates.bam \ VALIDATION_STRINGENCY=LENIENT \ CREATE_INDEX=true \ SORT_ORDER=coordinate # convert to cram with scramble from io_lib scramble \ -I bam \ -O cram \ -r Gencode_human/GRCh38.p5.genome.fa \ example.fixmates.bam \ example.cram # convert back to bam scramble \ -I cram \ -O bam \ -m \ -r Gencode_human/GRCh38.p5.genome.fa \ example.cram \ example.bam #sort by name samtools sort \ -n \ -o example.sorted.bam \ example.bam #convert to fastq samtools fastq \ -@ 4 \ -1 converted.fastq1.fq \ -2 converted.fastq2.fq \ example.sorted.bam