Firstly, I am only a Bioinformatics student, I am sorry if what I am saying isn't clear. I was asked to downsample one FASTQ file (File 1) to the same number of reads than another sample's FASTQ file (File 2) (already done), and after that to downsample this FASTQ File 1 to the same number of bases than the FASTQ File 2. These files are from Illumina paired-end DNA sequencing runs on a NextSeq 550.
For the context, these two samples are from the same yeast strain but are from 2 different sequencing runs using 2 different library preparation kits. These data will be used to do a draft de novo assembly in order to compare which file, at equal coverage, generates the best assembly (and then which library preparation kit would be better for this case).
- File 1 (to downsample) : Read 1 = 909314392 bp / Read 2 = 908769646 bp
- File 2 (targeted # of bases) : Read 1 = 754861054 bp / Read 2 = 755899824 bp
I haven't seen as much topics about downsampling by number of bases than about downsampling by number of reads. I have found this topic about using BBmap's
reformat.sh to do it : Base count NOT read count - Downsampling to specific number of bases not reads?
In fact, I am kind of stuck because this is paired-end, I am not sure how to do it properly. I used reformat.sh with the sbt option and using read 1 and read 2 Fastq files (File 1) as input files but I can only give one number of bases to the sbt option, so I have choosen the File 2 Read 1 length = 754861054 bp :
reformat.sh in=File1_R1.fastq in2=File1_R2.fastq out=File1_R1_subsampled.fastq out2=File1_R1_subsampled.fastq sbt=754861054
It resulted in two output files of the half of 754861054 bp : File1_R1_subsampled = 377540241 bp / File1_R2_subsampled = 377320812 bp. I supposed that I had to give the double of 754861054 bp (1509722108) to the sbt option in order to obtain 2 files of approximately 754861054 bp length. So this is what I've tried next :
reformat.sh in=File1_R1.fastq in2=File1_R2.fastq out=File1_R1_subsampled.fastq out2=File1_R1_subsampled.fastq sbt=1509722108
Which gave me in the end : File1_R1_subsampled = 755092384 bp / File1_R2_subsampled = 754629644 bp. Which is better in term of number of bases but I am not sure if this is accurate. Since I've used this command with 2 input files,
reformat.sh must have understood that it were paired FASTQ files.
I have also tried to it by downsampling separately the R1 file then the R2 file but, even if it returned the right number of bases in the end, I guess that they won't be paired anymore and it won't be useful for the next step which is the assembly if it is the case...
Does anyone know if the method I have used (second code with option
sbt=1509722108) is a correct way to downsample paired-end FASTQ files with a precise number of bases or is it totally non-sense ?
I was told that I could do it manually but I am not sure how, is there any script permitting to remove X last bases from a FASTQ file (if this is how we do it manually, I am not sure about that neither) or to downsample by giving an exact number of bases ?
Thank you in advance, any help would really be appreciated.