I have a tumor sample where the per-base sequence quality is very low for both reads (according to FastQC). This is what I did:
First, I converted tumor.bam to FASTQ (Read1.fq and Read2.fq):
bedtools bamtofastq -i tumor.bam -fq Read1.fq -fq2 Read2.fq
The FastQC report was very poor, so I decided to trim the reads. Here's what I used:
sickle pe -f Read1.fq -r Read2.fq -t sanger -o qtrim1.fq -p qtrim2.fq -s single.fq -q 20
Now I have three files: qtrim1.fq, qtrim2.fq and single.fq, all of good quality. I then merged them:
cat qtrim1.fq qtrim2.fq single.fq > merged.fq
Finally, I converted the merged file back to BAM:
java -jar picard.jar FastqToSam F1=merged.fq O=trimmed_bam.bam SM=tumor
The resulting BAM file is very small (~460 MB) compared to the original (~80 GB). I am losing a huge amount of information with this quality trimming. Any suggestions to get past this?
Kindly go through the steps and point out anything that is wrong or that you would have done differently.
Number of reads in original BAM: 1034838478
Number of reads in processed BAM: 8053958
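
To put those two counts in perspective, here is a minimal sketch that computes the percentage of reads retained and counts the records in a FASTQ file, which could be run on each intermediate file to see at which step the reads disappear. It assumes uncompressed FASTQ with exactly four lines per record; `count_fastq_reads` and `percent_retained` are hypothetical helper names, not part of bedtools, sickle or Picard.

```shell
# Count records in an uncompressed FASTQ file.
# Assumption: every record is exactly 4 lines (no wrapped sequence lines).
count_fastq_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# Percentage of reads retained: first argument is the original count,
# second argument is the count after processing.
percent_retained() {
    awk -v orig="$1" -v kept="$2" 'BEGIN { printf "%.2f\n", 100 * kept / orig }'
}

# Counts taken from the two lines above.
percent_retained 1034838478 8053958   # prints 0.78
```

Running something like `count_fastq_reads Read1.fq` and `count_fastq_reads qtrim1.fq` would show whether the reads are lost at the BAM-to-FASTQ conversion or at the trimming step.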