I am currently working on next-gen sequencing data and have recently completed my preprocessing pipeline. There are some points I would like opinions on, to check whether I am going in the right direction:
While converting from SAM to BAM I keep only properly paired reads. My command line is as follows:
samtools view -S -@ 30 -M -f 0x02 -b input_sam -o input_bam
where -f 0x02 keeps only properly paired reads. I have checked how many reads I lose this way (reads with flags such as 0x04, 0x08, etc.). It is a very small number: the original SAM file is 26 GB, while a second SAM file containing all the reads without 0x02 (i.e. the missing reads) is 152 MB. So is it OK to discard all reads that are not properly paired?
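For reference, the flag bits mentioned above can be decoded with a small helper. This is only a minimal sketch: describe_flag is a hypothetical name, it takes the FLAG column as a decimal number, and it covers only the four bits relevant here (0x01, 0x02, 0x04, 0x08 as defined in the SAM specification).

```shell
# Hypothetical helper: list which pairing/mapping bits are set in a
# decimal SAM FLAG value. Only bits 0x01, 0x02, 0x04 and 0x08 are checked.
describe_flag() {
    flag=$1
    out=""
    [ $(( $flag & 1 )) -ne 0 ] && out="$out paired"          # 0x01
    [ $(( $flag & 2 )) -ne 0 ] && out="$out proper_pair"     # 0x02
    [ $(( $flag & 4 )) -ne 0 ] && out="$out unmapped"        # 0x04
    [ $(( $flag & 8 )) -ne 0 ] && out="$out mate_unmapped"   # 0x08
    echo "${out# }"
}

describe_flag 99   # prints: paired proper_pair
describe_flag 73   # prints: paired mate_unmapped
```

Comparing read counts is probably more reliable than comparing file sizes: samtools view -c with -f 2 counts the properly paired reads, and with -F 2 it counts the reads that would be dropped.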
My post-processing steps are as follows:
SAM to BAM conversion, keeping only properly paired reads
Sorting the BAM file with sort order "queryname" (because fixmate requires a name-sorted BAM file)
fixmate using samtools
Sorting again with sort order "coordinate" (because samtools rmdup requires a coordinate-sorted BAM file)
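The four steps listed above can be sketched as a single shell function that stops at the first failing command, so a broken intermediate file is not passed on silently. This is a sketch under assumptions, not my exact commands: run, postprocess, DRY_RUN and the output file names are hypothetical, the sort syntax assumes samtools 1.3 or later, and the threading options from my original command are left out.

```shell
# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical wrapper for the four listed steps.
# $1 = input SAM file, $2 = prefix for the (placeholder) output names.
postprocess() {
    in_sam=$1
    prefix=$2
    # 1. SAM -> BAM, keeping only properly paired reads (flag 0x02)
    run samtools view -b -f 2 -o "$prefix.paired.bam" "$in_sam" || return 1
    # 2. sort by read name, as fixmate expects
    run samtools sort -n -o "$prefix.namesorted.bam" "$prefix.paired.bam" || return 1
    # 3. fill in mate information
    run samtools fixmate "$prefix.namesorted.bam" "$prefix.fixmate.bam" || return 1
    # 4. sort by coordinate for the later steps
    run samtools sort -o "$prefix.sorted.bam" "$prefix.fixmate.bam" || return 1
}
```

Setting DRY_RUN=1 before calling postprocess input.sam sample prints the four commands without running them, which makes it easy to review the order of the steps.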
Now the problem: everything is OK up to indel realignment, but after base quality score recalibration I always get a truncated file with a missing EOF marker (I checked this by running
samtools view -c file.bam at every stage). Because of this, the later stages of my pipeline are affected. Surprisingly, GATK still runs on the truncated BAM file, with an error at the last line. I don't know what I am missing, and I am looking for advice for getting better performance.
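The missing EOF can also be checked without decoding the whole file. A complete BAM file ends with a fixed 28-byte empty BGZF block (the EOF marker given in the SAM/BAM specification), so comparing the last 28 bytes is enough. A minimal sketch, where has_bgzf_eof and BGZF_EOF_HEX are hypothetical names and only standard tail, od and tr are assumed:

```shell
# The 28-byte BGZF end-of-file block from the SAM/BAM specification, as hex.
BGZF_EOF_HEX="1f8b08040000000000ff0600424302001b0003000000000000000000"

# Hypothetical helper: succeeds (exit status 0) only if the file given as $1
# ends with the BGZF EOF block.
has_bgzf_eof() {
    tail_hex=$(tail -c 28 "$1" | od -An -v -tx1 | tr -d ' \n')
    [ "$tail_hex" = "$BGZF_EOF_HEX" ]
}
```

For example, has_bgzf_eof file.bam || echo "file.bam is truncated" could be run after each stage. This only looks at the end of the file, so it does not prove that the rest of the BAM is intact; samtools quickcheck performs a similar lightweight check.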
EDIT: Sorry for the incomplete post. I have completed it by adding the last two steps of post-processing. My apologies.
EDIT2: I am posting the warnings and errors (I only recently found them).
While computing the recalibration scores and producing the .table file I get the following warning:
WARN 04:28:08,395 IndexDictionaryUtils - Track knownSites doesn't have a sequence dictionary built in,skipping dictionary validation
While producing the recalibrated BAM I get the following warning:
Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again'
While using that recalibrated BAM file for variant calling with GATK HaplotypeCaller I get the following error:
ERROR MESSAGE: File out_recalibrated_bam.bai is malformed: Premature end-of-file while reading BAM index file out_recalibrated_bam.bai It's likely that this file is truncated or corrupt -- Please try re-indexing the corresponding BAM file.
Any ideas about these error messages?