Question: Truncated Bam Error
6 months ago, vivekruhela wrote:

Hi,

I am currently working on next-generation sequencing data and have recently completed my preprocessing pipeline. There are a few points I would like opinions on, to check whether I am going in the right direction:

  1. While converting from SAM to BAM I keep only properly paired reads. My command line is as follows:

    samtools view -S -@ 30 -M -f 0x02 -b input_sam -o input_bam

where -f 0x02 keeps only properly paired reads. I have checked how many reads this drops (unmapped reads 0x04, reads with an unmapped mate 0x08, etc.). Only a small fraction is lost: the original SAM file is 26 GB, while a SAM file containing all reads without the 0x02 flag (i.e. the discarded reads) is 152 MB. So is it OK to discard everything other than properly paired reads?
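File size is only a rough proxy for how many reads the filter drops, because SAM records vary in length; counting records is more direct. A minimal sketch, assuming uncompressed SAM text on standard input. The function name `count_proper_pairs` is hypothetical and used only for illustration.

```shell
# Hypothetical helper: count SAM records with and without the
# "properly paired" flag bit (0x2). Reads SAM text from stdin.
count_proper_pairs() {
    awk -F '\t' '
        /^@/ { next }                      # skip header lines
        {
            # Bit 0x2 is set when floor(FLAG / 2) is odd.
            if (int($2 / 2) % 2 == 1) proper++
            else other++
        }
        END { printf "proper=%d other=%d\n", proper, other }
    '
}
```

It could be used as `samtools view input_sam | count_proper_pairs`. The same two numbers should also come out of `samtools view -c -f 0x2 input_sam` and `samtools view -c -F 0x2 input_sam`.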

  2. My post-processing steps are as follows:

    SAM to BAM conversion, keeping only properly paired reads

    Bam Validation

    Sorting of the BAM file with sort order "queryname" (because fixmate requires a queryname-sorted BAM file)

    fixmate using samtools

    Sorting again with sort order "coordinate" (because samtools rmdup requires a coordinate-sorted BAM file)

    Remove duplicates

    Indel Realignment

    BQSR
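The samtools part of these steps can be sketched as a small shell function that checks every intermediate file as soon as it is written, so a truncated BAM is caught at the stage that produced it. This is only a sketch under assumptions: the function names and file suffixes are hypothetical, `samtools quickcheck` stands in for the unspecified validation step (it only checks the header and the EOF block), the thread and `-M` options from the command above are left out for brevity, and the two GATK steps are not included.

```shell
# Run one stage, then verify its output BAM.
# Usage: stage OUTPUT_BAM COMMAND [ARGS...]
stage() {
    local out="$1"
    shift
    "$@" && samtools quickcheck "$out" || {
        echo "stage failed or output truncated: $out" >&2
        return 1
    }
}

# Hypothetical driver: preprocess INPUT_SAM OUTPUT_PREFIX
preprocess() {
    local sam="$1" p="$2"
    # 1. SAM -> BAM, properly paired reads only
    stage "$p.pp.bam"      samtools view -b -f 0x2 -o "$p.pp.bam" "$sam" &&
    # 2. queryname sort for fixmate
    stage "$p.nsort.bam"   samtools sort -n -o "$p.nsort.bam" "$p.pp.bam" &&
    # 3. fix mate information
    stage "$p.fixmate.bam" samtools fixmate "$p.nsort.bam" "$p.fixmate.bam" &&
    # 4. coordinate sort for rmdup
    stage "$p.csort.bam"   samtools sort -o "$p.csort.bam" "$p.fixmate.bam" &&
    # 5. remove duplicates
    stage "$p.rmdup.bam"   samtools rmdup "$p.csort.bam" "$p.rmdup.bam"
}
```

Because the stages are chained with `&&`, the function stops at the first stage whose output fails the check and returns a non-zero status.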

Everything is fine up to indel realignment, but after base quality score recalibration I always get a truncated file with a missing EOF marker (I checked this by running samtools view -c file.bam at every stage). This affects the later stages of my pipeline. Surprisingly, GATK still works with the truncated BAM file, apart from an error at the last line. I don't know what I am missing, and I am looking for advice on how to improve the pipeline.

EDIT: Sorry for the incomplete post. I have completed it by adding the last two post-processing steps. My apologies.

EDIT2: I am posting the warnings and errors (I recently found them).

While computing the recalibration table (.table file) I get the following warning:

WARN  04:28:08,395 IndexDictionaryUtils - Track knownSites doesn't have a
sequence dictionary built in,skipping dictionary validation

While writing the recalibrated BAM I get the following warning:

Failed to write core dump. Core dumps have been disabled. To enable core dumping,
try "ulimit -c unlimited" before starting Java again'

While using the recalibrated BAM file for variant calling with GATK HaplotypeCaller I get the following error:

ERROR MESSAGE: File out_recalibrated_bam.bai is malformed: Premature end-of-file while
reading BAM index file out_recalibrated_bam.bai It's likely that this file is truncated or corrupt -- 
Please try re-indexing the corresponding BAM file.

Any idea about these error messages?
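The last message suggests re-indexing, but an index is only as good as the BAM it describes: if the BAM itself is truncated, rebuilding the index will not repair it. A minimal sketch of a guarded re-index, assuming samtools is on the PATH; the function name `reindex_bam` is hypothetical.

```shell
# Hypothetical helper: rebuild a BAM index, but only if the BAM itself
# passes samtools quickcheck (header present, EOF block present).
reindex_bam() {
    local bam="$1"
    if ! samtools quickcheck "$bam"; then
        echo "$bam is truncated or corrupt; regenerate it before indexing" >&2
        return 1
    fi
    # Drop stale indexes in both common naming styles (x.bam.bai and x.bai).
    rm -f "$bam.bai" "${bam%.bam}.bai"
    samtools index "$bam"    # writes "$bam.bai"
}
```

If the function returns non-zero, the step that produced the BAM needs to be rerun before any index is built.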

Thanks

modified 6 months ago • written 6 months ago by vivekruhela

What version of samtools?

written 6 months ago by h.mon

I am using samtools 1.7.

written 6 months ago by vivekruhela

Can you check to see if the solutions provided in this thread help: How to systematically check if a bam file is truncated

written 6 months ago by genomax

Thanks for the reply. I have checked the link you sent. It helps when we don't know which BAM file is truncated or missing its EOF marker. I have already checked that both ways, with 'samtools view -c' and 'samtools quickcheck', so I know which file is truncated. My question is: all the post-processing steps work fine except the last one. Why, and how do I correct the error? I also ran 'tail out.bam | hexdump -C' to look for the 28-byte EOF marker, and unfortunately I did not find it. So how do I deal with this error? Thanks.
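The EOF marker check described here can be made exact by comparing only the last 28 bytes of the file against the empty BGZF block defined in the SAM/BAM specification. A minimal sketch, assuming POSIX `tail`, `od` and `tr`; the variable and function names are hypothetical.

```shell
# The 28-byte empty BGZF block that terminates a complete BAM file,
# written as lowercase hex (value from the SAM/BAM specification).
BGZF_EOF_HEX="1f8b08040000000000ff0600424302001b0003000000000000000000"

# Hypothetical helper: succeed only if the file ends with that block.
has_bgzf_eof() {
    local tail_hex
    # Dump the last 28 bytes as hex and strip all whitespace.
    tail_hex=$(tail -c 28 "$1" | od -An -v -tx1 | tr -d '[:space:]')
    [ "$tail_hex" = "$BGZF_EOF_HEX" ]
}
```

For example, `has_bgzf_eof out.bam && echo complete || echo "EOF marker missing"`. This only tests the final block; `samtools quickcheck` additionally checks the header.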

written 6 months ago by vivekruhela

    samtools view -S -@ 30 -M -f 0x02 -b input_sam -o input_bam

You need -h, otherwise your BAM won't have the headers.

written 6 months ago by h.mon

Do you mean that all of the errors are due to this (i.e. my BAM files don't have a header)? Please clarify.

written 6 months ago by vivekruhela

I tried adding -h as well, and the same error occurs. Up to indel realignment the BAM file is OK, but after recalibration the EOF marker is missing and the file is truncated. Here is the exact output of samtools view -c recalibrated_out.bam:

[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
[E::bgzf_read] Read block operation failed with error -1 after 8 of 32 bytes
[main_samview] truncated file.

written 6 months ago by vivekruhela
6 months ago, vivekruhela wrote:
After a lot of research, my problem is finally solved. The cause of the error was as follows:

I am using "The Genome Analysis Toolkit (GATK) v3.8-0-ge9d806836"

The command lines were as follows. For the table file:

java -Xms32g -Djava.io.tmpdir=/tmp -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R reference.fastq -I indel_realignment_outfile.bam -knownSites All_20170710.vcf.gz -o outfile.table

For recalibration of bam:

java -Xms32g -jar GenomeAnalysisTK.jar -T PrintReads -R reference.fastq -I indel_realignment_outfile.bam -BQSR outfile.table -o outfile_recalibrated.bam

GATK 3.8-0 has a memory allocation bug caused by an old version of the Intel GKL (Genomics Kernel Library). The GKL was updated in the latest version, 3.8-1, and in the GATK 4.0 releases, so they don't have this bug.

So I removed -Xms32g, and now it works fine.
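For reference, the two recalibration commands without the -Xms32g option can be sketched as one function that also confirms the output is complete. This is a sketch under assumptions: the function name, its positional parameters and the output file suffixes are hypothetical, GenomeAnalysisTK.jar is assumed to be in the current directory, and `samtools quickcheck` is used as the completeness check.

```shell
# Hypothetical wrapper: recalibrate REF IN_BAM KNOWN_SITES_VCF OUT_PREFIX
recalibrate() {
    local ref="$1" in_bam="$2" known="$3" out="$4"

    # Step 1: build the recalibration table (no -Xms option).
    java -Djava.io.tmpdir=/tmp -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
        -R "$ref" -I "$in_bam" -knownSites "$known" \
        -o "$out.table" || return 1

    # Step 2: write the recalibrated BAM (no -Xms option).
    java -jar GenomeAnalysisTK.jar -T PrintReads \
        -R "$ref" -I "$in_bam" -BQSR "$out.table" \
        -o "$out.recal.bam" || return 1

    # Step 3: confirm the output has its header and EOF block.
    samtools quickcheck "$out.recal.bam"
}
```

A non-zero return from the last step means the recalibrated BAM is still truncated and should not be passed on to variant calling.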

modified 6 months ago • written 6 months ago by vivekruhela

It is odd that the bug bit you with only one file. Thanks for posting the answer to provide closure.

written 6 months ago by genomax

I was also thinking the same.

written 6 months ago by vivekruhela