Unable to overcome troubles in whole exome analysis
1
0
Entering edit mode
2.6 years ago
vinayjrao ▴ 250

Hi, I am working on whole exome sequencing data whose pipeline has been standardized in the lab as such -

Align with bwa-mem --> SortSam based on coordinates with Picard --> MarkDuplicates --> Remove sequencing duplicates --> Variant calling --> BQSR --> Variant calling --> Annotate with annovar

When I started working on this (on our institutional server), I had not created a java temp directory and it was working fine, but over time I started receiving

Exception in thread "main" htsjdk.samtools.util.RuntimeIOException: java.io.IOException: No space left on device
Caused by: java.io.IOException: No space left on device

errors, because of which I created the java temp directory. Upon doing so I am receiving the following error -

Exception in thread "main" htsjdk.samtools.SAMException: Value was put into PairInfoMap more than once.  1: null:A01223:29:HY2N7DMXX:2:2306:22562:23375
    at htsjdk.samtools.CoordinateSortedPairInfoMap.ensureSequenceLoaded(CoordinateSortedPairInfoMap.java:133)
    at htsjdk.samtools.CoordinateSortedPairInfoMap.remove(CoordinateSortedPairInfoMap.java:86)
    at picard.sam.markduplicates.util.DiskBasedReadEndsForMarkDuplicatesMap.remove(DiskBasedReadEndsForMarkDuplicatesMap.java:61)
    at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:528)
    at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:232)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:282)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:98)
    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:108)

I tried working on it by adding read groups, fixing mate pair information, removing duplicates with tools apart from Picard, but nothing seems to be helping. I tried to remove the java temp directory, but I received

Exception in thread "main" htsjdk.samtools.util.RuntimeIOException: java.io.IOException: No space left on device
Caused by: java.io.IOException: No space left on device

yet again. I am looking for a solution as early as possible.

Thanks in advance :)

-Vinay

exome-seq markduplicates picard • 2.0k views
ADD COMMENT
0
Entering edit mode

Pierre, I also tried the solutions in these, and none of them seemed to work for me.The errors continued to stay. Trying multiple solutions, none of which worked led me to hypothesize that maybe the java temp directory was obstructing my analysis pipeline.

ADD REPLY
0
Entering edit mode

Pierre, there was also a comment here that suggested me to grep A01223:29:HY2N7DMXX:2:2306:22562:23375 from my fastq file, and when I tried it, I found it twice in both of my fastq files. The commenter also linked me to two other posts -

Out Of Disk Space With Picard Tools ? and Picard: Value was put into PairInfoMap more than once. Even bwa enabled -M. In the second post, the observation was similar, and you had suggested to remove the duplicates. Could you please advice me on how to do that?

Thank you

Edit: I could use uniq to remove duplicate entries from a regular fastq file, but I have gz files

ADD REPLY
0
Entering edit mode

what is the exact picard command line.

ADD REPLY
0
Entering edit mode

Hi Pierre,

Thank you for your response. The command used to run Picard is -

java -jar /path/to/picard/picard.jar MarkDuplicates INPUT=/path/to/bam/input.bam OUTPUT=/path/to/bam/input_marked.bam METRICS_FILE=/path/to/bam/input_metrics.txt

java -jar /path/to/picard/picard.jar MarkDuplicates REMOVE_SEQUENCING_DUPLICATES=true INPUT=/path/to/bam/input_marked.bam OUTPUT=/path/to/bam/input_deduplicated.bam METRICS_FILE=/path/to/bam/input_metrics.txt

ADD REPLY
1
Entering edit mode
2.6 years ago
vinayjrao ▴ 250

Ok, so I received my solution with a little more keywords in my google search. I ended up finding this post from SEQanswers, where a user named davstern had suggested the use of seqkit rmdup. I tried running this locally on my laptop with 8GB RAM and it took around 5 minutes on each fastq.gz file and was able to successfully remove 30-45 million reads from each file. Before trying this, I also tried running dedupe.sh from the BBMap package locally on the same laptop, but it was an extremely time and memory intensive task. After obtaining the trimmed fastq.gz files, I am now able to successfully run the pipeline.

A big thank you to Pierre for his suggestion on removing the duplicate entries on the fastq files

TL;DR: If running into the error, please grep the read identifier from your fastq files, and if you get multiple hits in your file (multiple matches in forward and/or multiple hits in reverse read), then you will have to de-duplicate your fastq using seqkit rmdup before aligning it to the genome

ADD COMMENT
0
Entering edit mode

it took around 5 minutes on each fastq.gz file and was able to successfully remove 30-45 million reads from each file.

While you have marked this as solved it is unclear as to why you ended up with duplicate sequence entries in your ORIGINAL files. That is not normal and something you should have checked the cause of.

ADD REPLY
0
Entering edit mode

I was thinking the same.

@ vinayjrao

Could you please post identical records here? Two duplicate records would suffice. This might be a bigger problem.

ADD REPLY
0
Entering edit mode

Hi cpad0112,

I have deleted the raw files from my server since my issue seemed to have been solved, but I will post them as soon as possible

ADD REPLY
0
Entering edit mode

Hi GenoMax,

The raw files were provided to us by our collaborators, and we will discuss this with them whenever we get the opportunity

ADD REPLY

Login before adding your answer.

Traffic: 2733 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6