Unable to overcome troubles in whole exome analysis
25 days ago
vinayjrao ▴ 200

Hi, I am working on whole-exome sequencing data for which the pipeline has been standardized in the lab as follows -

Align with bwa-mem --> SortSam based on coordinates with Picard --> MarkDuplicates --> Remove sequencing duplicates --> Variant calling --> BQSR --> Variant calling --> Annotate with annovar
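For readers unfamiliar with the tools, the steps above might look roughly like the sketch below. File names are placeholders and the flags are commonly used defaults, not the lab's exact settings:

```shell
# Sketch only: reference, sample, and known-sites file names are illustrative
bwa mem -M -t 8 ref.fa sample_R1.fastq.gz sample_R2.fastq.gz > sample.sam

# Coordinate-sort with Picard
java -jar picard.jar SortSam INPUT=sample.sam OUTPUT=sample_sorted.bam SORT_ORDER=coordinate

# Mark duplicates
java -jar picard.jar MarkDuplicates INPUT=sample_sorted.bam \
    OUTPUT=sample_marked.bam METRICS_FILE=sample_metrics.txt

# BQSR (GATK), then variant calling
gatk BaseRecalibrator -I sample_marked.bam -R ref.fa \
    --known-sites known_sites.vcf -O recal.table
gatk ApplyBQSR -I sample_marked.bam -R ref.fa \
    --bqsr-recal-file recal.table -O sample_recal.bam
gatk HaplotypeCaller -I sample_recal.bam -R ref.fa -O sample.vcf.gz
```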

When I started working on this (on our institutional server), I had not created a Java temp directory and the pipeline ran fine, but over time I started receiving errors of the form

Exception in thread "main" htsjdk.samtools.util.RuntimeIOException: java.io.IOException: No space left on device
Caused by: java.io.IOException: No space left on device

because of which I created the Java temp directory. Upon doing so, I am receiving the following error instead -

Exception in thread "main" htsjdk.samtools.SAMException: Value was put into PairInfoMap more than once.  1: null:A01223:29:HY2N7DMXX:2:2306:22562:23375
at htsjdk.samtools.CoordinateSortedPairInfoMap.remove(CoordinateSortedPairInfoMap.java:86)
at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:232)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:282)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:98)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:108)


I tried to work around it by adding read groups, fixing mate-pair information, and removing duplicates with tools other than Picard, but nothing seems to help. When I tried removing the Java temp directory, the "No space left on device" error returned. I am looking for a solution as early as possible.

• Vinay
exome mark picard seq duplicates

Pierre, I also tried the solutions in those posts, and none of them worked for me; the errors persisted. Trying multiple solutions, none of which worked, led me to hypothesize that the Java temp directory might be obstructing my analysis pipeline.


Pierre, there was also a comment here suggesting that I grep A01223:29:HY2N7DMXX:2:2306:22562:23375 from my fastq files, and when I tried it, I found it twice in each of my fastq files. The commenter also linked me to two other posts -

Out Of Disk Space With Picard Tools ? and Picard: Value was put into PairInfoMap more than once. Even bwa enabled -M. In the second post the observation was similar, and you had suggested removing the duplicates. Could you please advise me on how to do that?

Thank you

Edit: I could use uniq to remove duplicate entries from a plain-text fastq file, but my files are gzipped.
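For what it's worth, gzipped files can be streamed through zcat, so a coreutils-only sketch of the de-duplication is possible (this is not the solution eventually used in this thread, just an illustration; the demo file is fabricated). It keeps the first occurrence of each read ID, treating every four lines as one fastq record:

```shell
# Fabricated demo input: a gzipped fastq with one duplicated record
printf '@read1\nACGT\n+\nIIII\n@read1\nACGT\n+\nIIII\n@read2\nTTTT\n+\nIIII\n' \
  | gzip > /tmp/dup.fastq.gz

# On each header line (every 4th line starting at line 1), decide whether
# the read ID was seen before; 'keep' then controls whether the whole
# 4-line record is printed
zcat /tmp/dup.fastq.gz \
  | awk 'NR % 4 == 1 { keep = !seen[$1]++ } keep' \
  | gzip > /tmp/dedup.fastq.gz

# One copy of each record remains
zcat /tmp/dedup.fastq.gz | grep -c '^@read'   # prints 2
```

Note that, unlike uniq, this does not require the duplicates to be adjacent.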


What is the exact Picard command line?


Hi Pierre,

Thank you for your response. The commands used to run Picard are -

java -jar /path/to/picard/picard.jar MarkDuplicates INPUT=/path/to/bam/input.bam OUTPUT=/path/to/bam/input_marked.bam METRICS_FILE=/path/to/bam/input_metrics.txt

java -jar /path/to/picard/picard.jar MarkDuplicates REMOVE_SEQUENCING_DUPLICATES=true INPUT=/path/to/bam/input_marked.bam OUTPUT=/path/to/bam/input_deduplicated.bam METRICS_FILE=/path/to/bam/input_metrics.txt
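As an aside on the "No space left on device" errors: Picard writes intermediate spill files to the JVM temp directory, and both the JVM property and Picard's own option can redirect them to a volume with free space. A sketch, with /path/to/large/tmp as a placeholder:

```shell
# Redirect temp files via the JVM property and/or Picard's TMP_DIR option
java -Djava.io.tmpdir=/path/to/large/tmp -jar /path/to/picard/picard.jar MarkDuplicates \
    INPUT=/path/to/bam/input.bam \
    OUTPUT=/path/to/bam/input_marked.bam \
    METRICS_FILE=/path/to/bam/input_metrics.txt \
    TMP_DIR=/path/to/large/tmp
```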

24 days ago
vinayjrao ▴ 200

OK, so I found my solution after adding a few more keywords to my Google search. I ended up finding this post on SEQanswers, where a user named davstern had suggested using seqkit rmdup. I ran this locally on my laptop with 8 GB RAM; it took around 5 minutes on each fastq.gz file and successfully removed 30-45 million reads from each file. Before trying this, I also tried running dedupe.sh from the BBMap package locally on the same laptop, but it was extremely time- and memory-intensive. After obtaining the de-duplicated fastq.gz files, I am now able to run the pipeline successfully.
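The seqkit invocation would be along these lines (file names are placeholders; -n compares records by full read name rather than by sequence):

```shell
# Remove records whose read name has already been seen; run on both mates
seqkit rmdup -n input_R1.fastq.gz -o dedup_R1.fastq.gz
seqkit rmdup -n input_R2.fastq.gz -o dedup_R2.fastq.gz
```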

A big thank you to Pierre for his suggestion to remove the duplicate entries from the fastq files.

TL;DR: If you run into this error, grep the read identifier from your fastq files. If you get multiple hits in a file (multiple matches in the forward and/or the reverse reads), you will have to de-duplicate your fastq files with seqkit rmdup before aligning them to the genome.
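The grep check above can be sketched as follows. The read ID is the one from the error message; the tiny gzipped fastq built here is fabricated demo data standing in for a real file:

```shell
# Fabricated demo: a gzipped fastq containing the same read ID twice
printf '@A01223:29:HY2N7DMXX:2:2306:22562:23375 1:N:0:1\nACGT\n+\nIIII\n@A01223:29:HY2N7DMXX:2:2306:22562:23375 1:N:0:1\nACGT\n+\nIIII\n' \
  | gzip > /tmp/demo_R1.fastq.gz

# Count occurrences of the read ID; anything above 1 means the fastq
# itself contains duplicate entries
zcat /tmp/demo_R1.fastq.gz | grep -c 'A01223:29:HY2N7DMXX:2:2306:22562:23375'   # prints 2
```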


it took around 5 minutes on each fastq.gz file and was able to successfully remove 30-45 million reads from each file.

While you have marked this as solved, it is unclear why you ended up with duplicate sequence entries in your ORIGINAL files. That is not normal, and you should have checked the cause.


I was thinking the same.

Could you please post the identical records here? Two duplicate records would suffice. This might be a bigger problem.


I have deleted the raw files from my server since my issue seemed to have been solved, but I will post them as soon as possible.


Hi GenoMax,

The raw files were provided to us by our collaborators, and we will discuss this with them whenever we get the opportunity.