Question

Removing or not removing the duplicates in .bam file

6.4 years ago

zizigolu ★ 4.4k

Hi,

Sorry, I have a list of .bam files from WGS. The maintainer says that the duplicates have been marked but not removed. I tried Picard to remove the duplicates, but I am getting an error.

The Broad Institute forum says "You have to be around for a little while longer before you can post links", so I cannot post my question there.

[fi1d18@cyan02 fi1d18]$ picard MarkDuplicates I=/temp/hgig/fi1d18/1631_WTSI-OESO_005_a_DNA/mapped_sample/HUMAN_1000Genomes_hs37d5_genomic_WTSI-OESO_005_a_DNA.dupmarked.bam O=/temp/hgig/fi1d18/1631_WTSI-OESO_005_a_DNA/mapped_sample/HUMAN_1000Genomes_hs37d5_genomic_WTSI-OESO_005_a_DNA.dupmarked1.bam M= marked-dup-metrics.txt
[Thu Mar 07 17:33:42 GMT 2019] picard.sam.markduplicates.MarkDuplicates INPUT=[/temp/hgig/fi1d18/1631_WTSI-OESO_005_a_DNA/mapped_sample/HUMAN_1000Genomes_hs37d5_genomic_WTSI-OESO_005_a_DNA.dupmarked.bam] OUTPUT=/temp/hgig/fi1d18/1631_WTSI-OESO_005_a_DNA/mapped_sample/HUMAN_1000Genomes_hs37d5_genomic_WTSI-OESO_005_a_DNA.dupmarked1.bam METRICS_FILE=marked-dup-metrics.txt MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag REMOVE_DUPLICATES=false ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Thu Mar 07 17:33:42 GMT 2019] Executing as fi1d18@cyan02 on Linux 2.6.32-754.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_51-b16; Picard version: 2.8.3-SNAPSHOT
INFO 2019-03-07 17:33:42 MarkDuplicates Start of doWork freeMemory: 2012347496; totalMemory: 2027945984; maxMemory: 3817865216
INFO 2019-03-07 17:33:42 MarkDuplicates Reading input file and constructing read end information.
INFO 2019-03-07 17:33:42 MarkDuplicates Will retain up to 14684096 data points before spilling to disk.
WARNING: BAM index file /temp/hgig/fi1d18/1631_WTSI-OESO_005_a_DNA/mapped_sample/HUMAN_1000Genomes_hs37d5_genomic_WTSI-OESO_005_a_DNA.dupmarked.bam.bai is older than BAM /temp/hgig/fi1d18/1631_WTSI-OESO_005_a_DNA/mapped_sample/HUMAN_1000Genomes_hs37d5_genomic_WTSI-OESO_005_a_DNA.dupmarked.bam
[Thu Mar 07 17:33:42 GMT 2019] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=2027945984
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.samtools.SAMFormatException: SAM validation error: ERROR: Record 3752, Read name HX3_22030:3:2114:23155:23319, bin field of BAM record does not equal value computed based on alignment start and end, and length of sequence to which read is aligned
at htsjdk.samtools.SAMUtils.processValidationErrors(SAMUtils.java:448)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:665)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:650)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:620)
at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:569)
at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:543)
at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:438)
at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:222)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:205)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:94)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:104)
[fi1d18@cyan02 fi1d18]$
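
For readers puzzled by the "bin field" wording in the exception: each BAM record stores a bin number derived from the alignment's start and end, and the run above uses VALIDATION_STRINGENCY=STRICT (visible in the logged options), so a record whose stored bin disagrees with the recomputed one is rejected. The sketch below is a shell transcription of the reg2bin routine from the SAM specification, offered as an illustration only. It is not Picard's own code, and the coordinates in the usage note are made up.

```shell
# reg2bin BEG END
# BEG is 0-based inclusive, END is 0-based exclusive, following the
# reg2bin routine in the SAM specification. Prints the bin number.
reg2bin() {
  beg=$1
  end=$(( $2 - 1 ))
  if   [ $(( beg >> 14 )) -eq $(( end >> 14 )) ]; then echo $(( 4681 + (beg >> 14) ))
  elif [ $(( beg >> 17 )) -eq $(( end >> 17 )) ]; then echo $(( 585 + (beg >> 17) ))
  elif [ $(( beg >> 20 )) -eq $(( end >> 20 )) ]; then echo $(( 73 + (beg >> 20) ))
  elif [ $(( beg >> 23 )) -eq $(( end >> 23 )) ]; then echo $(( 9 + (beg >> 23) ))
  elif [ $(( beg >> 26 )) -eq $(( end >> 26 )) ]; then echo $(( 1 + (beg >> 26) ))
  else echo 0
  fi
}
```

For example, `reg2bin 0 100` prints 4681, the first of the smallest (16 kb) bins. Picard's VALIDATION_STRINGENCY option also accepts LENIENT and SILENT, which may let a run continue past a mismatch like this one; whether that is appropriate depends on why the stored bin is off in the first place.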

How can I find out whether the duplicates have already been removed? I may be attempting something nonsensical, because I do not understand what this error means at all.
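
One way to check whether duplicates are still present is to count the records whose FLAG has the duplicate bit (0x400, decimal 1024) set. The helper below is a minimal sketch that reads SAM text on stdin; `count_dups` is a name made up for this example, and it assumes tab-separated SAM records with the FLAG in the second column.

```shell
# Count SAM records whose FLAG has the duplicate bit (0x400 = 1024) set.
# Reads SAM text on stdin; header lines (starting with '@') are skipped.
count_dups() {
  awk -F'\t' '
    !/^@/ { total++; if (int($2 / 1024) % 2 == 1) dup++ }
    END   { printf "%d duplicates out of %d reads\n", dup, total }
  '
}
```

With samtools installed, something like `samtools view in.bam | count_dups` (where `in.bam` is a placeholder) would feed it a real file, and `samtools view -c -f 1024 in.bam` should report the same duplicate count directly. A non-zero count means duplicates are marked and still in the file. A count of zero means either they were removed or they were never marked, which the flags alone cannot distinguish. If removal is what you want, the REMOVE_DUPLICATES option shown as false in the log above can be set to true.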

picard GATK RNA-Seq WGS • 2.4k views

File /home/local/software/picard-tools/2.8.3/reference.dict not found

6.4 years ago by WouterDeCoster 48k

But reference.dict is supposed to be the output of this command :(

6.4 years ago by zizigolu ★ 4.4k

You are using the jar from /local/software/picard-tools/2.8.3/jarlib/picard.jar... does /home/local/software/picard-tools/2.8.3/ exist?

6.4 years ago by WouterDeCoster 48k

Yes, it does; however, this was an intermediate step for using GATK.

6.4 years ago by zizigolu ★ 4.4k

The error message is very explicit about what is wrong.

6.4 years ago by WouterDeCoster 48k

Please pick a more descriptive title for your question(s)!

6.4 years ago by WouterDeCoster 48k

Sorry, I have a list of .bam files from WGS. The maintainer says that the duplicates have been marked but not removed.

6.4 years ago by zizigolu ★ 4.4k

score 0 · Answer 1 · 2019-03-04

6.4 years ago

Asaf 10k

See here: https://gatkforums.broadinstitute.org/gatk/discussion/1601/how-can-i-prepare-a-fasta-file-to-use-as-reference

score 0 · Answer 2 · 2019-03-04

6.4 years ago

zizigolu ★ 4.4k

The problem was that I was using O when I should have used OUTPUT :(