Question: Problems with picard markduplicates
Cecelia wrote (10 months ago):

Hi, I was running Picard MarkDuplicates on a few BAM files.

java -jar /sw/bioinfo/picard/2.20.4/rackham/picard.jar MarkDuplicates INPUT=sorted.bam OUTPUT=md.bam METRICS_FILE=duplicate.txt READ_NAME_REGEX=null REMOVE_DUPLICATES=true CREATE_INDEX=true

I got no output file, and a message like this:

INFO    2019-11-20 02:53:04 MarkDuplicates  Start of doWork freeMemory: 2037715440; totalMemory: 2058354688; maxMemory: 28631367680
INFO    2019-11-20 02:53:04 MarkDuplicates  Reading input file and constructing read end information.
INFO    2019-11-20 02:53:04 MarkDuplicates  Will retain up to 103736839 data points before spilling to disk.
[Wed Nov 20 02:53:05 CET 2019] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 0.02 minutes.
Runtime.totalMemory()=2058354688
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.samtools.SAMException: Sequence name 'scaffold1,8899378,f8056Z8899378' doesn't match regex: '[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*' 
    at htsjdk.samtools.SAMSequenceRecord.validateSequenceName(SAMSequenceRecord.java:211)
    at htsjdk.samtools.SAMSequenceRecord.<init>(SAMSequenceRecord.java:94)
    at htsjdk.samtools.SAMTextHeaderCodec.parseSQLine(SAMTextHeaderCodec.java:224)
    at htsjdk.samtools.SAMTextHeaderCodec.decode(SAMTextHeaderCodec.java:114)
    at htsjdk.samtools.BAMFileReader.readHeader(BAMFileReader.java:704)
    at htsjdk.samtools.BAMFileReader.<init>(BAMFileReader.java:298)
    at htsjdk.samtools.BAMFileReader.<init>(BAMFileReader.java:176)
    at htsjdk.samtools.SamReaderFactory$SamReaderFactoryImpl.open(SamReaderFactory.java:396)
    at picard.sam.markduplicates.util.AbstractMarkDuplicatesCommandLineProgram.openInputs(AbstractMarkDuplicatesCommandLineProgram.java:220)
    at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:528)
    at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:257)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:305)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)

The reference genome is the output of LINKS v1.8.7. The header of each scaffold looks like this (scaffold,size,contig information):

>scaffold1,8899378,f8056Z8899378
>scaffold2,7251368,f8058Z7239915k15a0.13m100_f1079z11453
>scaffold3,6336565,f8055Z6291785k21a0.14m100_r10570z44780

If the problem is with the header, how should I change it without losing information?
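
To see which characters in a header fall outside the pattern quoted in the exception, a small check like the following may help. It is only a sketch: the two character classes are copied from the error message above, and the function name offending_chars is made up for illustration.

```python
import re

# Character classes copied from the regex in the htsjdk error message.
FIRST = r"[0-9A-Za-z!#$%&+./:;?@^_|~-]"
REST = r"[0-9A-Za-z!#$%&*+./:;=?@^_|~-]"
SEQ_NAME_RE = re.compile(FIRST + REST + "*")
REST_RE = re.compile(REST)


def offending_chars(name):
    """Return the characters in `name` that the pattern never allows.

    Hypothetical helper: it tests each character against the more
    permissive second class, so it does not flag '*' or '=' even though
    those are rejected as the first character of a name.
    """
    return sorted({c for c in name if not REST_RE.fullmatch(c)})


names = [
    "scaffold1,8899378,f8056Z8899378",
    "scaffold2,7251368,f8058Z7239915k15a0.13m100_f1079z11453",
]
for name in names:
    ok = SEQ_NAME_RE.fullmatch(name) is not None
    print(name, "valid" if ok else "invalid", offending_chars(name))
```

For the headers shown above, the comma is the only character outside the allowed set.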

I read in this post that setting READ_NAME_REGEX=null could solve the problem, but it did not work in my case:
https://sourceforge.net/p/samtools/mailman/message/32614448/

Any comments or suggestions would be appreciated.


Try adding a comma to the second part of the regex (the full regex would be '[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~,-]*') and using that as the READ_NAME_REGEX.
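
Note that the stack trace shows the exception being raised in htsjdk's SAMSequenceRecord.validateSequenceName while the BAM header is read, so the check appears to concern the reference sequence names rather than the read names. If changing READ_NAME_REGEX does not help, an alternative is to rename the scaffolds in the reference FASTA. The sketch below replaces the commas in header lines with a space, so the sequence name becomes "scaffold1" and the size and contig information stay in the header as a description. The function names and file paths are hypothetical, and this assumes the aligner takes the sequence name up to the first whitespace (bwa and samtools faidx do).

```python
def rename_header(line, sep=" "):
    """Replace commas in a FASTA header line; leave sequence lines alone.

    With the default sep=" ", everything after the first field becomes
    the FASTA description, so no information is dropped from the file.
    """
    if line.startswith(">"):
        return line.replace(",", sep)
    return line


def rewrite_fasta(src_path, dst_path, sep=" "):
    """Copy a FASTA file line by line, rewriting only the header lines."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(rename_header(line, sep))


# Example call (paths are placeholders, not from the original post):
# rewrite_fasta("links_scaffolds.fa", "links_scaffolds.renamed.fa")
```

An existing BAM would still carry the old names in its header, so it would need to be regenerated against the renamed reference, or have its header rewritten (for example with samtools reheader).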

written 10 months ago by RamRS