Problems with Picard MarkDuplicates
4.4 years ago
Cecelia ▴ 30

Hi, I was running Picard MarkDuplicates on a few BAM files with the following command:

java -jar /sw/bioinfo/picard/2.20.4/rackham/picard.jar MarkDuplicates INPUT=sorted.bam OUTPUT=md.bam METRICS_FILE=duplicate.txt READ_NAME_REGEX=null REMOVE_DUPLICATES=true CREATE_INDEX=true

I got no output file, only messages like these:

INFO    2019-11-20 02:53:04 MarkDuplicates  Start of doWork freeMemory: 2037715440; totalMemory: 2058354688; maxMemory: 28631367680
INFO    2019-11-20 02:53:04 MarkDuplicates  Reading input file and constructing read end information.
INFO    2019-11-20 02:53:04 MarkDuplicates  Will retain up to 103736839 data points before spilling to disk.
[Wed Nov 20 02:53:05 CET 2019] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 0.02 minutes.
Runtime.totalMemory()=2058354688
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.samtools.SAMException: Sequence name 'scaffold1,8899378,f8056Z8899378' doesn't match regex: '[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*' 
    at htsjdk.samtools.SAMSequenceRecord.validateSequenceName(SAMSequenceRecord.java:211)
    at htsjdk.samtools.SAMSequenceRecord.<init>(SAMSequenceRecord.java:94)
    at htsjdk.samtools.SAMTextHeaderCodec.parseSQLine(SAMTextHeaderCodec.java:224)
    at htsjdk.samtools.SAMTextHeaderCodec.decode(SAMTextHeaderCodec.java:114)
    at htsjdk.samtools.BAMFileReader.readHeader(BAMFileReader.java:704)
    at htsjdk.samtools.BAMFileReader.<init>(BAMFileReader.java:298)
    at htsjdk.samtools.BAMFileReader.<init>(BAMFileReader.java:176)
    at htsjdk.samtools.SamReaderFactory$SamReaderFactoryImpl.open(SamReaderFactory.java:396)
    at picard.sam.markduplicates.util.AbstractMarkDuplicatesCommandLineProgram.openInputs(AbstractMarkDuplicatesCommandLineProgram.java:220)
    at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:528)
    at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:257)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:305)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)

The reference genome is the output from LINKS v1.8.7. The header of each contig looks like this (scaffold,size,contig information):

>scaffold1,8899378,f8056Z8899378
>scaffold2,7251368,f8058Z7239915k15a0.13m100_f1079z11453
>scaffold3,6336565,f8055Z6291785k21a0.14m100_r10570z44780

If the problem is the header, how should I change it without losing information?

I read in this post that setting READ_NAME_REGEX=null could solve the problem, but it did not work in my case.
https://sourceforge.net/p/samtools/mailman/message/32614448/

Any comments or suggestions will be appreciated.


Try adding a comma to the second character class of the regex (the full regex would be '[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~,-]*') and using that as the READ_NAME_REGEX.
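
A note on the regex: the stack trace shows the exception is raised in SAMSequenceRecord.validateSequenceName while htsjdk parses the @SQ lines of the BAM header. That points at the reference sequence names, not the read names that READ_NAME_REGEX is matched against. If that reading is right, one option is to replace the commas in the FASTA headers with a character the regex allows and then realign against the renamed reference. A minimal sketch, assuming a hypothetical input file scaffolds.fa and '|' as the replacement character (both are illustrative choices, and rename_fasta_headers is a made-up helper name):

```shell
# Replace commas in FASTA header lines only; sequence lines are left untouched.
# The '|' replacement is an illustrative choice: it appears in the allowed
# character classes of the regex quoted in the error message.
rename_fasta_headers() {
    sed '/^>/ s/,/|/g' "$1"
}

# Example usage (hypothetical file names):
# rename_fasta_headers scaffolds.fa > scaffolds.renamed.fa
```

Because only the separator changes, the three fields (scaffold, size, contig information) stay recoverable by splitting on '|'. Quote the new names on the command line, since '|' is the pipe character in the shell.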
