Cannot find output files after running MarkDuplicates with Picard tools
10 weeks ago

I have some sorted BAM files and want to mark duplicate reads using MarkDuplicates from Picard tools. All the files are in a directory named AlignmentOfTrimmed_Sam_Files, whose full path is defined below; this is also my current working directory. I have run the command several times, with minor changes each time, and each run takes about an hour, but I have never been able to find the output files.

Any suggestions?
Thanks in advance.

### Path of the directory where sorted bam files are located:

samfiles_dir = '/media/phmagdy/TOSHIBA_EXT/PhD_Data_Analysis/group3/AlignmentOfTrimmed_Sam_Files/'

### Loop over sorted bam files and markduplicates using picard tools 

for file in os.listdir(samfiles_dir):
    if file.endswith('sorted.bam'):
        inputfile = os.path.join(samfiles_dir,file)
        fileBasename = '_'.join(os.path.basename(file).rsplit('_',4)[0:3])
        !java  -Xmx20g -jar {picard_path}/picard.jar MarkDuplicates --INPUT {inputfile} \
        --OUTPUT {fileBasename}.markdup.bam \
        --METRICS_FILE {fileBasename}.metrics.txt

Here is part of the output:

MarkDuplicates starts at 2022-09-18 16:07:52.296874
16:07:53.413 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/phmagdy/miniconda3/envs/Jhm/share/picard-2.27.4-0/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Sun Sep 18 16:07:53 EET 2022] MarkDuplicates --INPUT /media/phmagdy/TOSHIBA_EXT/PhD_Data_Analysis/group3/AlignmentOfTrimmed_Sam_Files/S000021_S5424Nr_7_sorted.bam --OUTPUT S000021_S5424Nr_7.markdup.bam --METRICS_FILE S000021_S5424Nr_7.metrics.txt --MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP 50000 --MAX_FILE_HANDLES_FOR_READ_ENDS_MAP 8000 --SORTING_COLLECTION_SIZE_RATIO 0.25 --TAG_DUPLICATE_SET_MEMBERS false --REMOVE_SEQUENCING_DUPLICATES false --TAGGING_POLICY DontTag --CLEAR_DT true --DUPLEX_UMI false --FLOW_MODE false --FLOW_QUALITY_SUM_STRATEGY false --USE_END_IN_UNPAIRED_READS false --USE_UNPAIRED_CLIPPED_END false --UNPAIRED_END_UNCERTAINTY 0 --FLOW_SKIP_FIRST_N_FLOWS 0 --FLOW_Q_IS_KNOWN_END false --FLOW_EFFECTIVE_QUALITY_THRESHOLD 15 --ADD_PG_TAG_TO_READS true --REMOVE_DUPLICATES false --ASSUME_SORTED false --DUPLICATE_SCORING_STRATEGY SUM_OF_BASE_QUALITIES --PROGRAM_RECORD_ID MarkDuplicates --PROGRAM_GROUP_NAME MarkDuplicates --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --OPTICAL_DUPLICATE_PIXEL_DISTANCE 100 --MAX_OPTICAL_DUPLICATE_SET_SIZE 300000 --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Sun Sep 18 16:07:53 EET 2022] Executing as phmagdy@ubuntu on Linux 5.15.0-46-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_112-b16; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: Version:2.27.4-SNAPSHOT
INFO    2022-09-18 16:07:53 MarkDuplicates  Start of doWork freeMemory: 208248760; totalMemory: 221249536; maxMemory: 19088801792
INFO    2022-09-18 16:07:53 MarkDuplicates  Reading input file and constructing read end information.
INFO    2022-09-18 16:07:53 MarkDuplicates  Will retain up to 69162325 data points before spilling to disk.
INFO    2022-09-18 16:08:00 MarkDuplicates  Read     1,000,000 records.  Elapsed time: 00:00:06s.  Time for last 1,000,000:    6s.  Last read position: chr1:16,264,133
INFO    2022-09-18 16:08:00 MarkDuplicates  Tracking 3899 as yet unmatched pairs. 422 records in RAM.
INFO    2022-09-18 16:08:05 MarkDuplicates  Read     2,000,000 records.  Elapsed time: 00:00:11s.  Time for last
MarkDuplicates tools picard • 471 views

You probably did not write the Python code yourself; otherwise you would be familiar with this. The code above uses

--OUTPUT {fileBasename}.markdup.bam  

Does your account have permission to write to the directory the input files are in? If not, change that option to point to a directory where you can write files.
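One way to test this from the notebook is to try creating a file in the target directory. The following is a minimal sketch; `can_write_to` is a hypothetical helper name, not part of Picard or the original code.

```python
import os
import tempfile

def can_write_to(directory):
    """Return True if a file can actually be created in `directory`."""
    if not os.path.isdir(directory):
        return False
    try:
        # Create (and automatically delete) a real temporary file in the directory.
        with tempfile.NamedTemporaryFile(dir=directory):
            pass
        return True
    except OSError:
        return False

# Example: check the directory that a relative --OUTPUT path would resolve to.
print(os.getcwd(), can_write_to(os.getcwd()))
```

A relative `--OUTPUT` path such as `{fileBasename}.markdup.bam` resolves against the working directory of the process, so printing `os.getcwd()` also shows where to look for the files.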

In addition, the error is probably at the end of the log file (rather than the start that you posted above). Check in the last 25 lines and show us the error, if there is one.
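To make the end of the log easy to inspect, the command can be run through `subprocess` instead of `!java`, saving all output to a file and recording the exit code. This is a sketch under assumptions: `run_and_tail` and the log file name are hypothetical, and the commented usage reuses `picard_path`, `inputfile` and `fileBasename` from the question.

```python
import subprocess

def run_and_tail(cmd, log_path, n_lines=25):
    """Run `cmd`, save stdout and stderr to `log_path`, return (exit code, last lines)."""
    with open(log_path, "w") as log:
        result = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
    with open(log_path) as log:
        tail = log.readlines()[-n_lines:]
    return result.returncode, tail

# Hypothetical usage with the variables from the question:
# cmd = ["java", "-Xmx20g", "-jar", f"{picard_path}/picard.jar", "MarkDuplicates",
#        "--INPUT", inputfile,
#        "--OUTPUT", f"{fileBasename}.markdup.bam",
#        "--METRICS_FILE", f"{fileBasename}.metrics.txt"]
# code, tail = run_and_tail(cmd, f"{fileBasename}.markdup.log")
# print("exit code:", code)
# print("".join(tail))
```

A non-zero exit code generally means the Java process did not finish normally, even when no error message is visible in the notebook output.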


Actually, I tried creating another folder named MarkDup inside the above directory and directing the output files to it:

if os.path.exists ('/media/phmagdy/TOSHIBA_EXT/PhD-Data Analysis/group3/AlignmentOfTrimmed_Sam_Files/MarkDup') == False:
    os.makedirs('/media/phmagdy/TOSHIBA_EXT/PhD-Data Analysis/group3/AlignmentOfTrimmed_Sam_Files/MarkDup')

Here is what the code looked like:

samfiles_dir = '/media/phmagdy/TOSHIBA_EXT/PhD_Data_Analysis/group3/AlignmentOfTrimmed_Sam_Files/'
for file in os.listdir(samfiles_dir):
    if file.endswith('sorted.bam'):
        inputfile = os.path.join(samfiles_dir,file)
        fileBasename = '_'.join(os.path.basename(file).rsplit('_',4)[0:3])
        !java  -Xmx20g -jar {picard_path}/picard.jar MarkDuplicates --INPUT {inputfile} \
        --OUTPUT /media/phmagdy/TOSHIBA_EXT/PhD_Data_Analysis/group3/AlignmentOfTrimmed_Sam_Files/MarkDup/{fileBasename}.markdup.bam \
        --METRICS_FILE /media/phmagdy/TOSHIBA_EXT/PhD_Data_Analysis/group3/AlignmentOfTrimmed_Sam_Files/MarkDup/{fileBasename}.metrics.txt
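Note that, as posted, the `os.makedirs` call spells the path `PhD-Data Analysis`, while `samfiles_dir` and the `--OUTPUT` path use `PhD_Data_Analysis`. This may only be a slip in the post, but if it is in the real code, the directory being created is not the one being written to. A sketch of deriving every output path from a single variable so they cannot drift apart (`markdup_paths` is a hypothetical helper, not part of the original code):

```python
import os

def markdup_paths(input_bam, out_dir):
    """Return (output BAM path, metrics path) for one sorted BAM, creating out_dir if needed."""
    os.makedirs(out_dir, exist_ok=True)
    # Same basename rule as the loop above: keep the first three '_'-separated fields.
    base = "_".join(os.path.basename(input_bam).rsplit("_", 4)[0:3])
    return (os.path.join(out_dir, base + ".markdup.bam"),
            os.path.join(out_dir, base + ".metrics.txt"))

# Hypothetical usage inside the loop:
# out_bam, metrics = markdup_paths(inputfile, os.path.join(samfiles_dir, "MarkDup"))
```

The two returned paths can then be passed to `--OUTPUT` and `--METRICS_FILE` in place of the hard-coded strings.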

Again, I was not able to find the output files.

N.B. There was no error at the end of the execution after almost one hour. Here are the last few lines:

INFO    2022-09-18 14:58:24 MarkDuplicates  Read    41,000,000 records.  Elapsed time: 00:03:19s.  Time for last 1,000,000:    3s.  Last read position: chr8:107,782,217
INFO    2022-09-18 14:58:24 MarkDuplicates  Tracking 114840 as yet unmatched pairs. 2544 records in RAM.
INFO    2022-09-18 14:59:01 MarkDuplicates  Read    42,000,000 records.  Elapsed time: 00:03:57s.  Time for last 1,000,000:   37s.  Last read position: chr9:2,718,932
INFO    2022-09-18 14:59:01 MarkDuplicates  Tracking 114824 as yet unmatched pairs. 9314 records in RAM.
INFO    2022-09-18 14:59:57 MarkDuplicates  Read    43,000,000 records.  Elapsed time: 00:04:52s.  Time for last 1,000,000:   55s.  Last read position: chr9:66,499,605
INFO    2022-09-18 14:59:57 MarkDuplicates  Tracking 114507 as yet unmatched pairs. 6658 records in RAM.
INFO    2022-09-18 15:00:02 MarkDuplicates  Read    44,000,000 records.  Elapsed time: 00:04:57s.  Time for last 1,000,000:    4s.  Last read position: chr9:107,578,518
INFO    2022-09-18 15:00:02 MarkDuplicates  Tracking 113906 as yet unmatched pairs. 3393 records in RAM.
Time elapsed = 0:57:49.228557

That is odd. If you can create that directory, then you can write files to it. Did the metrics.txt file also not get created? Can you add --VERBOSITY DEBUG and run again to see if we get more detail in the log?


The folder created for the output is completely empty after the run: neither the .markdup.bam files nor the metrics.txt files were created.

I also tried adding --VERBOSITY DEBUG, and I think nothing changed. Here is part of the output:

MarkDuplicates starts at 2022-09-19 00:00:32.372156
00:00:33.242 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/phmagdy/miniconda3/envs/Jhm/share/picard-2.27.4-0/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Mon Sep 19 00:00:33 EET 2022] MarkDuplicates --INPUT /home/phmagdy/phd/Group_3/bam_files/S000021_S5424Nr_2_sorted.bam --OUTPUT /home/phmagdy/phd/Group_3/bam_files/MarkDup/S000021_S5424Nr_2.markdup.bam --METRICS_FILE /home/phmagdy/phd/Group_3/bam_files/MarkDup/S000021_S5424Nr_2.metrics.txt --VERBOSITY DEBUG --MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP 50000 --MAX_FILE_HANDLES_FOR_READ_ENDS_MAP 8000 --SORTING_COLLECTION_SIZE_RATIO 0.25 --TAG_DUPLICATE_SET_MEMBERS false --REMOVE_SEQUENCING_DUPLICATES false --TAGGING_POLICY DontTag --CLEAR_DT true --DUPLEX_UMI false --FLOW_MODE false --FLOW_QUALITY_SUM_STRATEGY false --USE_END_IN_UNPAIRED_READS false --USE_UNPAIRED_CLIPPED_END false --UNPAIRED_END_UNCERTAINTY 0 --FLOW_SKIP_FIRST_N_FLOWS 0 --FLOW_Q_IS_KNOWN_END false --FLOW_EFFECTIVE_QUALITY_THRESHOLD 15 --ADD_PG_TAG_TO_READS true --REMOVE_DUPLICATES false --ASSUME_SORTED false --DUPLICATE_SCORING_STRATEGY SUM_OF_BASE_QUALITIES --PROGRAM_RECORD_ID MarkDuplicates --PROGRAM_GROUP_NAME MarkDuplicates --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --OPTICAL_DUPLICATE_PIXEL_DISTANCE 100 --MAX_OPTICAL_DUPLICATE_SET_SIZE 300000 --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Mon Sep 19 00:00:33 EET 2022] Executing as phmagdy@ubuntu on Linux 5.15.0-46-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_112-b16; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: Version:2.27.4-SNAPSHOT
INFO    2022-09-19 00:00:33 MarkDuplicates  Start of doWork freeMemory: 208248040; totalMemory: 221249536; maxMemory: 19088801792
INFO    2022-09-19 00:00:33 MarkDuplicates  Reading input file and constructing read end information.
INFO    2022-09-19 00:00:33 MarkDuplicates  Will retain up to 69162325 data points before spilling to disk.
INFO    2022-09-19 00:00:43 MarkDuplicates  Read     1,000,000 records.  Elapsed time: 00:00:07s.  Time for last 1,000,000:    7s.  Last read position: chr1:16,969,312
INFO    2022-09-19 00:00:43 MarkDuplicates  Tracking 3362 as yet unmatched pairs. 570 records in RAM.
INFO    2022-09-19 00:00:49 MarkDuplicates  Read     2,000,000 records.  Elapsed time: 00:00:13s.  Time for last 1,000,000:    5s.  Last read position: chr1:38,265,674
INFO    2022-09-19 00:00:49 MarkDuplicates  Tracking 6477 as yet unmatched pairs. 679 records in RAM.
INFO    2022-09-19 00:00:55 MarkDuplicates  Read     3,000,000 records.  Elapsed time: 00:00:19s.  Time for last 1,000,000:    6s.  Last read position: chr1:78,041,750
INFO    2022-09-19 00:00:55 MarkDuplicates  Tracking 10422 as yet unmatched pairs. 784 records in RAM.
INFO    2022-09-19 00:01:04 MarkDuplicates  Read     4,000,000 records.  Elapsed time: 00:00:28s.  Time for last 1,000,000:    9s.  Last read position: chr1:144,521,647
INFO    2022-09-19 00:01:04 MarkDuplicates  Tracking 15894 as yet unmatched pairs. 1062 records in RAM.
INFO    2022-09-19 00:01:10 MarkDuplicates  Read     5,000,000 records.  Elapsed time: 00:00:33s.  Time for last 1,000,000:    5s.  Last read position: chr1:154,493,799
INFO    2022-09-19 00:01:10 MarkDuplicates  Tracking 18001 as yet unmatched pairs. 687 records in RAM.
INFO    2022-09-19 00:01:15 MarkDuplicates  Read     6,000,000 records.  Elapsed time: 00:00:39s.  Time for last 1,000,000:    5s.  Last read position: chr1:181,686,317
INFO    2022-09-19 00:01:15 MarkDuplicates  Tracking 21124 as yet unmatched pairs. 546 records in RAM.
INFO    2022-09-19 00:01:20 MarkDuplicates  Read     7,000,000 records.  Elapsed time: 00:00:43s.  Time for last 1,000,000:    4s.  Last read position: chr1:226,027,072
INFO    2022-09-19 00:01:20 MarkDuplicates  Tracking 24755 as yet unmatched pairs. 219 records in RAM.

Are your files coordinate sorted or queryname sorted?
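The sort order is recorded in the `SO` field of the `@HD` line of the BAM header, which can be printed with `samtools view -H file.bam`. A minimal sketch for reading it from that header text (`sort_order_from_header` is a hypothetical helper name):

```python
def sort_order_from_header(header_text):
    """Return the SO value from the @HD line of a SAM header, or None if absent."""
    for line in header_text.splitlines():
        if line.startswith("@HD"):
            # Header fields are tab-separated TAG:VALUE pairs after the record type.
            for field in line.split("\t")[1:]:
                if field.startswith("SO:"):
                    return field[3:]
    return None

# Example with a header line as samtools would print it:
print(sort_order_from_header("@HD\tVN:1.6\tSO:coordinate"))
```

The SAM specification defines the values `unknown`, `unsorted`, `queryname` and `coordinate` for this field.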


Yes, each BAM file has a sorted version with its index beside it. Here is a screenshot of what the files look like (image not preserved).


They are coordinate sorted
