I've got some low-coverage skim-seq BAM files (1x) and got some strange results while doing QC on them. I ran Picard CollectInsertSizeMetrics. The sequencing was Illumina paired-end, so the orientation should be FR as usual. But I got insert size histograms showing FR, TANDEM, and a mix of both FR and TANDEM. An example of the FR and TANDEM result is below:
I'm not sure how this is possible. Each BAM was generated from two FASTQ files (R1 and R2). Is it possible that R1 was paired with R1 instead of R2 during mapping? Or maybe a demultiplexing problem?
For the samples that have TANDEM in them, I also noticed that none of them make it through Picard MarkDuplicates when I try to regenerate the BAMs using the proper R1 and R2 FASTQ files. I get this error:
14:34:28.057 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/robert/tools/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Wed Dec 01 14:34:28 AST 2021] MarkDuplicates INPUT=[./CONTROL-2013-399-918-2_sorted.bam] OUTPUT=./CONTROL-2013-399-918-2_sorted_marked.bam METRICS_FILE=./CONTROL-2013-399-918-2_marked_dup_metrics.txt ASSUME_SORT_ORDER=coordinate MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true DUPLEX_UMI=false ADD_PG_TAG_TO_READS=true REMOVE_DUPLICATES=false ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Wed Dec 01 14:34:28 AST 2021] Executing as robert@robert-ThinkStation-P340 on Linux 5.10.0-1051-oem amd64; OpenJDK 64-Bit Server VM 11.0.9.1-internal+0-adhoc..src; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.25.1
INFO 2021-12-01 14:34:28 MarkDuplicates Start of doWork freeMemory: 36229840; totalMemory: 41943040; maxMemory: 16785604608
INFO 2021-12-01 14:34:28 MarkDuplicates Reading input file and constructing read end information.
INFO 2021-12-01 14:34:28 MarkDuplicates Will retain up to 60817408 data points before spilling to disk.
INFO 2021-12-01 14:34:30 MarkDuplicates Read 1,000,000 records. Elapsed time: 00:00:02s. Time for last 1,000,000: 2s. Last read position: NC_048565.1:43,242,107
INFO 2021-12-01 14:34:30 MarkDuplicates Tracking 41067 as yet unmatched pairs. 536 records in RAM.
INFO 2021-12-01 14:34:33 MarkDuplicates Read 2,000,000 records. Elapsed time: 00:00:04s. Time for last 1,000,000: 2s. Last read position: NC_048565.1:88,419,569
INFO 2021-12-01 14:34:33 MarkDuplicates Tracking 61376 as yet unmatched pairs. 352 records in RAM.
[Wed Dec 01 14:34:33 AST 2021] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 0.09 minutes.
Runtime.totalMemory()=2361393152
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.samtools.SAMException: Value was put into PairInfoMap more than once. 1: RGA01180:82:HTT3KDSX2:1:2207:19298:32346
at htsjdk.samtools.CoordinateSortedPairInfoMap.ensureSequenceLoaded(CoordinateSortedPairInfoMap.java:133)
at htsjdk.samtools.CoordinateSortedPairInfoMap.remove(CoordinateSortedPairInfoMap.java:86)
at picard.sam.markduplicates.util.DiskBasedReadEndsForMarkDuplicatesMap.remove(DiskBasedReadEndsForMarkDuplicatesMap.java:61)
at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:559)
at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:257)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:308)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)
So I'm thinking that the problem is with the FASTQ files, not with the way the BAMs were generated.
Thanks - Robert
What aligner did you use to align these files? Were the BAM files for multiple lanes merged after alignment?
Some things to consider in this thread: Markduplicates: Value Was Put Into Pairinfomap More Than Once
If your FASTQ files actually contain duplicate reads (the same read name appearing more than once), then you will need to fix that before aligning.
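To check whether a FASTQ file really does contain repeated read names, something along these lines might help. It is only a minimal sketch using the Python standard library: the function names (`open_fastq`, `read_names`, `duplicated_names`) are made up for illustration, it assumes plain 4-line FASTQ records (no wrapped sequence lines), and it keeps every read name in memory, so it may be too heavy for very large files.

```python
import gzip
from collections import Counter
from itertools import islice


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path, "rt")


def read_names(handle):
    """Yield the read name from each record, assuming 4 lines per record."""
    # Every 4th line, starting from the first, is a header line.
    for header in islice(handle, 0, None, 4):
        header = header.strip()
        if not header:
            continue  # tolerate a trailing blank line
        # Drop the leading '@' and anything after the first whitespace.
        name = header[1:].split()[0]
        # Drop a legacy /1 or /2 mate suffix if present.
        if name.endswith(("/1", "/2")):
            name = name[:-2]
        yield name


def duplicated_names(path):
    """Return {read_name: count} for names seen more than once in one file."""
    with open_fastq(path) as handle:
        counts = Counter(read_names(handle))
    return {name: n for name, n in counts.items() if n > 1}
```

Run it on R1 and R2 separately, e.g. `duplicated_names("sample_R1.fastq.gz")` (the file name here is a placeholder). An empty result means every read name in that file is unique; a name from the MarkDuplicates error message showing up with a count of 2 or more would point at the FASTQ files rather than the alignment step.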