Question

Insert size histogram from Picard for Illumina paired-end 150 bp: FR, TANDEM, and both

2.4 years ago

robertwhbaldwin • 0

I've got some low-coverage (1x) skim-seq BAM files and got some strange results while doing QC on them with Picard CollectInsertSizeMetrics. The sequencing was Illumina paired-end, so the orientation should be FR as usual. But I got insert size histograms showing FR, TANDEM, and a mix of both FR and TANDEM. An example of the FR and TANDEM result is below:

[Image: insert size histogram showing both FR and TANDEM orientations]

I'm not sure how this is possible. Each BAM was generated from two FASTQ files (R1 and R2). Is it possible that R1 was mapped with R1 instead of R1 with R2? Or maybe there was a demultiplexing problem?
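For reference, the orientation label is derived from the strand flags of the two mates: a pair is TANDEM when both mates map to the same strand, and FR or RF when they map to opposite strands. The sketch below is a simplified approximation of that rule, not Picard's actual code. The function name `pair_orientation` is hypothetical, and using the read length as a stand-in for the aligned length (no indels or clipping) is an assumption made to keep the example short.

```python
# SAM flag bits, as defined in the SAM specification.
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8
REVERSE = 0x10
MATE_REVERSE = 0x20


def pair_orientation(flag, pos, mate_pos, tlen, read_len=150):
    """Return 'FR', 'RF' or 'TANDEM' for a mapped pair, else None.

    flag, pos, mate_pos and tlen are the FLAG, POS, PNEXT and TLEN
    columns of one SAM record. read_len approximates the aligned
    length (assumption: no indels or soft clipping).
    """
    if not flag & PAIRED or flag & (UNMAPPED | MATE_UNMAPPED):
        return None  # orientation is undefined without two mapped mates

    reverse = bool(flag & REVERSE)
    mate_reverse = bool(flag & MATE_REVERSE)

    # Both mates on the same strand: this is what gets reported as TANDEM.
    if reverse == mate_reverse:
        return "TANDEM"

    # Opposite strands: compare the 5' ends of the plus- and minus-strand mates.
    if reverse:
        plus_5p = mate_pos
        minus_5p = pos + read_len - 1
    else:
        plus_5p = pos
        minus_5p = pos + tlen
    return "FR" if plus_5p < minus_5p else "RF"
```

Tallying this per sample or per read group (for example over the columns of `samtools view` output) would show whether the TANDEM pairs are confined to particular lanes or spread throughout the file.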

I also noticed that none of the samples that have TANDEM in them make it through Picard MarkDuplicates when I try to regenerate the BAMs using the proper R1 and R2 FASTQ files. I get this error:

14:34:28.057 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/robert/tools/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Wed Dec 01 14:34:28 AST 2021] MarkDuplicates INPUT=[./CONTROL-2013-399-918-2_sorted.bam] OUTPUT=./CONTROL-2013-399-918-2_sorted_marked.bam METRICS_FILE=./CONTROL-2013-399-918-2_marked_dup_metrics.txt ASSUME_SORT_ORDER=coordinate    MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true DUPLEX_UMI=false ADD_PG_TAG_TO_READS=true REMOVE_DUPLICATES=false ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Wed Dec 01 14:34:28 AST 2021] Executing as robert@robert-ThinkStation-P340 on Linux 5.10.0-1051-oem amd64; OpenJDK 64-Bit Server VM 11.0.9.1-internal+0-adhoc..src; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.25.1
INFO    2021-12-01 14:34:28 MarkDuplicates  Start of doWork freeMemory: 36229840; totalMemory: 41943040; maxMemory: 16785604608
INFO    2021-12-01 14:34:28 MarkDuplicates  Reading input file and constructing read end information.
INFO    2021-12-01 14:34:28 MarkDuplicates  Will retain up to 60817408 data points before spilling to disk.
INFO    2021-12-01 14:34:30 MarkDuplicates  Read     1,000,000 records.  Elapsed time: 00:00:02s.  Time for last 1,000,000:    2s.  Last read position: NC_048565.1:43,242,107
INFO    2021-12-01 14:34:30 MarkDuplicates  Tracking 41067 as yet unmatched pairs. 536 records in RAM.
INFO    2021-12-01 14:34:33 MarkDuplicates  Read     2,000,000 records.  Elapsed time: 00:00:04s.  Time for last 1,000,000:    2s.  Last read position: NC_048565.1:88,419,569
INFO    2021-12-01 14:34:33 MarkDuplicates  Tracking 61376 as yet unmatched pairs. 352 records in RAM.
[Wed Dec 01 14:34:33 AST 2021] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 0.09 minutes.
Runtime.totalMemory()=2361393152
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.samtools.SAMException: Value was put into PairInfoMap more than once.  1: RGA01180:82:HTT3KDSX2:1:2207:19298:32346
    at htsjdk.samtools.CoordinateSortedPairInfoMap.ensureSequenceLoaded(CoordinateSortedPairInfoMap.java:133)
    at htsjdk.samtools.CoordinateSortedPairInfoMap.remove(CoordinateSortedPairInfoMap.java:86)
    at picard.sam.markduplicates.util.DiskBasedReadEndsForMarkDuplicatesMap.remove(DiskBasedReadEndsForMarkDuplicatesMap.java:61)
    at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:559)
    at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:257)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:308)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)

So I'm thinking that the problem is with the FASTQ files, not with the way the BAMs were generated.

Thanks - Robert

Illumina bam insert-size • 1.1k views

updated 13 months ago by Ram 43k • written 2.4 years ago by robertwhbaldwin • 0

What aligner did you use to align these files? Were the BAM files for multiple lanes merged after alignment?

Some things to consider in this thread: Markduplicates: Value Was Put Into Pairinfomap More Than Once

If your FASTQ files actually contain duplicate reads, then you will need to fix that.
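One way to check this is to count how often each read name occurs within a single FASTQ file. The following is a minimal sketch, assuming standard 4-line FASTQ records and enough memory to hold all read names (reasonable for 1x skim-seq, less so for deep coverage). The function names `read_names` and `duplicated_names` are made up for illustration.

```python
import gzip
from collections import Counter


def read_names(path):
    """Yield the read name of each record in a 4-line-per-record FASTQ file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            # Header lines are every 4th line; skip a blank trailing line.
            if i % 4 == 0 and line.strip():
                # Drop the leading '@' and anything after the first whitespace.
                yield line[1:].split()[0]


def duplicated_names(path):
    """Return {read_name: count} for names seen more than once in the file."""
    counts = Counter(read_names(path))
    return {name: n for name, n in counts.items() if n > 1}
```

Running `duplicated_names` on R1 and on R2 separately should return an empty dict for each file if every read name is unique. A non-empty result for the read named in the MarkDuplicates exception would point to the FASTQ files rather than the alignment step.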

2.4 years ago by GenoMax 141k