Question

bbmerge (bbmap) ~ error with insert size file output

0

Entering edit mode

3 months ago

chrisk • 0

Greetings Biostars community,

We have one RNA sequencing, paired-end, 150bp file which we would like to analyse for average insert size.

It is confirmed (and solved in another thread) that we have significant overlapping paired end reads.

A side question we are chasing is to understand this overlapping issue a bit better to improve the wet lab.

We are thus using bbmerge (bbmap) to check the insert size of each read pair.

This is the code:

bbmerge.sh in1=ZZZ_R1.fastq in2=ZZZ_R2.fastq out=zzz_S248_merged.fastq outu1=zzz_S248_unmerged_R1.fastq outu2=zzz_248_unmerged_R2.fastq outinsert=zzz_248_insert.txt

This is our Java version:

openjdk version "11.0.13" 2021-10-19

The showhiststats which prints to screen is correct.

However the "--outinsert=<file> (outi) File to write read names and insert sizes" lists an array of numbers per read pair, that does not resemble the headers (#id,numericID,insert,status,mismatches). Shown below:

This is the --outinsert:

#id     numericID       insert  status  mismatches
[1.4, 0.11825006, 0.1271383, 0.51, 0.51, 1.0760001, 0.39759037, 0.2, 0.2, 0.016666668, 0.2, 1.1600001, 0.19354838, 0.6261682, 0.2, 0.5208333, 0.54545456, 1.4959016, 0.7125, 0.0, 0.73913044, 0.18056303, 1.0000015]
[1.3000001, 0.7025076, 0.27375388, 0.51, 0.51, 0.82400006, 0.65753424, 0.2, 0.2, 0.005729167, 0.2, 1.072, 0.4047619, 0.8264642, 0.2, 0.54705876, 0.7916667, 1.5382459, 2.25625, 0.0, 0.740458, 0.8629876, 1.0000015]
[1.2, 0.03901675, 0.109361805, 0.51, 0.51, 0.8880001, 0.61538464, 0.2, 0.2, 0.006875, 0.2, 1.184, 0.10714286, 0.42028987, 0.2, 0.40833333, 0.2857143, 1.398717, 0.2375, 0.0, 0.9310345, 0.037989855, 1.0000015]
[1.1, 0.047905006, 0.036053997, 0.51, 0.51, 1.0680001, 0.4117647, 0.2, 0.2, 0.015714286, 0.2, 1.172, 0.15254237, 0.54545456, 0.2, 0.48333332, 0.44444445, 1.4603845, 0.475, 0.0, 0.8, 0.019994915, 1.0000015]
[1.5, 0.028471988, 0.022546485, 0.13, 0.13, 0.512, 0.6621622, 0.2, 0.2, 0.005612245, 0.2, 0.79200006, 0.35897437, 0.7241379, 0.2, 0.35892853, 0.6666667, 1.3513445, 1.1874999, 0.0, 0.77952754, 0.04518783, 1.0000015]
[1.2, 0.030128494, 0.044942245, 0.51, 0.51, 0.63600004, 0.74093264, 0.2, 0.2, 0.0038461538, 0.2, 1.1880001, 0.09090909, 0.42028987, 0.2, 0.49, 0.2857143, 1.4842912, 0.2375, 0.0, 0.96644294, 0.072181046, 1.0000015]
[1.2, 0.115287304, 0.11825004, 0.51, 0.51, 0.75600004, 0.69325155, 0.2, 0.2, 0.0048672566, 0.2, 1.128, 0.2857143, 0.7241379, 0.2, 0.50249994, 0.6666667, 1.4952222, 1.1874999, 0.0, 0.8507463, 0.13936865, 1.0000015]
[1.2, 0.030128494, 0.036053993, 0.51, 0.51, 0.69600004, 0.71910113, 0.2, 0.2, 0.004296875, 0.2, 1.176, 0.13793103, 0.49044585, 0.2, 0.42499998, 0.375, 1.4189031, 0.35625, 0.0, 0.94160587, 0.057184458, 1.0000015]
[1.4, 0.030128494, 0.036053997, 0.51, 0.51, 0.78000003, 0.6815287, 0.2, 0.2, 0.005140187, 0.2, 1.12, 0.30555555, 0.7411003, 0.2, 0.49999997, 0.6875, 1.4923291, 1.3062499, 0.0, 0.83076924, 0.045787394, 1.0000015]

We tried different sized paired fastq files as separate runs, and then a shorter paired fastq file where we changed the headers.

This is what prints to screen for the shorter file, and the --outinsert remains as in the example above:

java -ea -Xmx1000m -Xms1000m -Djava.library.path=bbmap/jni/ -cp bbmap/current/ jgi.BBMerge in1=ZZZ_R1.fastq in2=ZZZ_R2.fastq out=zzz_S248_merged.fastq outu1=zzz_S248_unmerged_R1.fastq outu2=zzz_248_unmerged_R2.fastq outi=zzz_248_insert.txt
Picked up _JAVA_OPTIONS: -Dawt.useSystemAAFontSettings=on -Dswing.aatext=true
Executing jgi.BBMerge [in1=ZZZ_R1.fastq, in2=ZZZ_R2.fastq, out=zzz_S248_merged.fastq, outu1=zzz_S248_unmerged_R1.fastq, outu2=zzz_248_unmerged_R2.fastq, outinsert=zzz_248_insert.txt]
Version 39.06

Writing mergable reads merged.
Started output threads.
Total time: 0.096 seconds.


Pairs:                  10
Joined:                 10              100.000%
Ambiguous:              0               0.000%
No Solution:            0               0.000%
Too Short:              0               0.000%


Avg Insert:             185.1
Standard Deviation:     30.5
Mode:                   169


Insert range:           149 - 231
90th percentile:        227
75th percentile:        198
50th percentile:        169
25th percentile:        153
10th percentile:        149

Has anyone encountered this before and know where things have gone wrong? We have seen what should be the output in this thread: What are the definitions of the BBMerge outinsert table file?

Much appreciated if anyone can assist, Chris.

bbtools bbmerge bbmap • 483 views

ADD COMMENT • link updated 3 months ago by GenoMax 145k • written 3 months ago by chrisk • 0

1

Entering edit mode

Ah, this is a little embarrassing; those are the vectors for the neural network. I overloaded a field that was being used for the insert size output to use for that vector when I added support for that... unfortunately, it looks like you'll need to go back to v39.01 to restore the "outi" functionality. I'll try to re-enable that for the next release (or at least add renaming read headers with the insert size).

It's worth noting, however, that adding neural network support dropped the false positive merge rate by ~50% at the same true positive rate, compared to the prior implementation.

ADD REPLY • link 3 months ago by Brian Bushnell 20k

0

Entering edit mode

Hi Brian,

Thank you and apologies for the extra future revision works.

Out of curiosity we disabled neural networks 'nn=f' and the file only had the header (#id numericID insert status mismatches).

We will roll back to version 39.01 and test it.

This is our pipeline:

The project is 300 150bp Paired-end RNA fastq samples sequenced on NovaSeq6000.

Our pipeline was Trimmomatic, HISAT2, UMI-tools Dedup, BAM Utils clipOverlap (A large percentage of PE reads (~90%) are overlapping), MAJIQ.

MAJIQ developers confirmed that we needed to remove duplicate splicing events from quantification, caused by the PE overlap. After performing clipoverlap, we were interested then in relaying insert size information to the wetlab team for future works.

Thus we decided to convert our clipOverlap output bam files (Proper pairs and overlapped only) back to fastq format and pass through your bbmerge program using 'ihist' and 'outinsert' to obtain the required data.

Thanks in advance and thank you for the fast response. Chris

ADD REPLY • link 3 months ago by chrisk • 0

1

Entering edit mode

It may be best to go back to the original data and then do bbmerge.sh on it to get original insert size estimates (before any operations happened on the data).

ADD REPLY • link 3 months ago by GenoMax 145k

0

Entering edit mode

It is confirmed (and solved in another thread) that we have significant overlapping paired end reads.

Can you provide a reference to said thread?

ADD REPLY • link 3 months ago by GenoMax 145k

0

Entering edit mode

Hi Genomax,

This is the thread:

Merge overlapping paired end reads from BAM file.

Cheers, Chris

ADD REPLY • link 3 months ago by chrisk • 0