Question

bbmerge merged read smaller than original reads

5.0 years ago

snishtala03 ▴ 70

Hello,

I have paired-end (2x150 bp) RNA-Seq reads from a MiSeq for a viral genome, and I need to merge the reads for a downstream analysis. (Merging also makes sense because, when I merge, a lot of the read pairs turn out to overlap heavily.) This is the command I used -

bbmerge.sh in1=R1.fastq in2=R2.fastq out=merged.fastq outu1=R1_unmerged.fastq outu2=R2_unmerged.fastq

Here is the terminal output of bbmerge I get -

Pairs:                  3328768
Joined:                 2925342         87.881%
Ambiguous:              370409          11.128%
No Solution:            33017           0.992%
Too Short:              0               0.000%
Avg Insert:             176.0
Standard Deviation:     44.0
Mode:                   147

Insert range:           35 - 293
90th percentile:        243
75th percentile:        204
50th percentile:        167
25th percentile:        142
10th percentile:        126
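As a reading aid for these numbers: when the two mates of a pair overlap, the merged read spans the whole insert, so its length equals the insert size rather than the read length. The sketch below works through that arithmetic. The function names are hypothetical, and it assumes both mates have the same length and that no quality trimming was applied.

```python
def overlap_length(read_len, insert_size):
    """Number of bases shared by the two mates (0 if they do not overlap)."""
    # A mate cannot contribute more insert-derived bases than the insert holds.
    covered = 2 * min(read_len, insert_size)
    return max(0, covered - insert_size)


def merged_length(read_len, insert_size):
    """Length of the merged read, or None when the mates do not overlap."""
    if overlap_length(read_len, insert_size) == 0:
        return None
    return insert_size


if __name__ == "__main__":
    # Insert sizes taken from the percentile table above; 150 bp reads assumed.
    for insert in (126, 167, 243):
        print(insert, overlap_length(150, insert), merged_length(150, insert))
```

With 150 bp reads, any insert shorter than 150 bp (for example the 10th percentile of 126 above) gives a merged read that is shorter than either input read, which is expected behaviour and not an error in itself.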

Next, I use bwa to align the merged reads to my reference genome, allowing secondary alignments, and there are a lot of cases where a read aligns to multiple regions of the genome. While going over the alignments, I found some strange behaviour in the merged reads -

Merged read is smaller than the original read, for example:

@M02091:32:000000000-C28N4:1:1106:22793:14654 1:N:0:7
GTCTTTGGGTATACATTTGAACCCTAATAAAACCAAACGTTGGGGCTACTCCCTTAACTTCATGGGATATGTAATTGGAAGTTGGGGTACTTTACCACAGGAACATATTGTAATGAAACTCAAGCAATGTTTTCGGAAACTGCCTGTAAAT
+
DCEEEFFFEBFFGGGGGGGGGGHGHHHHHHHHHGHHHHGHHHHGGGGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGGGGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGGGGGHHHHHHHHHHHH 

@M02091:32:000000000-C28N4:1:1106:22793:14654 2:N:0:7
AAAGAATTGTGGGTCTTTTGGGCTTTGCTGCCCCTTTTACACAATGTGGCTATCCTGCTTTGACAGACTTTCCAATCAATAGGTCTATTTACAGGCAGTTTCCGAAAACATTGCTTGAGTTTCATTACAATATGTTCCTGTGGTAAAGTAC
+
CCDDDFFFFFFCGGGGGGGGGGHHHHHHHHHHHGHHHHHHHHHGHHGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGHGHHHHHGGGGGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHF

Merged read is -

AAAGAATTGTGGGTCTTTTGGGCTTTGCTGCCCCTTTTACACAATGTGGCTATCCTGCTTTGA

I also tried other merging tools, vsearch and flash, to compare my results. Interestingly, both flash and vsearch merge this read pair correctly (see below), but a similar problem comes up with a different read pair -

AAAGAATTGTGGGTCTTTTGGGCTTTGCTGCCCCTTTTACACAATGTGGCTATCCTGCTTTGACAGACTTTCCAATCAATAGGTCTATTTACAGGCAGTTTCCGAAAACATTGCTTGAGTTTCATTACAATATGTTCCTGTGGTAAAGTACCCCAACTTTCAATTACATAACCCATGAAGTTAAGGGAGTAGCCCCAACGTTTGGTTTTATTAGGGTTCAAATGTATACCCAAAGAC

My command line for vsearch is -

vsearch --fastq_mergepairs R1.fastq --reverse R2.fastq --eetabbedout error_stats --fastqout merged.fastq --fastqout_notmerged_fwd fw_unmerged.fastq --fastqout_notmerged_rev rev_unmerged.fastq

My command line for flash is -

flash R1.fastq R2.fastq -M 151

This question is not so much about merging as about the nature of my reads. As you can see from the example above, my R1 is reverse-complemented in the merged read, which I take to mean that R1.fastq and R2.fastq each contain a mix of forward and reverse reads. Is there a way to fix this so that all R1 reads end up in one file and all R2 reads in the other? I am trying to remove duplicates after aligning my reads, and this mix prevents the reads from being deduplicated correctly.
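One way to inspect the orientation issue described above is to check which mate a merged read starts with. The sketch below is only a diagnostic under stated assumptions: the helper names are hypothetical, it compares the first `seed` bases by exact match (so a sequencing error in the seed gives "unknown"), and it does not fix or reorder any files.

```python
# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def merged_orientation(merged, r1, r2, seed=20):
    """Report which mate the merged read begins with: 'R1', 'R2' or 'unknown'."""
    head = merged[:seed]
    if head == r1[:seed]:
        return "R1"
    if head == r2[:seed]:
        return "R2"
    return "unknown"


if __name__ == "__main__":
    # First 23 bases of R1, R2 and the flash/vsearch merged read from the post.
    r1 = "GTCTTTGGGTATACATTTGAACC"
    r2 = "AAAGAATTGTGGGTCTTTTGGGC"
    merged = "AAAGAATTGTGGGTCTTTTGGGC"
    print(merged_orientation(merged, r1, r2))
```

A merged read that starts with R2 is the reverse complement of the same insert written in R1 orientation, so the two forms carry the same information; `reverse_complement` converts between them.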

bbmerge fastq alignment flash vsearch • 3.3k views

updated 5.0 years ago by Ram 43k • written 5.0 years ago by snishtala03 ▴ 70

score 1 · Answer 1 · 2019-04-05

Which version of BBTools are you using? I just tested the example sequences you provided, and bbmerge.sh (BBTools 38.43) merged the pair correctly:

@M02091:32:000000000-C28N4:1:1106:22793:14654 1:N:0:7
GTCTTTGGGTATACATTTGAACCCTAATAAAACCAAACGTTGGGGCTACTCCCTTAACTTCATGGGATATGTAATTGGAAGTTGGGGTACTTTACCACAGGAACATATTGTAATGAAACTCAAGCAATGTTTTCGGAAACTGCCTGTAAATAGACCTATTGATTGGAAAGTCTGTCAAAGCAGGATAGCCACATTGTGTAAAAGGGGCAGCAAAGCCCAAAAGACCCACAATTCTTT
+
DCEEEFFFEBFFGGGGGGGGGGHGHHHHHHHHHGHHHHGHHHHGGGGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGHHGHHHHHHHHHGHHHHHHHHHHHGGGGGGGGGGCFFFFFFDDDCC

In addition, the merged read you showed as an example has a very strange substitution at position 160. The reads do not overlap at this position, so the consensus should match read 1. However, the consensus read has a T there, while the original read 1 has a C. Is the example you showed from vsearch or flash? Does either of them perform some form of error correction?
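The reasoning here (a position outside the overlap is supported by one read only, so the consensus there should simply copy that read) can be sketched as follows. The function name and the 150/240 values are illustrative assumptions, not measurements taken from the posted reads, and "leading" and "trailing" just mean the mate that starts the merged read and the mate that ends it.

```python
def reads_covering(pos, read_len, insert_size):
    """List the mates that cover a 1-based position of the merged read."""
    if not 1 <= pos <= insert_size:
        raise ValueError("position lies outside the merged read")
    covering = []
    # The leading mate covers the first read_len bases of the merged read.
    if pos <= min(read_len, insert_size):
        covering.append("leading")
    # The trailing mate covers the last read_len bases of the merged read.
    if pos > max(0, insert_size - read_len):
        covering.append("trailing")
    return covering


if __name__ == "__main__":
    # Hypothetical pair: 150 bp mates, 240 bp insert, so the overlap is 91..150.
    for pos in (50, 100, 160):
        print(pos, reads_covering(pos, 150, 240))
```

Under these assumed values, position 160 is covered by a single mate, so a merger that does no error correction would be expected to reproduce that mate's base there.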