Question

Strange pattern of missing merged reads using SeqPrep

3.3 years ago

jessicaathomas • 0

Hello, I was wondering if someone could help me?

I've been trying to adapter-trim and merge my dataset using SeqPrep, but when I plot the read lengths after merging, I'm missing most of the reads between 40 and 50 bp. I can't work out why, or whether I'm doing something wrong!

So the read length plots resemble a curve with a stepped gap.
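For anyone wanting to reproduce the read length plot, here is a minimal sketch of how the length distribution could be tallied from a merged FASTQ file. It assumes strict 4-line records (no wrapped sequences); the function names `length_histogram` and `print_histogram` are made up for illustration and are not part of SeqPrep.

```python
from collections import Counter
from itertools import islice


def length_histogram(fastq_lines):
    """Count sequence lengths in an iterable of FASTQ lines (4 lines per record)."""
    # The sequence is the 2nd line of every 4-line record.
    sequences = islice(fastq_lines, 1, None, 4)
    return Counter(len(seq.strip()) for seq in sequences)


def print_histogram(hist, width=50):
    """Print one text bar per read length, scaled to the most common length."""
    if not hist:
        return
    peak = max(hist.values())
    for length in range(min(hist), max(hist) + 1):
        count = hist.get(length, 0)
        print(f"{length:4d} {count:8d} {'#' * (count * width // peak)}")
```

To use it on the merged output, pass an open file handle, e.g. `length_histogram(open("L120_NeutCap_2.qual.merged.fastq"))`. If the file turns out to be gzip-compressed, open it with `gzip.open(path, "rt")` instead.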

I'm running SeqPrep as follows:

SeqPrep -f L120_1.qual.fastq -r L120_2_.qual.fastq -1 L120-R1.qual.unmerged.fastq -2 L120-R2.qual.unmerged.fastq -3 L120_NeutCap_2-R1.qual.discarded.fastq -4 L120_NeutCap_2-R2.qual.discarded.fastq -L 30 -q 15 -A AGATCGGAAGAGCACACGTC -B GGAAGAGCGTCGTGTAGGGA -s L120_NeutCap_2.qual.merged.fastq -E L120_NeutCap_2.qual.readable_alignment.txt -o 10

You'll notice that while the first adapter is the standard Illumina one, the second is a modified one, missing the first 5 bp. You can see both adapters present in the file if you grep the sequences (indicated below with [xx])…

Read 1, quality trimmed, for L120_2 above:

@HISEQ:268:C8TMGANXX:2:1101:1430:1965 1:N:0:NTCGTCGGNCGCAACG
CAGGCACTCCCTGGAAACTCTAAGGGGCAGTTCTACTCT[AGATCGGAAGA]
+
A@B0BGGGGGGGCFGGGGGGGGGGGEGGGGGGGGGGCGG@1E@FGD/CEF

@HISEQ:268:C8TMGANXX:2:1101:1457:1992 1:N:0:TTCGTCGGNCGCAACG
CTAGACCGCGAATACACACA[AGATCGGAAGAGCACACGTCTGAACTCCAG]
+
33<<BGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGBGGGGGGGG

@HISEQ:268:C8TMGANXX:2:1101:1684:1955 1:N:0:TTCGTCGGCCGCAACG
NTGATATGTCCGGAGTGCATCGTATGGCGCTTTCAATGAATTTG[AGATCG]
+
#3<<@EGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGGGGG

@HISEQ:268:C8TMGANXX:2:1101:1619:1977 1:N:0:TTCGTCGGCCGCAACG
CGGTGCCATCGAGCCTGTTCTGTCTCATAGTGACCCT[AGATCGGAAGAGC]
+
33@>@GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

@HISEQ:268:C8TMGANXX:2:1101:1574:1983 1:N:0:TTCGTCGGCCGCAACG
CCATCCTAGTGGGGGGAAAT[AGATCGGAAGAGCACACGTCTGAACTCCAA]
+
<330<E1EFFCGGGGGFGECDGEGGFGBDCDDGEGGGGCD0DDCDG=EBC

Read 2, quality trimmed, for L120_2 above.

@HISEQ:268:C8TMGANXX:2:1101:1430:1965 2:N:0:NTCGTCGGNCGCAACG
AGAGTAGAACTGCCCCNNNNAGTTTCCAGGGAGTGCCTG[GGAAGAGCGTC]
+
BB@BBGGDFGGGGGGG####==EFGDFFGGGGGGGGGGGGEGGGGGGGGF

@HISEQ:268:C8TMGANXX:2:1101:1457:1992 2:N:0:TTCGTCGGNCGCAACG
TGTGTGTATTCGCGGTCTATGGAAGAGCGTCGTGTAG[GGAAAGAGTGTCG]
+
CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

@HISEQ:268:C8TMGANXX:2:1101:1684:1955 2:N:0:TTCGTCGGCCGCAACG
CAAATTCATTGAAAGNNNNNTACGATGCACTCCGGACATATCAT[GGAAGA]
+
CCCCCGGGGGGGGGG#####@=EFGGGGGGGGGGGGGGGGGGGGGGGGGG

@HISEQ:268:C8TMGANXX:2:1101:1619:1977 2:N:0:TTCGTCGGCCGCAACG
AGGGTCACTATGAGACAGAACAGGCTCGATGGCACCT[GGAAGAGCGTCGT]
+
CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

@HISEQ:268:C8TMGANXX:2:1101:1574:1983 2:N:0:TTCGTCGGCCGCAACG
ATTTCCCCCCACTAGGATGT[GGAAGAGCGTCGTGTAGGGAAAGAGTGTCG]
+
BCCCCGGGGGDGGGGGGGGGGGGGGGGGGGDGGGGGGGGGGGGGGGGGFG
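The grep check above can be made a little more quantitative. The sketch below finds where an adapter starts in each read, including the case where only the first few bases of the adapter fit before the 3' end, and tallies those positions. The start position equals the insert length the adapter implies. This is exact string matching only, an assumption made for simplicity, and is not how SeqPrep itself aligns adapters. The names `adapter_start`, `implied_insert_lengths` and the `min_match` threshold are hypothetical.

```python
from collections import Counter


def adapter_start(read, adapter, min_match=6):
    """Return the leftmost index where `adapter` begins in `read`, or None.

    The adapter may run off the 3' end of the read, as long as at least
    `min_match` bases of it are present. Exact matching only.
    """
    for i in range(len(read) - min_match + 1):
        # Slice is shorter than the adapter when it runs off the read end.
        tail = read[i:i + len(adapter)]
        if adapter.startswith(tail):
            return i
    return None


def implied_insert_lengths(reads, adapter, min_match=6):
    """Tally adapter start positions, which equal the implied insert lengths."""
    starts = (adapter_start(r, adapter, min_match) for r in reads)
    return Counter(s for s in starts if s is not None)
```

Run on the input reads (with the `[ ]` markers removed), the tally shows how many reads carry an adapter starting at each position. Comparing the counts for positions 40 to 50 against the merged-length histogram would show whether reads of that size are present in the input but absent from the merged output.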

So I think the adapter sequences are correct, but I can't explain why there's a dip in the read length frequency. Is this a quirk of SeqPrep? Can anyone offer any explanation?

I should also add that the depth of this dip differs between my samples (i.e. some samples have barely any reads between 40 and 50 bp, whereas others have barely any missing). The only thing that differs between samples is the 8 bp index, found within the adapter sequence. I'm not sure how SeqPrep removes the adapter sequence, but I don't think this should affect it? Again, any thoughts welcome.
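To put a number on how the dip differs between samples, one option is to compute, per sample, the fraction of merged reads falling in the 40 to 50 bp window. This is a rough sketch: the input is assumed to be a mapping from sample name to a `{length: count}` histogram, and `window_fraction` and `compare_samples` are illustrative names only.

```python
def window_fraction(hist, lo=40, hi=50):
    """Fraction of reads whose length falls in [lo, hi], given {length: count}."""
    total = sum(hist.values())
    if total == 0:
        return 0.0
    in_window = sum(n for length, n in hist.items() if lo <= length <= hi)
    return in_window / total


def compare_samples(histograms, lo=40, hi=50):
    """Map each sample name to its window fraction, sorted from lowest to highest."""
    fractions = {name: window_fraction(h, lo, hi) for name, h in histograms.items()}
    return dict(sorted(fractions.items(), key=lambda item: item[1]))
```

The raw fraction also depends on each sample's overall fragment length distribution, so it is only a first-pass way to rank samples by how deep the gap is.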

Many thanks!

sequencing seqPrep merging reads • 459 views

I should add that when I've merged reads using FLASH, I don't see this gap, and there are many more merged reads - but I'd like to use SeqPrep in order to match previous studies' methods. I just can't figure out what is going on!

3.3 years ago by jessicaathomas