Hi there,
We recently had some DNA samples sequenced at Novogene UK: amplicon sequencing of the V3–V4 subregions of the 16S rRNA gene.
The final sequencing data was provided to us in 'RawData' and 'CleanData' formats.
The main thing I’m confused about is that the ‘RawData’ doesn’t actually seem to be raw, and I couldn’t find any information about exactly what processing has been done on the ‘RawData’. I’ve sent Novogene an email, but no reply yet.
For 16S amplicon data, Novogene does 2 x 250 bp paired-end sequencing on the Illumina NovaSeq 6000 system.
However, the length of the ‘RawData’ reads they sent (at least for my samples) isn’t 250 nt. It’s 227 nt for forward reads, and 224 for reverse reads.
Based on this, I am assuming they have already trimmed the primers away from the sequences, and an extra 6 nt in each case too (see calculations below):
F primer: CCTAYGGGRBGCASCAG (17 nt)
R primer: GGACTACNNGGGTATCTAAT (20 nt)
Forward, primer only: 250 – 17 = 233
Reverse, primer only: 250 – 20 = 230
Forward, primer plus 6 nt: 250 – 17 – 6 = 227
Reverse, primer plus 6 nt: 250 – 20 – 6 = 224
The extra 6 nt could have been taken from the 5’ or 3’ end of the read, or a combination of both. We don’t know.
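As a quick sanity check on the arithmetic above, here is a minimal Python sketch. It assumes the nominal read length really is 250 nt and that the whole primer was removed from each read; the variable and function names are just for illustration.

```python
# Nominal read length and primers as stated in the post.
READ_LEN = 250
F_PRIMER = "CCTAYGGGRBGCASCAG"
R_PRIMER = "GGACTACNNGGGTATCTAAT"

# Read lengths observed in the 'RawData' files.
OBSERVED = {"forward": 227, "reverse": 224}


def unexplained_trim(read_len, primer, observed_len):
    """Bases missing from a read beyond what primer removal accounts for."""
    return read_len - len(primer) - observed_len


extra = {
    "forward": unexplained_trim(READ_LEN, F_PRIMER, OBSERVED["forward"]),
    "reverse": unexplained_trim(READ_LEN, R_PRIMER, OBSERVED["reverse"]),
}
print(extra)
```

Both directions come out 6 nt short, which is what makes me think a fixed-length barcode or spacer was removed along with the primer, but that is a guess.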
When I ran DADA2 with no extra trimming, most of the forward and reverse reads merged successfully. That suggests the initial trimming was done (at least mostly) at the beginning of the reads, since there is evidently still enough overlap at the other end for merging.
Even though the primers appear to have been trimmed away already, I found primer sequences in some of the reads, but not right at the start, which is weird to me (see below). When we did 16S sequencing with a different company earlier (Macrogen), the primers were left in the reads when we received the data, but they were right at the beginning in each case, so could easily be removed by trimming away the first 20 or so bases. The situation here seems a bit different.
For example, the primer sequence CCTAYGGGRBGCASCAG is found in these forward reads:
@A01426:481:HFYHFDRX2:1:2103:18222:23938 1:N:0:TACGACGT+CCATGAAC
GGCGACGATCCTTAGCTGGTCTGAGAGGATGATCAGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTGGGGAATCTTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCGTGAGTGATGAAGGCGTTAGGGGGGTAAAGCTCTTTTGGCCGGGAGGATGATGGCAGTGCCGGGCGGGTCTGGTCGGGGGGCGGCGGGGGCGGGGGGG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFF,FFFFFFFFFFFFFFFF,FF:FFFFFF,:FFFF:FFFFF,F,FFFFFF::FFFFFFFFFFFFFF,FF:FFF:FFF,:FFF,F:F:F,,F::FF,,,,:FFF,,F,,,:F,,,,,F,,,:,FFF,,F:,,,:F,,,F:,,,F:F,F,,,,:F,F,F:F,,,,,:,,FFF:,:FF,F,,
@A01426:481:HFYHFDRX2:1:2115:12843:7294 1:N:0:TACGACGT+CCATGAAC
ATACCCCCGTAGTCCCGAAACAGATACCCATGTAGTCCGCTCATAGAGAGGGATGCTCTTCCGATGTCCTATGGGAGGCAGCAGTGGGGAATCTTGCACAATGGAGGAAACTCTGATGCAGCGATGCCGCGTGAGTGAAGACGGCCTTTGGGTTGTAAAGCTCTTTTGTAGGGGAAGATAATGACTGTAACCTAAGAATAAGGTCCGGCTAACTTCGTGCCCGCAGC
+
FFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFF:FFFFFF,FFFFFF:FF:FFFFFFFFFFF,FF:,F,F,F:FF:FF:FFFFFFFFFFFFFFFF::FFFFFFF,FFFFFF:FF,FF:F,,FFFFF,,FFFF:FFFF,FFFFFFF:,FFFFF
And the primer sequence GGACTACNNGGGTATCTAAT is found in these reverse reads:
@A01426:481:HFYHFDRX2:1:2128:10022:6543 2:N:0:TACGACGT+CCATGAAC
GTTTCGGGACTACACGGGTATCTAATCCTGTTTGATCCCCACGCTTTCGTGTCTCAGCGTCAGTTACAATTTAGCAAGCTGCCTTCGCAATCGGTGTTCTGTGTGATCTCTAAGCATTTCACCGCTACACCACACATTCCGCCTACTTCAATTGTACTCAAGAATATCAGTTTCTATGGCAGTTCTACAGTTAAGCTGTAGGCTTTCACCACTGACTTAATACC
+
FFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFF:F:FFF,FFFFFFF:FFFFFFFFFFF:FFFF:FFFF
@A01426:481:HFYHFDRX2:1:2134:2871:31203 2:N:0:TACGACGT+CCATGAAC
TCTGCTGCATCCAGGAGGCTGATCGAGTTGTTTAGGGACTACGAGGGTATCTAATACTGTTTGCTCCCCACGCTTTCGTGCATGAGCGTCATTGTTATCCCAGGGGGCTGCCTTCGCCAATGGTATTCCTCCACAGATCTACGCATTTCACTGCTAAACGTGGAATTCTACAACCCTATGACACACACTAGATATACAGTCACATGCGCAATACCCAGGTTAAG
+
,FFF:FFFFF:F::,F,,,FF:FFFF:,:,,F,,FF:FF,F:FF,,FFFFF,F:F,:F,F,F:,:FFFFF:,,FF,,,F::F:,F,FF,FF,,::FFF:FFF:F,FFF:,::F,,FF,F,FF:,FFFF,FFF,FF,FF,F:FFFF,FFF:,F,FFF,F,FFFF,FF,F,FF,,FFF,,F,F:FFF,,,F:,::FFFFF,F,FFF:FF,F,FF,,F,FF,F:F:F
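One way to check where the primers sit in each read is to expand the degenerate IUPAC codes into a regular expression and report the match offsets. Below is a minimal Python sketch, assuming exact matching with no mismatches or indels. The function names are my own, and the two test strings are short fragments cut from the reads above, not whole reads.

```python
import re

# IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def primer_regex(primer):
    """Compile a degenerate primer into a regex (exact match only)."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))


def primer_positions(seq, primer):
    """Return 0-based start offsets of every primer match in seq."""
    return [m.start() for m in primer_regex(primer).finditer(seq)]


F_PRIMER = "CCTAYGGGRBGCASCAG"
R_PRIMER = "GGACTACNNGGGTATCTAAT"

# Short fragments copied from the first forward and first reverse read.
fwd_fragment = "ACTCCTACGGGAGGCAGCAGTGG"
rev_fragment = "GTTTCGGGACTACACGGGTATCTAATCCTG"

print(primer_positions(fwd_fragment, F_PRIMER))
print(primer_positions(rev_fragment, R_PRIMER))
```

Run over whole reads, offsets well past 0 would mean the primer is embedded in the read rather than sitting at its start. For the actual trimming, a dedicated tool such as cutadapt, which tolerates mismatches, would probably be the more robust choice.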
Does anyone know what's going on here? All help appreciated!
Cheers,
Kevin
I can't imagine that strangers on the internet could provide a better answer to this question than the company itself. If this were a 10-year-old sample, I would understand why the company's records might not be available.
I've asked the company, but they haven't replied yet, so I posted here too because I'd like to get moving on this project ASAP and would like to avoid unnecessary waiting around if possible.