So here is a little benchmark: bbduk.sh vs. cutadapt, both run on all sequences having "Illumina" in their description from UniVec.
| file | format | type | num_seqs | sum_len | min_len | avg_len | max_len | remaining blastn hits | user time |
|-------------|-------|------|-------------|-----------------|---------|---------|--------|---------|--------------|
| original.fa | FASTA | DNA  | 179,617     | 129,380,317     | 200     | 720.3   | 22,125 | 9548    |              |
| bbduk.fa    | FASTA | DNA  | 175,618     | 125,033,934     | 200     | 712.0   | 22,125 | 1435    | **2m33.636s** |
| cutadapt.fa | FASTA | DNA  | **175,777** | **126,103,851** | 200     | 717.4   | 22,109 | **450** | 90m49.085s   |
The remaining blastn hits need to be removed entirely or the GenBank uploader will not accept the assembly.
So, cutadapt removes contamination more thoroughly and rescues more sequences, but bbduk is much faster.
Methods
cutadapt (version 2.9) and bbduk (BBMap version 38.90) were run under Linux on a multi-CPU box, using all sequences from the UniVec database ( https://www.ncbi.nlm.nih.gov/tools/vecscreen/univec/ ) matching "Illumina" in their FASTA header (sequences were extracted using the Python script below).
bbduk.sh in=in.fa out=bbduk.fa ref=illumina.primers.fasta ktrim=r k=21 rcomp=t mink=11 hdist=2 minlen=20
cutadapt -b file:illumina.primers.fasta -n 10 --rc -o cut.fa -m 200 --cores=0 in.fa
Remaining offending sequences were counted with blastn (2.9.0) against UniVec:
blastn -db UniVec -task blastn -outfmt "6 qacc qstart qend stitle" -reward 1 \
-penalty -5 -gapopen 3 -gapextend 3 -dust yes -soft_masking true \
-evalue 700 -searchsp 1750000000000 \
-query cleaned.fa -max_target_seqs 10 -num_threads 120 | grep -e "Illumina" | cut -f1 | sort -u | wc -l
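The grep | cut | sort -u | wc -l counting step can also be expressed in a few lines of Python, which makes the rule ("unique query accessions with an Illumina hit") explicit. This is just an illustrative sketch; the function name is mine, not from the pipeline above:

```python
def count_offending(lines):
    """Count unique query accessions (column 1, qacc) whose subject title
    (column 4, stitle) mentions "Illumina".

    Expects tab-separated lines as produced by
    blastn -outfmt "6 qacc qstart qend stitle".
    """
    hits = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 4 and "Illumina" in fields[3]:
            hits.add(fields[0])
    return len(hits)
```

Fed with `sys.stdin`, this replaces the shell pipeline after the blastn call.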
Sequence statistics were assessed with seqkit stats.
Python script to filter UniVec:
import sys
from Bio import SeqIO

# Keep only UniVec entries whose description mentions Illumina
for record in SeqIO.parse(sys.argv[1], "fasta"):
    if "Illumina" in record.description:
        print(record.format("fasta"), end="")
Workflow
Here is the complete workflow in the hope that it is useful. Dependencies: cutadapt, seqkit, Python/Biopython, Perl/BioPerl (arrggh!). I would be happy about a (readable) seqkit or awk one-liner that could replace the Perl script.
prepAsm4Genbank.sh:
#!/bin/sh
set -xeu
for asm in "$@" ; do
    echo "processing $asm"
    cutadapt -b file:illumina.primers.fasta -n 10 --rc -o "$asm.cut.fa" -m 200 --cores=0 "$asm"
    blastn -db UniVec -task blastn -outfmt "6 qacc stitle" -reward 1 \
        -penalty -5 -gapopen 3 -gapextend 3 -dust yes -soft_masking true -evalue 700 \
        -searchsp 1750000000000 -query "$asm.cut.fa" \
        -max_target_seqs 10 -num_threads 120 | grep -e "Illumina" | cut -f1 | sort -u > "$asm.delme"
    seqkit grep -v -f "$asm.delme" "$asm.cut.fa" | seqkit replace -is -p "^n+|n+$" -r "" | ./genbank-N-filter.pl | seqkit seq --min-len 200 -g > "$asm.genbank-cleaned.fa"
done
genbank-N-filter.pl:
#!/usr/bin/env perl
use strict;
use warnings;
use Bio::SeqIO;

# Drop sequences violating GenBank's terminal-N rule: more than 5 Ns in the
# first/last 10 bases, or more than 15 Ns in the first/last 50 bases.
my $seqio = Bio::SeqIO->new(-fh => \*STDIN, -format => 'Fasta');
my $out   = Bio::SeqIO->new(-format => 'Fasta');    # writes to STDOUT
while (my $seq = $seqio->next_seq) {
    my $sub10r = substr $seq->seq, -10;
    my $sub10l = $seq->subseq(1, 10);
    my $sub50r = substr $seq->seq, -50;
    my $sub50l = $seq->subseq(1, 50);
    my $m = () = $sub10l =~ /N/gi;
    my $n = () = $sub10r =~ /N/gi;
    my $o = () = $sub50l =~ /N/gi;
    my $p = () = $sub50r =~ /N/gi;
    next if $m > 5 or $n > 5 or $o > 15 or $p > 15;
    $out->write_seq($seq);
}
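Since I was after something more readable than the Perl hack, the same terminal-N rule can be sketched in Biopython. Only the counting rule (more than 5 Ns per terminal 10 bp, more than 15 Ns per terminal 50 bp) comes from the script above; the file handling and the function name are illustrative:

```python
import sys

def passes_genbank_n_rule(seq):
    """Apply GenBank's terminal-N rule (mirrors genbank-N-filter.pl):
    reject sequences with more than 5 Ns in the first or last 10 bases,
    or more than 15 Ns in the first or last 50 bases."""
    s = str(seq).upper()
    for span, limit in ((10, 5), (50, 15)):
        if s[:span].count("N") > limit or s[-span:].count("N") > limit:
            return False
    return True

if __name__ == "__main__" and len(sys.argv) > 1:
    # Requires Biopython; reads the FASTA file given as the first argument.
    from Bio import SeqIO
    for record in SeqIO.parse(sys.argv[1], "fasta"):
        if passes_genbank_n_rule(record.seq):
            sys.stdout.write(record.format("fasta"))
```

It would slot into the pipeline in place of ./genbank-N-filter.pl.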
__END__
Comments

Michael Dondrup: If you are willing, give bbduk.sh from the BBMap suite a try. A GUIDE is available. I am happy to help if needed.

With the same blastn benchmark against UniVec, there are only 1435 left. Blazing fast...
Can you remove mink=? Also add ktrim=rl. Is k= set to less than 1/2 the length of the smallest reference primer you have? If not, can you do that and retry?

Like so? The minimum length of the Illumina primers is 22 bp, so I tried with k=10. That removes a bit too much: whenever I set hdist > 0, it removes 100% with ktrim=rl.
I wonder if you have some adapter sequences at the beginning of sequences, which is leading to the removal of all sequences when you use a short k. This is an interesting problem and I would not mind playing with the data, but if you are happy with the cutadapt result then that is fine. It would be curious to see if NCBI accepts your new submission. One could do something like the following to ktrim= in both directions.

Thank you for this comment! The NCBI uploader should accept any submission where my benchmark gives 0 hits, because that is what they are using. However, I suspect that if the adapters were not properly trimmed, I now have an assembly where contigs are mostly linked by adapters; was that what you meant? How could I check this?

Given the chained command, already the first step removes 100%.
That was not what I was thinking, but I suppose that is a possibility. With bbduk in trim mode, once a matching k-mer is found, all the sequence to the right of that match is eliminated (when you use ktrim=r). That is likely what is removing all of your sequences. This is generally fine, since with Illumina adapters, once you run into adapter sequence you can get rid of the rest of the sequence to the right (applies to primer dimers too). I hope no adapters remained in your data before assembly. That is why bbduk has an overlap mode, tbo and tpe, which ensures that even the smallest contaminants (down to 1 bp) are removed from the original data.

I have checked two more Trinity assemblies I made with Trimmomatic clipping, and checked them with FastQC for adapter content. The removal rates are similar to the assembly I have to upload, so I guess this is somewhat normal.
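For reference, on the raw reads (before assembly) the tbo/tpe flags mentioned above would be used roughly like this. This is an illustrative command fragment, not a command from this thread; the file names and the k/mink/hdist values are placeholders, only ktrim=r, tbo and tpe are the point:

```shell
# Paired-end adapter trimming with bbduk; tbo trims by read-pair overlap,
# tpe trims both reads of a pair to the same length.
bbduk.sh in1=reads_1.fq in2=reads_2.fq out1=clean_1.fq out2=clean_2.fq \
    ref=illumina.primers.fasta ktrim=r k=23 mink=11 hdist=1 tbo tpe
```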
Let us know what NCBI says. They could be using k-mer matching instead of blast to check for contaminants.

Almost :/ After a longer round of trial and error (remove trailing Ns, but then some seqs become too short, so filter for length again...), only one error is left:
contig_70602, Sequence has more than 5 Ns in the last 10 bases or more than 15 Ns in the last 50 bases
Argh, going to fix that too...
I couldn't figure out how to test the above condition in seqkit, so I made a quick Perl hack. And the uploaded file is finally accepted by the upload checker. So the pipeline

cutadapt -> blastn vs UniVec -> seqkit grep -> seqkit remove N -> Perl script -> seqkit min-len -> GenBank

seems to work for a single test case.