So here is a little benchmark: bbduk.sh vs. cutadapt, both run on all sequences having "Illumina" in their description from UniVec.
| file | format | type | num_seqs | sum_len | min_len | avg_len | max_len | remaining blastn hits | user time |
|-------------|-------|------|-------------|-----------------|---------|---------|--------|---------|--------------|
| original.fa | FASTA | DNA  | 179,617     | 129,380,317     | 200     | 720.3   | 22,125 | 9548    |              |
| bbduk.fa    | FASTA | DNA  | 175,618     | 125,033,934     | 200     | 712.0   | 22,125 | 1435    | **2m33.636s** |
| cutadapt.fa | FASTA | DNA  | **175,777** | **126,103,851** | 200     | 717.4   | 22,109 | **450** | 90m49.085s   |
The remaining blastn hits need to be removed entirely or the GenBank uploader will not accept the assembly.
So, cutadapt removes contamination more thoroughly and rescues more sequences, but bbduk is much faster.
Methods
cutadapt (version 2.9) and bbduk (BBMap version 38.90) were run under Linux on a multi-CPU box, using all sequences from the UniVec database ( https://www.ncbi.nlm.nih.gov/tools/vecscreen/univec/ ) matching "Illumina" in their FASTA header (sequences were extracted using the Python script below).
bbduk.sh in=in.fa out=bbduk.fa ref=illumina.primers.fasta ktrim=r k=21 rcomp=t mink=11 hdist=2 minlen=20
cutadapt -b file:illumina.primers.fasta -n 10 --rc -o cut.fa -m 200 --cores=0 in.fa
Remaining offending sequences were counted with blastn (2.9.0) against UniVec:
blastn -db UniVec -task blastn -outfmt "6 qacc qstart qend stitle" -reward 1 \
-penalty -5 -gapopen 3 -gapextend 3 -dust yes -soft_masking true \
-evalue 700 -searchsp 1750000000000 \
-query cleaned.fa -max_target_seqs 10 -num_threads 120 | grep -e "Illumina" | cut -f1 | sort -u | wc -l
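The grep | cut | sort -u | wc -l counting step can also be expressed in a few lines of Python, which makes the rule ("unique query accessions with an Illumina hit") explicit. This is just an illustrative sketch; the function name is mine, not from the pipeline above:

```python
def count_offending(lines):
    """Count unique query accessions (column 1, qacc) whose subject title
    (column 4, stitle) mentions "Illumina".

    Expects tab-separated lines as produced by
    blastn -outfmt "6 qacc qstart qend stitle".
    """
    hits = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 4 and "Illumina" in fields[3]:
            hits.add(fields[0])
    return len(hits)
```

Fed with `sys.stdin`, this replaces the shell pipeline after the blastn call.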
Sequence statistics were assessed with seqkit stats.
Python script to filter UniVec:
import sys
from Bio import SeqIO

# Keep only UniVec entries whose description mentions Illumina
for record in SeqIO.parse(sys.argv[1], "fasta"):
    if "Illumina" in record.description:
        print(record.format("fasta"), end="")
Workflow
Here is the complete workflow in the hope that it is useful. Dependencies: cutadapt, seqkit, Python/Biopython, Perl/BioPerl (arrggh!). I would be happy about a (readable) seqkit or awk one-liner that could replace the Perl script.
prepAsm4Genbank.sh:
#!/bin/sh
set -xeu
for asm in "$@" ; do
    echo "processing $asm"
    cutadapt -b file:illumina.primers.fasta -n 10 --rc -o "$asm.cut.fa" -m 200 --cores=0 "$asm"
    blastn -db UniVec -task blastn -outfmt "6 qacc stitle" -reward 1 \
        -penalty -5 -gapopen 3 -gapextend 3 -dust yes -soft_masking true -evalue 700 \
        -searchsp 1750000000000 -query "$asm.cut.fa" \
        -max_target_seqs 10 -num_threads 120 | grep -e "Illumina" | cut -f1 | sort -u > "$asm.delme"
    seqkit grep -v -f "$asm.delme" "$asm.cut.fa" | seqkit replace -is -p "^n+|n+$" -r "" | ./genbank-N-filter.pl | seqkit seq --min-len 200 -g > "$asm.genbank-cleaned.fa"
done
genbank-N-filter.pl:
#!/usr/bin/env perl
use strict;
use warnings;
use Bio::SeqIO;

# Drop sequences violating GenBank's terminal-N rule: more than 5 Ns in the
# first/last 10 bases, or more than 15 Ns in the first/last 50 bases.
my $seqio = Bio::SeqIO->new(-fh => \*STDIN, -format => 'Fasta');
my $out   = Bio::SeqIO->new(-format => 'Fasta');    # writes to STDOUT
while (my $seq = $seqio->next_seq) {
    my $sub10r = substr $seq->seq, -10;
    my $sub10l = $seq->subseq(1, 10);
    my $sub50r = substr $seq->seq, -50;
    my $sub50l = $seq->subseq(1, 50);
    my $m = () = $sub10l =~ /N/gi;
    my $n = () = $sub10r =~ /N/gi;
    my $o = () = $sub50l =~ /N/gi;
    my $p = () = $sub50r =~ /N/gi;
    next if $m > 5 or $n > 5 or $o > 15 or $p > 15;
    $out->write_seq($seq);
}
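Since I was after something more readable than the Perl hack, the same terminal-N rule can be sketched in Biopython. Only the counting rule (more than 5 Ns per terminal 10 bp, more than 15 Ns per terminal 50 bp) comes from the script above; the file handling and the function name are illustrative:

```python
import sys

def passes_genbank_n_rule(seq):
    """Apply GenBank's terminal-N rule (mirrors genbank-N-filter.pl):
    reject sequences with more than 5 Ns in the first or last 10 bases,
    or more than 15 Ns in the first or last 50 bases."""
    s = str(seq).upper()
    for span, limit in ((10, 5), (50, 15)):
        if s[:span].count("N") > limit or s[-span:].count("N") > limit:
            return False
    return True

if __name__ == "__main__" and len(sys.argv) > 1:
    # Requires Biopython; reads the FASTA file given as the first argument.
    from Bio import SeqIO
    for record in SeqIO.parse(sys.argv[1], "fasta"):
        if passes_genbank_n_rule(record.seq):
            sys.stdout.write(record.format("fasta"))
```

It would slot into the pipeline in place of ./genbank-N-filter.pl.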
__END__
Comments

Michael Dondrup: If you are willing, give bbduk.sh from the BBMap suite a try. A GUIDE is available. I am happy to help if needed.

With the same blastn benchmark against UniVec, there are only 1435 left. Blazing fast...
Can you remove mink=? Also add ktrim=rl. Is k= set to less than 1/2 the length of the smallest reference primer you have? If not, can you do that and retry?

Like so? The minimum length of the Illumina primers is 22 bp, so I tried with k=10. That removes a bit too much: whenever I set hdist > 0, it removes 100% with ktrim=rl.
I wonder if you have some adapter sequences at the beginning of sequences, which is leading to the removal of all sequences when you use a short k. This is an interesting problem and I would not mind playing with the data, but if you are happy with the cutadapt result then that is fine. It would be curious to see if NCBI accepts your new submission. One could do something like the following to ktrim= in both directions.

Thank you for this comment! The NCBI uploader should accept any submission where my benchmark gives 0 hits, because that is what they are using. However, I suspect that if the adapters were not properly trimmed, I now have an assembly where contigs are mostly linked by adapters; was that what you meant? How could I check this?

Given the chained command, already the first step removes 100%.
That was not what I was thinking, but I suppose that is a possibility. With bbduk in trim mode, once a matching k-mer is found, all the sequence to the right of that match is eliminated (when you use ktrim=r). That is likely what is removing all of your sequences. This is generally fine, since with Illumina adapters, once you run into adapter sequence you can get rid of the rest of the sequence to the right (applies to primer dimers too). I hope no adapters remained in your data before assembly. That is why bbduk has an overlap mode, tbo and tpe, which ensures that even the smallest contaminants (down to 1 bp) are removed from the original data.

I have checked two more Trinity assemblies I made with Trimmomatic clipping, and checked them with FastQC for adapter content. The removal rates are similar to the assembly I have to upload, so I guess this is somewhat normal.
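For reference, on the raw reads (before assembly) the tbo/tpe flags mentioned above would be used roughly like this. This is an illustrative command fragment, not a command from this thread; the file names and the k/mink/hdist values are placeholders, only ktrim=r, tbo and tpe are the point:

```shell
# Paired-end adapter trimming with bbduk; tbo trims by read-pair overlap,
# tpe trims both reads of a pair to the same length.
bbduk.sh in1=reads_1.fq in2=reads_2.fq out1=clean_1.fq out2=clean_2.fq \
    ref=illumina.primers.fasta ktrim=r k=23 mink=11 hdist=1 tbo tpe
```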
Let us know what NCBI says. They could be using k-mer matching instead of blast to check for contaminants.

Almost :/ After a longer round of trial and error (remove trailing Ns, but then some seqs become too short, so filter for length again...), only one error is left:
contig_70602, Sequence has more than 5 Ns in the last 10 bases or more than 15 Ns in the last 50 bases
Argh, going to fix that too...
I couldn't figure out how to test the above condition in seqkit, so I made a quick Perl hack. And the uploaded file is finally accepted by the upload checker. So the pipeline

cutadapt -> blastn vs UniVec -> seqkit grep -> seqkit remove N -> Perl script -> seqkit min-len -> GenBank

seems to work for a single test case.