Removing UniVec sequences in Assembly
3.8 years ago
jfo ▴ 50

This is my first time uploading a de novo transcriptome assembly to TSA, and the submission was flagged with Code(VECTOR_MATCH). I tried to trim the vector using bbduk and cutadapt, but I couldn't seem to make it work. For example, take this adapter:

>gnl|uv|NGB00150.1:1-46 Ambion FirstChoice RLM-RACE 3' RACE adapter
GCGAGCACAGAATTAATACGACTCACTATAGGTTTTTTTTTTTTVN

According to VecScreen, it matches my example sequence (the match is shown in bold):

> sample
GCAAAGAAGCATTTTGGCAAAAAATTGCGTAATATTCTGCCGTATGTTACTGCAATGTACACGTTTATAA
TTATTGTAATAAGAATGTCTCATATTGCCTGCTTGATGTGGCAGGGTCACTTGTCAAGTGAGGAAAAGTC
ACAGTGTGAGGACTGTCTATAAAAATTTAGGCATCATATTAAAATGTGTGGATGCCTTATTGTATAGAAT
ATTTCAAATTTTGCAAAATTTGAACAAAGCATATAAAATAAAAGGAACGAAATTGAAAAAAAAAAAAAAA
A**GTCGTATTAATTCTGTGCTCG**

My problem is that the sequence trimmers could not recognize the adapter. I had to manually reverse-complement the adapter and remove the "extra" bases just to get an exact match against my sequence. Had only 5 sequences been flagged, I could have removed the matches by hand, but more than a thousand sequences were flagged. I am not sure what I am doing wrong. This pandemic is making me too exhausted to read more bioinformatics...
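The manual step described above (reverse-complementing the adapter and looking for the part that actually occurs in the contig) can be scripted. The following is only a minimal sketch, not what VecScreen, bbduk, or cutadapt do internally. The function names `reverse_complement` and `find_adapter_hit`, the `min_len` cutoff, and the choice to search only for exact matches are assumptions made for illustration; `CONTIG_TAIL` is just the last bases of the sample sequence from this post.

```python
# Complement table for A/C/G/T plus the IUPAC codes that appear in this adapter (V, N)
# and their complements (V <-> B, N <-> N).
COMPLEMENT = str.maketrans("ACGTVBNacgtvbn", "TGCABVNtgcabvn")

ADAPTER = "GCGAGCACAGAATTAATACGACTCACTATAGGTTTTTTTTTTTTVN"
# Last bases of the sample contig from the post (built by concatenation to keep the A run readable).
CONTIG_TAIL = "GGAACGAAATTG" + "A" * 16 + "GTCGTATTAATTCTGTGCTCG"


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_adapter_hit(contig, adapter, min_len=12):
    """Longest exact stretch shared by the contig and the adapter on either strand.

    Returns (strand, contig_start, length), or None if nothing of at least
    min_len bases matches. Brute force, which is fine for short adapters.
    """
    best = None
    for strand, query in (("+", adapter), ("-", reverse_complement(adapter))):
        for length in range(len(query), min_len - 1, -1):
            if best is not None and length <= best[2]:
                break  # cannot improve on the hit already found
            for i in range(len(query) - length + 1):
                pos = contig.find(query[i:i + length])
                if pos != -1:
                    best = (strand, pos, length)
                    break
            if best is not None and best[2] == length:
                break
    return best


hit = find_adapter_hit(CONTIG_TAIL, ADAPTER)
print(hit)
if hit is not None:
    strand, start, length = hit
    print("matched bases:", CONTIG_TAIL[start:start + length])
    print("contig without the hit and everything after it:", CONTIG_TAIL[:start])
```

For this sample the hit is reported on the minus strand at the 3' end of the contig, which is consistent with having to reverse-complement the adapter by hand. Note that an exact-match search can pick up a flanking base that agrees by chance, so the reported stretch need not be identical to the VecScreen coordinates.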

Assembly RNA-Seq TSA cutadapt bbduk • 1.1k views
3.8 years ago
GenoMax 141k

I tried to trim the vector using bbduk and cutadapt, but I couldn't seem to make it work.

With bbduk.sh, that sequence is definitely detected and trimmed. Can you let us know how you ran bbduk.sh?

$ more test.fa
>test
GCAAAGAAGCATTTTGGCAAAAAATTGCGTAATATTCTGCCGTATGTTACTGCAATGTACACGTTTATAATTATTGTAATAAGAATGTCTCATATTGCCTGCTTGATGTGGCAGGGTCACTTGTCAAGTGAGGAAAAGTCACAGTGTGAGGACTGTCTATAAAAATTTAGGCATCATATTAAAATGTGTGGATGCCTTATTGTATAGAATATTTCAAATTTTGCAAAATTTGAACAAAGCATATAAAATAAAAGGAACGAAATTGAAAAAAAAAAAAAAAAGTCGTATTAATTCTGTGCTCG

$ bbduk.sh in=test.fa literal=GCGAGCACAGAATTAATACGACTCACTATAGGT ktrim=r out=new.fa k=10

$ more new.fa
>test
GCAAAGAAGCATTTTGGCAAAAAATTGCGTAATATTCTGCCGTATGTTACTGCAATGTACACGTTTATAA
TTATTGTAATAAGAATGTCTCATATTGCCTGCTTGATGTGGCAGGGTCACTTGTCAAGTGAGGAAAAGTC
ACAGTGTGAGGACTGTCTATAAAAATTTAGGCATCATATTAAAATGTGTGGATGCCTTATTGTATAGAAT
ATTTCAAATTTTGCAAAATTTGAACAAAGCATATAAAATAAAAGGAACGAAATTGAAAAAAAAAAAAAAA

If you use the full adapter sequence, more of the sequence is removed.

$ bbduk.sh in=test.fa literal=GCGAGCACAGAATTAATACGACTCACTATAGGTTTTTTTTTTTTVN overwrite=t ktrim=r out=new.fa k=10

$ more new.fa
>test
GCAAAGAAGCATTTTGGCAAAAAATTGCGTAATATTCTGCCGTATGTTACTGCAATGTACACGTTTATAA
TTATTGTAATAAGAATGTCTCATATTGCCTGCTTGATGTGGCAGGGTCACTTGTCAAGTGAGGAAAAGTC
ACAGTGTGAGGACTGTCTATAAAAATTTAGGCATCATATTAAAATGTGTGGATGCCTTATTGTATAGAAT
ATTTCAAATTTTGCAAAATTTGAACAAAGCATA
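To get a feel for why the value of k matters in k-mer based trimming, here is a small sketch that counts the k-mers shared between the adapter (both strands) and the end of the contig for a few values of k. It is an illustration of the general idea only and does not reproduce BBDuk's actual matching, which also has options such as `mink`, `hdist` and `maskmiddle`. The helper names (`revcomp`, `kmers`, `shared_kmers`) are made up for this example, `ADAPTER_PREFIX` is the literal from the first command above, and `CONTIG_TAIL` is the last bases of the test sequence.

```python
ADAPTER_PREFIX = "GCGAGCACAGAATTAATACGACTCACTATAGGT"  # literal= from the first bbduk.sh command
CONTIG_TAIL = "GGAACGAAATTG" + "A" * 16 + "GTCGTATTAATTCTGTGCTCG"


def revcomp(seq):
    """Reverse complement for plain A/C/G/T strings."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def kmers(seq, k):
    """Set of all k-mers in seq made up only of A, C, G and T."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)
            if set(seq[i:i + k]) <= set("ACGT")}


def shared_kmers(contig, adapter, k):
    """k-mers of the contig that also occur in the adapter on either strand."""
    ref = kmers(adapter, k)
    ref |= {revcomp(kmer) for kmer in ref}
    return ref & kmers(contig, k)


for k in (10, 21, 23):
    print("k =", k, "shared k-mers:", len(shared_kmers(CONTIG_TAIL, ADAPTER_PREFIX, k)))
```

With exact matching, the number of shared k-mers shrinks as k grows and drops to zero once k is longer than the stretch of adapter that is really present in the contig. A short adapter remnant at the end of a contig therefore needs a k that is not larger than the remnant itself.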

This worked. I had used k=21, and I am not sure why that did not work. But thanks.


Please consider accepting this answer (green checkmark) to provide closure to this thread.

3.8 years ago

We find adapters in genomes all the time, which is terrible for metagenomics.

This simple tool might help you at least diagnose your problem and/or replace the adapters with Ns: https://github.com/colindaven/blacklister

Try additional adapter trimmers such as Trimmomatic and fastp, and after adapter trimming, really check the results using FastQC and MultiQC.
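As a rough illustration of the "replace the adapters with Ns" idea, here is a minimal sketch that masks exact occurrences of an adapter fragment on either strand. This is not how blacklister is implemented; the function names `reverse_complement` and `mask_adapter` are hypothetical, and only exact matches are handled.

```python
def reverse_complement(seq):
    """Reverse complement for plain A/C/G/T strings (upper or lower case)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def mask_adapter(seq, adapter_core):
    """Replace exact occurrences of adapter_core, on either strand, with Ns.

    The sequence length is preserved, so coordinates stay valid after masking.
    """
    for motif in (adapter_core, reverse_complement(adapter_core)):
        seq = seq.replace(motif, "N" * len(motif))
    return seq


# Example: the 21 adapter bases that VecScreen reported in the sample contig.
contig_end = "AAAAGTCGTATTAATTCTGTGCTCG"
print(mask_adapter(contig_end, "CGAGCACAGAATTAATACGAC"))
```

Masking keeps the contig length unchanged, whereas trimming shortens it. Which of the two is appropriate depends on where the hit sits and on what the submission requires.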
