Question: Removing UniVec sequences in Assembly
jfo wrote, 7 weeks ago:

This is my first time uploading a de novo transcriptome assembly to TSA, and the submission was flagged with Code(VECTOR_MATCH). I tried to trim the vector using bbduk and cutadapt, but I could not get either to work. For example, take this adapter:

>gnl|uv|NGB00150.1:1-46 Ambion FirstChoice RLM-RACE 3' RACE adapter
GCGAGCACAGAATTAATACGACTCACTATAGGTTTTTTTTTTTTVN

According to VecScreen, it matches my example sequence (match shown in bold).

>sample
GCAAAGAAGCATTTTGGCAAAAAATTGCGTAATATTCTGCCGTATGTTACTGCAATGTACACGTTTATAA
TTATTGTAATAAGAATGTCTCATATTGCCTGCTTGATGTGGCAGGGTCACTTGTCAAGTGAGGAAAAGTC
ACAGTGTGAGGACTGTCTATAAAAATTTAGGCATCATATTAAAATGTGTGGATGCCTTATTGTATAGAAT
ATTTCAAATTTTGCAAAATTTGAACAAAGCATATAAAATAAAAGGAACGAAATTGAAAAAAAAAAAAAAA
A**GTCGTATTAATTCTGTGCTCG**

My problem is that the sequence trimmers did not recognize the adapter. I had to manually reverse complement the adapter and remove the "extra" bases just to get an exact match to my sequence. If it were 5 sequences, I could remove them manually, but more than a thousand sequences were flagged. I am not sure what I am doing wrong. This pandemic is making me too exhausted to read more bioinformatics...
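
The manual step described above (reverse complement the adapter, then look for the part that actually matches) can be sketched in a few lines. This is only an illustration: the helper names are made up, and `CONTIG_END` is just the last 45 bases of the example sequence (8 bases, the 16-base poly(A) run, then the 21 flagged bases), not the full contig.

```python
# Sketch: longest exact match between the contig end and the adapter,
# checked on both strands. Helper names are hypothetical, not from any tool.

# Complement table covering the IUPAC codes that occur in this adapter:
# V (A/C/G) pairs with B (C/G/T); N stays N.
COMPLEMENT = str.maketrans("ACGTVN", "TGCABN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_common_substring(a: str, b: str) -> str:
    """Longest exact substring shared by a and b (row-by-row dynamic programming)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]


# UniVec entry NGB00150.1 from the post (32 bases, 12 T, then VN).
ADAPTER = "GCGAGCACAGAATTAATACGACTCACTATAGG" + "T" * 12 + "VN"
# Last 45 bases of the example sequence from the post.
CONTIG_END = "CGAAATTG" + "A" * 16 + "GTCGTATTAATTCTGTGCTCG"

forward_hit = longest_common_substring(CONTIG_END, ADAPTER)
reverse_hit = longest_common_substring(CONTIG_END, reverse_complement(ADAPTER))
print(len(forward_hit), len(reverse_hit), reverse_hit)
```

On this example the long hit is on the reverse strand only, and it covers just part of the adapter (22 bases: the 21 that VecScreen highlighted plus the A in front of them, which also happens to agree). A trimmer that looks for the whole adapter, or only for the forward orientation, would therefore not be expected to find it.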

genomax (United States) wrote, 6 weeks ago:

"I tried to trim the vector using bbduk and cutadapt, but I couldn't seem to make it work."

With bbduk.sh, that sequence is definitely detected and trimmed. Can you tell us how you ran bbduk.sh?

$ more test.fa
>test
GCAAAGAAGCATTTTGGCAAAAAATTGCGTAATATTCTGCCGTATGTTACTGCAATGTACACGTTTATAATTATTGTAATAAGAATGTCTCATATTGCCTGCTTGATGTGGCAGGGTCACTTGTCAAGTGAGGAAAAGTCACAGTGTGAGGACTGTCTATAAAAATTTAGGCATCATATTAAAATGTGTGGATGCCTTATTGTATAGAATATTTCAAATTTTGCAAAATTTGAACAAAGCATATAAAATAAAAGGAACGAAATTGAAAAAAAAAAAAAAAAGTCGTATTAATTCTGTGCTCG

$ bbduk.sh in=test.fa literal=GCGAGCACAGAATTAATACGACTCACTATAGGT ktrim=r out=new.fa k=10

$ more new.fa
>test
GCAAAGAAGCATTTTGGCAAAAAATTGCGTAATATTCTGCCGTATGTTACTGCAATGTACACGTTTATAA
TTATTGTAATAAGAATGTCTCATATTGCCTGCTTGATGTGGCAGGGTCACTTGTCAAGTGAGGAAAAGTC
ACAGTGTGAGGACTGTCTATAAAAATTTAGGCATCATATTAAAATGTGTGGATGCCTTATTGTATAGAAT
ATTTCAAATTTTGCAAAATTTGAACAAAGCATATAAAATAAAAGGAACGAAATTGAAAAAAAAAAAAAAA

If you use the full adapter sequence (including the poly-T stretch), more sequence is removed:

$ bbduk.sh in=test.fa literal=GCGAGCACAGAATTAATACGACTCACTATAGGTTTTTTTTTTTTVN overwrite=t ktrim=r out=new.fa k=10

$ more new.fa
>test
GCAAAGAAGCATTTTGGCAAAAAATTGCGTAATATTCTGCCGTATGTTACTGCAATGTACACGTTTATAA
TTATTGTAATAAGAATGTCTCATATTGCCTGCTTGATGTGGCAGGGTCACTTGTCAAGTGAGGAAAAGTC
ACAGTGTGAGGACTGTCTATAAAAATTTAGGCATCATATTAAAATGTGTGGATGCCTTATTGTATAGAAT
ATTTCAAATTTTGCAAAATTTGAACAAAGCATA
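
The difference between the two runs above can be illustrated with a small sketch of right-trimming by k-mer matching. It assumes exact k-mer matches on either strand and cuts at the leftmost hit. This is a simplification, not bbduk's implementation, and the function names are made up. `CONTIG_END` is the last 45 bases of the test sequence.

```python
# Sketch of k-mer based right-trimming (exact matches only, both strands).
# A simplified model for illustration; not how bbduk.sh is implemented.

COMPLEMENT = str.maketrans("ACGTVN", "TGCABN")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def kmer_set(seq: str, k: int) -> set:
    """All k-mers of seq (empty if seq is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def trim_right(read: str, ref: str, k: int) -> str:
    """Cut the read at the leftmost k-mer that also occurs in ref or its reverse complement."""
    ref_kmers = kmer_set(ref, k) | kmer_set(reverse_complement(ref), k)
    for i in range(len(read) - k + 1):
        if read[i:i + k] in ref_kmers:
            return read[:i]
    return read


CONTIG_END = "CGAAATTG" + "A" * 16 + "GTCGTATTAATTCTGTGCTCG"
SHORT_LITERAL = "GCGAGCACAGAATTAATACGACTCACTATAGGT"                  # first run
FULL_LITERAL = "GCGAGCACAGAATTAATACGACTCACTATAGG" + "T" * 12 + "VN"  # second run

short_trim = trim_right(CONTIG_END, SHORT_LITERAL, 10)
full_trim = trim_right(CONTIG_END, FULL_LITERAL, 10)
print(short_trim)  # keeps 15 of the 16 A's
print(full_trim)   # poly(A) run is removed as well
```

With the short literal only the adapter remnant (and the one A that also matches) is cut. With the full literal, the reverse complement of the poly-T stretch is a poly(A) k-mer, so the cut moves to the start of the poly(A) tail. bbduk.sh has more matching options than this sketch and trimmed even further in the second run above, which the sketch does not reproduce.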

This worked. I used k=21; I am not sure why. Thanks!

Reply written 6 weeks ago by jfo
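
On the choice of k: one way to see how it interacts with this particular match is to count how many k-mers of the contig end also occur in the adapter (either strand) for a few values of k. The sketch below assumes exact matching and uses the 33-base literal from the first bbduk.sh command above. `count_kmer_hits` is a made-up helper, and `CONTIG_END` is the last 45 bases of the test sequence.

```python
# Sketch: how many k-mers of the read are shared with the adapter, per k.
# Exact matching on both strands; illustration only.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def kmer_set(seq: str, k: int) -> set:
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def count_kmer_hits(read: str, ref: str, k: int) -> int:
    """Number of k-mer positions in read whose k-mer occurs in ref or its reverse complement."""
    ref_kmers = kmer_set(ref, k) | kmer_set(reverse_complement(ref), k)
    return sum(read[i:i + k] in ref_kmers for i in range(len(read) - k + 1))


ADAPTER = "GCGAGCACAGAATTAATACGACTCACTATAGGT"
CONTIG_END = "CGAAATTG" + "A" * 16 + "GTCGTATTAATTCTGTGCTCG"

hits = {k: count_kmer_hits(CONTIG_END, ADAPTER, k) for k in (10, 21, 22, 23, 27)}
print(hits)  # {10: 13, 21: 2, 22: 1, 23: 0, 27: 0}
```

The adapter remnant in this contig is only 22 bases long, so under exact matching any k above 22 finds nothing. As far as I know the bbduk.sh default is k=27, which would fit that pattern, but please check the tool's own documentation rather than relying on this note.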

Please consider accepting this answer (green checkmark) to provide closure to this thread.

Reply written 6 weeks ago by genomax
colindaven (Hannover Medical School) wrote, 6 weeks ago:

We find adapters in genomes all the time, which is terrible for metagenomics.

This simple tool might help you to at least diagnose your problem and/or replace the adapters with Ns: https://github.com/colindaven/blacklister
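
To show what "replace the adapters with Ns" means in practice, here is a minimal sketch that masks known contaminant fragments in a sequence. It is not the blacklister implementation; the function name is hypothetical, and the fragment is the 22-base adapter remnant at the end of the example sequence in this thread.

```python
# Sketch: mask exact occurrences of known fragments with N, keeping sequence length.
# Hypothetical helper for illustration; not taken from blacklister.

def mask_fragments(seq: str, fragments: list) -> str:
    """Replace every exact occurrence of each fragment with a run of N of the same length."""
    masked = seq
    for frag in fragments:
        if not frag:  # an empty fragment would never advance the search
            continue
        start = masked.find(frag)
        while start != -1:
            masked = masked[:start] + "N" * len(frag) + masked[start + len(frag):]
            start = masked.find(frag, start + len(frag))
    return masked


CONTIG_END = "CGAAATTG" + "A" * 16 + "GTCGTATTAATTCTGTGCTCG"
ADAPTER_REMNANT = "AGTCGTATTAATTCTGTGCTCG"

masked_end = mask_fragments(CONTIG_END, [ADAPTER_REMNANT])
print(masked_end)
```

Masking keeps coordinates stable, which helps when diagnosing where contamination sits. Whether masked or trimmed sequence is acceptable depends on where the assembly is going.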

Also try additional adapter trimmers such as Trimmomatic and fastp, and after adapter trimming, carefully check the results using FastQC and MultiQC.
