Question: Removing primer and a variable length adaptor/barcode sequence from .fna file
gravatar for caverill
2.2 years ago by
caverill40 wrote:

I have a set of sequence files, where individual sequences look like this:


There is a primer sequence that has a few variable positions: TCCTSCGCTTATTGATATGC

However, this particular sequence has 26bp before the primer starts, that includes both an adaptor sequence and a barcode. In this case that sequence is: GACTACACGTAGTATATCTAGCGACT

So, the primer is always 20bp, but the leading adaptor and barcode can vary by 2bp in length, hence I can't just trim 46bp off the front. Is there a good way to handle this? Unforunately I dont have a file of adaptor and barcode sequences, just this information. It may be relevant that this is old 454 data.

primer • 1.1k views
ADD COMMENTlink modified 2.2 years ago • written 2.2 years ago by caverill40


Is that a typo?

What is your aim with this data? If you are just going to align it (depending on the aligner) the primer and tag can get softclipped.

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by WouterDeCoster44k

its intended to represent a variable position base (either a G or a C), these are old sequences (~2012), and I realize naming conventions for variable positions may have changed.

I want to remove the primer sequence and everything that comes before it. I do not plan to align these sequences, they are going to be compared to a reference database (UNITE) via RDP. They are amplicon fungal ITS2 sequences from soil.

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by caverill40

A second follow up: After running this command my sequence file looks like this:

>SRR1502564_598 SRR1502564.4200 ID1YYUI01ENB5O length=496 orig_bc=AAAAAAAAAAAA new_bc=AAAAAAAAAAAA bc_diffs=0

The sequence is actually broken up onto multiple lines, rather being written to a single line. This can be verified with wc -l filename.fna. This is going to be a problem downstream. Any ideas on how to fix this?

ADD REPLYlink modified 2.2 years ago by genomax91k • written 2.2 years ago by caverill40

It is ok to have fasta files with sequence wrapping around like it. What program are you planning to use for downstream analysis? If you do want to make them single line fasta then use the solution in this thread: Multiline Fasta To Single Line Fasta

ADD REPLYlink written 2.2 years ago by genomax91k

Gotcha. I do some manipulations using sed downstram which require the sequence line to be a single line. Thanks for this link.

ADD REPLYlink written 2.2 years ago by caverill40
gravatar for genomax
2.2 years ago by
United States
genomax91k wrote:

Use from BBMap suite with literal=TCCTGCGCTTATTGATATGC,TCCTCCGCTTATTGATATGC ktrim=l k=10 (and any additional options you need). This will remove the primer and all the bases to the left. Here is a detailed guide for the program.

ADD COMMENTlink modified 2.2 years ago • written 2.2 years ago by genomax91k

This works so well! Thank you! I see that ktrim=l tells bbduk to trim everything to the left. Can you elaborate on what k=10 indicates?

ADD REPLYlink written 2.2 years ago by caverill40

That is k-mer size used for the initial match of the literal sequence (which is then extended outwards).

ADD REPLYlink written 2.2 years ago by genomax91k

Following up on this solution: what happens in the kmer is not found? Is it just passed to the output file? I am running this again to filter a sequence file that contains reads with multiple different primers. I would like to subset to a particular primer set of interest. To do this, i figured I could just use bbduk to trim the primer of interest, and discard reads that do not match the primer. Is there a simple way to do this with bbduk?

ADD REPLYlink written 21 months ago by caverill40
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1244 users visited in the last hour