Question: Removing primer and a variable length adaptor/barcode sequence from .fna file
0
gravatar for caverill
14 months ago by
caverill40
caverill40 wrote:

I have a set of sequence files, where individual sequences look like this:

GACTACACGTAGTATATCTAGCGACTTCCTGCGCTTATTGATATGCTTAAGTTCAGCGGGTATCCCTACCTGATCCGAGGTCAACCTGAAAAATGGGGGTTCGTGCAGGTGCCGCCCGGGGTCGTGTAGCGAGGAGTATTACTACGCTTAGAGCCCGGCCGTACCGCCACTGCTTTTGTAGGCCCGCCAACCGGCGGTGCCCAACGACCCAGCGAGCTGGATTGGTTATAATGACGCTCGAACAGGCATGCCCCTCGGAATACCAAGGGGCGCAATGTGCGTTCAAAGATTCGATGATTCACTGAATTCTGCAATTCACATTACTTATCGCATTTCGCTACGTTCTTCATCGATGACGAGATACTAGGTGTGGTCGGCGTCTCTCAAGGCACACAGGGGATAGG

There is a primer sequence that has a few variable positions: TCCTSCGCTTATTGATATGC

However, this particular sequence has 26bp before the primer starts, that includes both an adaptor sequence and a barcode. In this case that sequence is: GACTACACGTAGTATATCTAGCGACT

So, the primer is always 20bp, but the leading adaptor and barcode can vary by 2bp in length, hence I can't just trim 46bp off the front. Is there a good way to handle this? Unforunately I dont have a file of adaptor and barcode sequences, just this information. It may be relevant that this is old 454 data.

primer • 545 views
ADD COMMENTlink modified 14 months ago • written 14 months ago by caverill40

TCCTSCGCTTATTGATATGC

Is that a typo?

What is your aim with this data? If you are just going to align it (depending on the aligner) the primer and tag can get softclipped.

ADD REPLYlink modified 14 months ago • written 14 months ago by WouterDeCoster41k
1

its intended to represent a variable position base (either a G or a C), these are old sequences (~2012), and I realize naming conventions for variable positions may have changed.

I want to remove the primer sequence and everything that comes before it. I do not plan to align these sequences, they are going to be compared to a reference database (UNITE) via RDP. They are amplicon fungal ITS2 sequences from soil.

ADD REPLYlink modified 14 months ago • written 14 months ago by caverill40

A second follow up: After running this command my sequence file looks like this:

>SRR1502564_598 SRR1502564.4200 ID1YYUI01ENB5O length=496 orig_bc=AAAAAAAAAAAA new_bc=AAAAAAAAAAAA bc_diffs=0
TTAAGTTCAGCGGGTAGTCTTACTTGATTTGAGATCGAGTTTTACAAGAGTCGTTTCCAACCCCTGTGAA
ATCCTGCATCAGTCAGCCAATACGGTCAGACTCCCTTTATGTTAGCTGCAGCAAAAGTAATAATCCGTTT
GACGGGACTAAATAGATATGCTTATAGCTCAGGAGAATGTCCAGCTGCACCTGCATTTCAAGCAACCCTC
CGCCGATCGTAAGCGACTGGTGTTGGGATTGCTCAAGTCCAAAGTCATTCCAGAAATAAATCAGAAATGT
CTTTGAGGTGTTTACTGATACTCAAACAAGCATGCTCTTCGGAATACCAAAGAGCGCAATATGCGTTCAA
AGATTCGATGATTCACTGAATTCTGCAATTCACATTACGTATCGCATTTCGCAGCGTTCTTCATCGATGA
CGAGTCTAGATACTAGGTGTGGTCGGCGTC

The sequence is actually broken up onto multiple lines, rather being written to a single line. This can be verified with wc -l filename.fna. This is going to be a problem downstream. Any ideas on how to fix this?

ADD REPLYlink modified 14 months ago by genomax72k • written 14 months ago by caverill40
1

It is ok to have fasta files with sequence wrapping around like it. What program are you planning to use for downstream analysis? If you do want to make them single line fasta then use the solution in this thread: Multiline Fasta To Single Line Fasta

ADD REPLYlink written 14 months ago by genomax72k

Gotcha. I do some manipulations using sed downstram which require the sequence line to be a single line. Thanks for this link.

ADD REPLYlink written 14 months ago by caverill40
2
gravatar for genomax
14 months ago by
genomax72k
United States
genomax72k wrote:

Use bbduk.sh from BBMap suite with literal=TCCTGCGCTTATTGATATGC,TCCTCCGCTTATTGATATGC ktrim=l k=10 (and any additional options you need). This will remove the primer and all the bases to the left. Here is a detailed guide for the program.

ADD COMMENTlink modified 14 months ago • written 14 months ago by genomax72k

This works so well! Thank you! I see that ktrim=l tells bbduk to trim everything to the left. Can you elaborate on what k=10 indicates?

ADD REPLYlink written 14 months ago by caverill40
1

That is k-mer size used for the initial match of the literal sequence (which is then extended outwards).

ADD REPLYlink written 14 months ago by genomax72k

Following up on this solution: what happens in the kmer is not found? Is it just passed to the output file? I am running this again to filter a sequence file that contains reads with multiple different primers. I would like to subset to a particular primer set of interest. To do this, i figured I could just use bbduk to trim the primer of interest, and discard reads that do not match the primer. Is there a simple way to do this with bbduk?

ADD REPLYlink written 9 months ago by caverill40
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1801 users visited in the last hour