Trim fastq after and before motif occurrence
20 months ago

Hi everyone,

Is there any easy way to trim a fasta/fastq before and after a certain motif occurrence?

As an example, this would be my sequence: ATGAAACCTTTGGGGCCCCAGTCAGCTC

My motif of interest would be: GGGGCCCC

I want to keep the motif plus, let's say, 5 bp on its 5' side and 3 bp on its 3' side, and trim away everything else, which would give: CCTTTGGGGCCCCAGT

I searched around a bit but could not find any fitting tool. Any ideas/suggestions?

You can probably adapt this solution in awk: Split a sequence in a fastq file

Because of the unique requirement here, you are likely going to need to write something yourself. Trimming programs are generally set up to trim or discard sequence (to the left or to the right) once a particular k-mer motif is found in the read.

Using bbduk.sh from the BBMap suite, you can pull out just the reads that contain the motif of interest (outm writes the matching reads) by doing:

$ bbmap/bbduk.sh literal=NNNNNGGGGCCCCNNNNN k=18 copyundefined in=tt.fq outm=stdout.fq minlen=5


You can then work on that reduced dataset.
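If you do end up writing something yourself for that step, here is a minimal Python sketch of the idea, not a tested tool. The function names `motif_window` and `trim_fastq` are made up for illustration. It assumes plain 4-line FASTQ records (no wrapped sequence lines), looks only at the first forward-strand, case-sensitive occurrence of the motif, truncates the flanks if the motif sits near a read end, and drops reads without the motif.

```python
from typing import Iterable, Iterator, Optional, Tuple


def motif_window(seq: str, motif: str, upstream: int, downstream: int) -> Optional[Tuple[int, int]]:
    """Return (start, end) slice bounds for the first motif hit plus flanks, or None if absent."""
    hit = seq.find(motif)
    if hit == -1:
        return None
    start = max(0, hit - upstream)                       # clamp at the 5' end of the read
    end = min(len(seq), hit + len(motif) + downstream)   # clamp at the 3' end of the read
    return start, end


def trim_fastq(lines: Iterable[str], motif: str, upstream: int, downstream: int) -> Iterator[str]:
    """Yield trimmed 4-line FASTQ records; reads without the motif are skipped."""
    it = iter(lines)
    # zip(it, it, it, it) consumes four lines per record and ignores an incomplete trailing record
    for header, seq, plus, qual in zip(it, it, it, it):
        seq, qual = seq.rstrip(), qual.rstrip()
        window = motif_window(seq, motif, upstream, downstream)
        if window is None:
            continue
        start, end = window
        # slice sequence and qualities with the same bounds so they stay in sync
        yield f"{header.rstrip()}\n{seq[start:end]}\n{plus.rstrip()}\n{qual[start:end]}\n"


# The example from the question: keep 5 bp upstream and 3 bp downstream of GGGGCCCC
record = ["@read1\n", "ATGAAACCTTTGGGGCCCCAGTCAGCTC\n", "+\n", "I" * 28 + "\n"]
print("".join(trim_fastq(record, "GGGGCCCC", 5, 3)), end="")
# sequence line of the output: CCTTTGGGGCCCCAGT
```

For a real file you would pass an open file handle instead of the list, since iterating over a text file yields its lines. Handling reverse-complement hits or multiple occurrences per read would need extra logic that this sketch leaves out.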

20 months ago

Just tested seqkit amplicon, which did exactly that (the option is only available in the pre-release of v0.11.0 so far: https://github.com/shenwei356/seqkit/releases/tag/v0.11.0-dev).

The corresponding command would be:

seqkit amplicon input.fastq -F GGGGCCCC -r -5:3 -f -o output.fastq