How to split/cut/restrict a DNA strand at a fuzzy-matched site
4.7 years ago

My goal is to split a sequence at a specific site into two separate sequences. The search for the site should be somewhat fuzzy, because the sequencing pipeline (basecalling on MinION) introduces errors.

Example:

Assume the sequence below. X, Y, Q and Z stand for nucleotides that are not needed to understand the problem but are useful for demonstration.

XXXXXXXXXXXXXXYYYYYACTCATAQQQQQQQQQZZZZZZZZZZZZZ
                   |-----|

I would like to find site ACTCATA (with fuzzy matching) and split the sequence into

XXXXXXXXXXXXXXYYYYY

and

QQQQQQQQQZZZZZZZZZZZZZ

optionally discarding the matched sequence.

Bonus points if this works on FASTQ files, so that the per-base quality string of each read is split along with the sequence.

This could probably be accomplished the pedestrian way in Biopython, but I was wondering whether I missed a tool that does what I describe above.

fasta fastq manipulation python R • 994 views

Are you looking for adapter sequences? If so: https://github.com/rrwick/Porechop


Thank you @WouterDeCoster. I may end up using this in another part of the pipeline.

4.7 years ago

Use seqkit: locate to find the site, then subseq to cut on either side of it.

$ cat read.fq
@seq
XXXXXXXXXXXXXXYYYYYACTCATAQQQQQQQQQZZZZZZZZZZZZZ
+
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

$ cat read.fq | seqkit locate -p ACTCATA
seqID   patternName     pattern strand  start   end     matched
seq     ACTCATA ACTCATA +       20      26      ACTCATA

$ cat read.fq | seqkit subseq -r 1:19
@seq
XXXXXXXXXXXXXXYYYYY
+
GGGGGGGGGGGGGGGGGGG

$ cat read.fq | seqkit subseq -r 27:-1
@seq
QQQQQQQQQZZZZZZZZZZZZZ
+
GGGGGGGGGGGGGGGGGGGGGG

What about a file full of reads?


Writing a script is the better way for more than one read :P
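As a rough illustration of what such a script could look like, here is a minimal pure-Python sketch that walks a FASTQ stream and splits every read at the first exact occurrence of the site, cutting the quality string at the same coordinates. The helper names (read_fastq, split_at_site, format_fastq) and the _left/_right name suffixes are made up for this example, and it assumes strict 4-line FASTQ records; it is not part of seqkit.

```python
from io import StringIO
from typing import Iterator, Optional, TextIO, Tuple

Record = Tuple[str, str, str]  # (name, sequence, quality)


def read_fastq(handle: TextIO) -> Iterator[Record]:
    """Yield (name, sequence, quality); assumes 4 lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual


def split_at_site(record: Record, site: str) -> Optional[Tuple[Record, Record]]:
    """Split at the first exact match of `site`, discarding the site itself."""
    name, seq, qual = record
    start = seq.find(site)
    if start == -1:
        return None  # site not present in this read
    end = start + len(site)
    left = (name + "_left", seq[:start], qual[:start])
    right = (name + "_right", seq[end:], qual[end:])
    return left, right


def format_fastq(record: Record) -> str:
    name, seq, qual = record
    return f"@{name}\n{seq}\n+\n{qual}\n"


if __name__ == "__main__":
    # Hypothetical in-memory input; replace with open("read.fq") for a real file.
    seq = "X" * 14 + "Y" * 5 + "ACTCATA" + "Q" * 9 + "Z" * 13
    demo = StringIO(f"@seq\n{seq}\n+\n{'G' * len(seq)}\n")
    for rec in read_fastq(demo):
        parts = split_at_site(rec, "ACTCATA")
        if parts is not None:
            for part in parts:
                print(format_fastq(part), end="")
```

Reads without the site are silently skipped here; whether to drop them or pass them through unchanged depends on the pipeline.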


Doesn't this expect an exact match?


Regular expressions (the default) and motifs containing degenerate bases such as N (-d) are supported: http://bioinf.shenwei.me/seqkit/usage/#locate
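To make the idea of a degenerate motif concrete, here is a small Python sketch that expands IUPAC codes into a regular expression and reports 1-based inclusive coordinates, similar in spirit to the locate output above. This is only an illustration of the concept, not how seqkit implements -d; the function names are invented for the example.

```python
import re
from typing import List, Tuple

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def degenerate_to_regex(motif: str) -> "re.Pattern[str]":
    """Compile a degenerate motif such as ACTNATA into a regex."""
    return re.compile("".join(IUPAC[base] for base in motif.upper()))


def locate(seq: str, motif: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, matched) with 1-based inclusive coordinates
    for each non-overlapping match on the forward strand."""
    pattern = degenerate_to_regex(motif)
    return [(m.start() + 1, m.end(), m.group()) for m in pattern.finditer(seq)]


if __name__ == "__main__":
    print(locate("GGACTGATATT", "ACTNATA"))
```

Note that an N only tolerates a substitution at that one fixed position. It does not cover a mismatch at an arbitrary position, nor insertions or deletions.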


I'm curious whether, in the context of locate, it's possible to allow N random mismatches anywhere in the target string?


seqkit locate is based on regular-expression matching, not local sequence alignment, so random mismatches cannot be matched.
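For reference, the "pedestrian" route mentioned in the question could look roughly like the sketch below: slide a window over the read, count mismatches against the site (Hamming distance), and cut the sequence and quality strings at the best window. The function names, the max_mismatches default, and the choice to attach a kept site to the right-hand fragment are all assumptions made for this example. It tolerates substitutions only; MinION reads also contain insertions and deletions, which would need an edit-distance or alignment-based search instead.

```python
from typing import Optional, Tuple


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_fuzzy(seq: str, site: str, max_mismatches: int = 1) -> Optional[Tuple[int, int, int]]:
    """Return (start, end, mismatches) of the best window, 0-based half-open,
    or None if no window is within max_mismatches. Ties go to the leftmost hit."""
    best = None
    for start in range(len(seq) - len(site) + 1):
        d = hamming(seq[start:start + len(site)], site)
        if d <= max_mismatches and (best is None or d < best[2]):
            best = (start, start + len(site), d)
    return best


def split_fuzzy(seq: str, qual: str, site: str,
                max_mismatches: int = 1, keep_site: bool = False):
    """Split sequence and quality at the fuzzy match.
    With keep_site=True the matched bases stay on the right fragment (an assumption)."""
    hit = find_fuzzy(seq, site, max_mismatches)
    if hit is None:
        return None
    start, end, _ = hit
    right_start = start if keep_site else end
    return (seq[:start], qual[:start]), (seq[right_start:], qual[right_start:])


if __name__ == "__main__":
    # Same toy read as above, but with one basecalling error in the site (C -> G).
    read = "X" * 14 + "Y" * 5 + "ACTGATA" + "Q" * 9 + "Z" * 13
    print(split_fuzzy(read, "G" * len(read), "ACTCATA", max_mismatches=1))
```

The scan is O(read length × site length) per read, which is fine for a short site but worth keeping in mind for large files.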
