Question

how to split/cut/restrict DNA strand

1

7.2 years ago

Roman Luštrik ▴ 130

My goal is to split a sequence at a specific site into two separate sequences. The search for the site should be somewhat fuzzy, to tolerate errors introduced by the sequencing pipeline (basecalling on MinION).

Example:

Assume the sequence below. X, Y, Q and Z stand in for nucleotides that are not needed to understand the problem but are useful for demonstration purposes.

XXXXXXXXXXXXXXYYYYYACTCATAQQQQQQQQQZZZZZZZZZZZZZ
                   |-----|

I would like to find site ACTCATA (with fuzzy matching) and split the sequence into

XXXXXXXXXXXXXXYYYYY

and

QQQQQQQQQZZZZZZZZZZZZZ

optionally discarding the matched sequence.

Bonus points if this is done on fastq files where data on quality of reads is also split into new strings.

This could probably be accomplished the pedestrian way in Biopython, but I was wondering whether I missed a tool that does what I describe above.

fasta fastq manipulation python R • 1.7k views

ADD COMMENT • link updated 7.2 years ago by shenwei356 8.4k • written 7.2 years ago by Roman Luštrik ▴ 130

1

Are you looking for adapter sequences? If so: https://github.com/rrwick/Porechop

ADD REPLY • link 7.2 years ago by WouterDeCoster 47k

0

Thank you @WouterDeCoster. I may end up using this in another part of the pipeline.

ADD REPLY • link 7.2 years ago by Roman Luštrik ▴ 130

score 2 · Accepted Answer · 2017-02-16

2

7.2 years ago

shenwei356 8.4k

seqkit

$ cat read.fq
@seq
XXXXXXXXXXXXXXYYYYYACTCATAQQQQQQQQQZZZZZZZZZZZZZ
+
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

$ cat read.fq | seqkit locate -p ACTCATA
seqID   patternName     pattern strand  start   end     matched
seq     ACTCATA ACTCATA +       20      26      ACTCATA

$ cat read.fq | seqkit subseq -r 1:19
@seq
XXXXXXXXXXXXXXYYYYY
+
GGGGGGGGGGGGGGGGGGG

$ cat read.fq | seqkit subseq -r 27:-1
@seq
QQQQQQQQQZZZZZZZZZZZZZ
+
GGGGGGGGGGGGGGGGGGGGGG

ADD COMMENT • link 7.2 years ago by shenwei356 8.4k

0

What about a file full of reads?

ADD REPLY • link 7.2 years ago by GenoMax 141k

0

Writing a script is the better way for more than one read :P

ADD REPLY • link 7.2 years ago by shenwei356 8.4k
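As an illustration of the "write a script" suggestion, here is a minimal sketch in plain Python that walks through every read in a FASTQ stream and splits both the sequence and the quality string at the site. It assumes strict 4-line FASTQ records and an exact match; the function names (`read_fastq`, `split_read`, `format_fastq`) and the `_left`/`_right` name suffixes are made up for this example and are not part of seqkit or Biopython.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) from a strict 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or an empty line) stops the iteration
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual  # drop the leading '@'


def split_read(name, seq, qual, site):
    """Split one read at the first exact occurrence of `site`.

    Returns two (name, seq, qual) records with the site itself discarded,
    or None when the site is not found.
    """
    start = seq.find(site)
    if start == -1:
        return None
    end = start + len(site)
    left = (name + "_left", seq[:start], qual[:start])
    right = (name + "_right", seq[end:], qual[end:])
    return left, right


def format_fastq(record):
    """Format a (name, seq, qual) record as FASTQ text."""
    name, seq, qual = record
    return f"@{name}\n{seq}\n+\n{qual}\n"


# Demo on the read from the answer above, held in memory instead of a file.
demo_seq = "X" * 14 + "Y" * 5 + "ACTCATA" + "Q" * 9 + "Z" * 13
demo_fastq = f"@seq\n{demo_seq}\n+\n{'G' * len(demo_seq)}\n"

for name, seq, qual in read_fastq(io.StringIO(demo_fastq)):
    parts = split_read(name, seq, qual, "ACTCATA")
    if parts is not None:
        for record in parts:
            print(format_fastq(record), end="")
```

For a real file, the `io.StringIO` object would be replaced by an opened file handle. Reads without the site are silently skipped here; whether to keep or drop them is a choice for the pipeline.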

0

Doesn't this expect an exact match?

ADD REPLY • link 7.2 years ago by WouterDeCoster 47k

1

Regular expressions (the default) and motifs containing degenerate bases such as N (-d) are supported: http://bioinf.shenwei.me/seqkit/usage/#locate

ADD REPLY • link 7.2 years ago by shenwei356 8.4k
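To make the idea of degenerate-base matching concrete, the sketch below expands a motif written with IUPAC ambiguity codes into a regular expression and searches with it. This only illustrates the concept; it is not how seqkit implements `-d`, and the helper names (`motif_to_regex`, `locate`) are hypothetical.

```python
import re

# Standard IUPAC nucleotide ambiguity codes, written as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def motif_to_regex(motif):
    """Compile a degenerate motif such as 'ACTNATA' into a regex."""
    return re.compile("".join(IUPAC[base] for base in motif.upper()))


def locate(seq, motif):
    """Return the 1-based inclusive (start, end) of the first match, or None."""
    match = motif_to_regex(motif).search(seq)
    if match is None:
        return None
    return match.start() + 1, match.end()


# 'N' in the fourth position accepts any of A, C, G or T there.
print(locate("GGACTGATAGG", "ACTNATA"))
```

A degenerate motif tolerates variation only at positions fixed in advance, so it cannot stand in for a search that allows a mismatch anywhere in the site.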

0

I'm curious whether it's possible, in the context of locate, to allow up to N random mismatches in the target string?

ADD REPLY • link 7.2 years ago by Roman Luštrik ▴ 130

0

seqkit locate is based on regular expression matching, not local sequence alignment, so random mismatches cannot be handled.

ADD REPLY • link 7.2 years ago by shenwei356 8.4k
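For the mismatch-tolerant search asked about above, one simple approach outside of seqkit is a sliding-window Hamming-distance scan. The sketch below is an assumption-laden illustration, not a feature of any tool mentioned in this thread: it tolerates substitutions only, so insertions and deletions (which basecalling errors can also produce) would need an alignment-based method instead. The names `hamming`, `fuzzy_find` and `fuzzy_split` are hypothetical.

```python
def hamming(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def fuzzy_find(seq, site, max_mismatches=1):
    """Return (start, mismatches) for the best window, 0-based, or None.

    Every window of len(site) is compared against the site; the earliest
    window with the fewest mismatches (at most max_mismatches) wins.
    """
    best = None
    for start in range(len(seq) - len(site) + 1):
        distance = hamming(seq[start:start + len(site)], site)
        if distance <= max_mismatches and (best is None or distance < best[1]):
            best = (start, distance)
            if distance == 0:  # an exact match cannot be improved on
                break
    return best


def fuzzy_split(seq, qual, site, max_mismatches=1):
    """Split sequence and quality around the best fuzzy hit, discarding the site."""
    hit = fuzzy_find(seq, site, max_mismatches)
    if hit is None:
        return None
    start = hit[0]
    end = start + len(site)
    return (seq[:start], qual[:start]), (seq[end:], qual[end:])


# The site carries one substitution (C -> G) compared with ACTCATA.
read = "X" * 14 + "Y" * 5 + "ACTGATA" + "Q" * 9 + "Z" * 13
print(fuzzy_find(read, "ACTCATA", max_mismatches=1))
```

The scan costs time proportional to read length times site length, which is fine for a short site like this one. Raising `max_mismatches` on a 7-base site quickly increases the chance of spurious hits, so the threshold is worth checking against real data.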