Question: Find motif in reads and delete everything before / extract everything after
0
gravatar for zack.saud
5 months ago by
zack.saud10
zack.saud10 wrote:

Hi all,

I have sequence some plasmid DNA using Nanopore sequencing (whole plasmid was sequenced). I am only interested in one region, therefore I would like to extract from the fasta file (around 10000 reads) just 425 bp from a specified location in each 4600 bp read (i.e. everything after atgacccg to be kept, or delete everything before this sequence).

Would someone be as kind as to help me do this? I've tried a few tools, but they fail with the long nanopore reads.

Many thanks in advance

sequencing next-gen • 217 views
ADD COMMENTlink modified 5 months ago by manuel.belmadani1.2k • written 5 months ago by zack.saud10
2
gravatar for manuel.belmadani
5 months ago by
Canada
manuel.belmadani1.2k wrote:

Any tool using regular expression supporting groups ( a part of a regex typically wrapped in at ( ) that can be referenced with \1 later) should work.

Example using sed:

$ cat my.fasta

>header1
TGTCATCGATCATCTGACTGATGCTGACTGACTGACTGATGCTGTGATGCATGCATGCTGACTGACTGACTGACTGatgacccgACCGGGTTTTAAAAACCCCCCGGGGGGGTTTTTTTTaaaaaaaaaaaaa

Extract the next 3 characters after atgacccg:

sed -e 's|.*atgacccg\(.\{3\}\).*|\1|g' my.fasta
>header1
ACC

Here the syntax of sed means substitute anything that has atgacccg, capture the group with is .{3}, meaning 3 of anything (you would use 425 in your case) and then followed by anything. The second part with \1 means replace the matched string with group #1. In the case where there is not match (like your header), there is not replacement.

One caveat: Make sure your fasta sequence is continuous and not broken on multiple lines. See here: , you could adapt something like that by adding a pipe to sed before formatting back to fasta.

ADD COMMENTlink modified 5 months ago • written 5 months ago by manuel.belmadani1.2k
1

Hi, Many thanks for this, it worked like a charm! If I could ask for your help just one more time, the data is a little bit noisy, and it appears that it is retaining reads that do not contain this sequence. Would you happen to know how to modify the above script to discard any reads that do not contain the specified sequence? Many thanks again zack

ADD REPLYlink written 5 months ago by zack.saud10

When you linearize, you could prefix the input to the sed command with a grep for atgacccg, i.e.

grep "atgacccg"  my_linearized.fasta  |  sed -e 's|.*atgacccg\(.\{3\}\).*|\1|g' my.fasta > only_matching_atgacccg.fasta

That way only sequences matching atgacccg are retained and processed by sed.

ADD REPLYlink written 5 months ago by manuel.belmadani1.2k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1732 users visited in the last hour