Find motif in reads and delete everything before / extract everything after
1
0
Entering edit mode
3.2 years ago
zack.saud ▴ 50

Hi all,

I have sequence some plasmid DNA using Nanopore sequencing (whole plasmid was sequenced). I am only interested in one region, therefore I would like to extract from the fasta file (around 10000 reads) just 425 bp from a specified location in each 4600 bp read (i.e. everything after atgacccg to be kept, or delete everything before this sequence).

Would someone be as kind as to help me do this? I've tried a few tools, but they fail with the long nanopore reads.

next-gen sequencing • 715 views
2
Entering edit mode
3.2 years ago

Any tool using regular expression supporting groups ( a part of a regex typically wrapped in at ( ) that can be referenced with \1 later) should work.

Example using sed:

\$ cat my.fasta

TGTCATCGATCATCTGACTGATGCTGACTGACTGACTGATGCTGTGATGCATGCATGCTGACTGACTGACTGACTGatgacccgACCGGGTTTTAAAAACCCCCCGGGGGGGTTTTTTTTaaaaaaaaaaaaa


Extract the next 3 characters after atgacccg:

sed -e 's|.*atgacccg$$.\{3\}$$.*|\1|g' my.fasta
ACC


Here the syntax of sed means substitute anything that has atgacccg, capture the group with is .{3}, meaning 3 of anything (you would use 425 in your case) and then followed by anything. The second part with \1 means replace the matched string with group #1. In the case where there is not match (like your header), there is not replacement.

One caveat: Make sure your fasta sequence is continuous and not broken on multiple lines. See here: , you could adapt something like that by adding a pipe to sed before formatting back to fasta.

1
Entering edit mode

Hi, Many thanks for this, it worked like a charm! If I could ask for your help just one more time, the data is a little bit noisy, and it appears that it is retaining reads that do not contain this sequence. Would you happen to know how to modify the above script to discard any reads that do not contain the specified sequence? Many thanks again zack

0
Entering edit mode

When you linearize, you could prefix the input to the sed command with a grep for atgacccg, i.e.

grep "atgacccg"  my_linearized.fasta  |  sed -e 's|.*atgacccg$$.\{3\}$$.*|\1|g' my.fasta > only_matching_atgacccg.fasta


That way only sequences matching atgacccg are retained and processed by sed.