Question

matched designed sequence in nanopore reads

0

Entering edit mode

4.4 years ago

Richard ▴ 590

Hi folks,

I have reads with a known construct from which I want to extract some subsequence. As an example, this sequence represents what I'm looking at:

aaaaaaaaaaaaaaaabbbbbbbbbbbbbbccccccccccddddddddeeeeeeeeeeeeeeeeeeee

where the 'a' bases are nanopore adapter sequence 'b' bases are another adapter sequence 'c' and 'd' are the bases I want to extract. The 'c' sequences should come from a set of known sequences while the 'd' sequences are of known length, but unkown content. 'e' should be either a polyA or polyT sequnce of unknown length.

I have tried analyzing this by using porechop to trim the nanopore adapter 'a', skewer (with high error tolerance) to trim 'b', then searching for a polyA or polyT sequence to then find c and d and match d to the known sequences.

This whole thing seems very heuristic, and if my polyA track has a base error near the interface with the 'd' sequence is likely to by off by a base or two.

I feel as though my determination of the 'c' and 'd' sequence would be better if I had a way to look for a sequence using a regex to match the known sequence surrounding my sequence of interest, but also with error tolerance.

What sort of approach would you try?

sequence alignment • 1.3k views

ADD COMMENT • link updated 4.4 years ago by zubenel ▴ 120 • written 4.4 years ago by Richard ▴ 590

0

Entering edit mode

I would recommend that you try bbduk.sh in filter mode with sequence you expect in literal= option. Here is a guide to get you started.

ADD REPLY • link 4.4 years ago by GenoMax 141k

score 0 · Answer 1 · 2019-12-06

I assumed that you were able to remove adapter sequences 'a' and 'b'. I used a sample file with sequences in format:

ccccccccccddddddddeeeeeeeeeeeeeeeeeeee

File was called input.txt and it is given below:

CCATTGAAGCAGTCTGATAACAAAAAGAAAATAAAAAA
ATTGTATCTGTGAGCACCAGAACAAAAATAAAAAAAAA

I have written Perl script that removes the longest polyA tract with error rate below some predefined value $allowed_error_rate. Also you need to choose a minimum possible length of polyA tract $min_len and a leftmost position from which polyA tract may start $left_pos.

use strict;
use warnings;

# Here it is written as input.txt file is in the same directory as a script
my $input = 'input.txt';

# Open input file with sequences    
unless(open(INPUT, $input)) {
    die "\nCannot open $input\n";
}

while(my $seq = <INPUT>) {

    chomp $seq; # removes newline character

    # Calculate length of sequence
    my $total_len  = length($seq);

    # Choose a minimum possible length of polyA sequence
    my $min_len = 10;

    # Calculate a rightmost position from which polyA sequence may start
    my $right_pos = $total_len - $min_len;

    # Choose a leftmost position from which polyA sequence may start
    my $left_pos = 15;

    for my $i ( $left_pos .. $right_pos ) {

        # Calculate length of a new subsequence
        my $new_len = $total_len - $i;

        # Extract subsequence starting with position nth (nth position equals seq[n-1]) 
        my $subseq = substr($seq, $i - 1, $new_len);

        # Create polyA sequence of the same length as substring
        my $polya = "A" x $new_len;

        # Choose allowed error rate in polyA sequence
        my $allowed_error_rate = 0.15;

        # Calculate actual error rate by comparing substring with polyA sequence
        my $error_rate = StringDiff( $polya, $subseq)/($new_len);

        # Remove the longest polyA sequence with error rate below allowed
        # Print the sequence that is left after removal of polyA
        if ($error_rate <= $allowed_error_rate) {

            my $match = substr($seq, 0, $i);
            print "$match\n";

            last;
        }
    }

}

close(INPUT);

# Function to calculate differences between strings
sub StringDiff {

    # This function needs two strings $s1 and $s2 of the same length
    my ($s1, $s2) = @_;

    # Strings are compared with ^ operator. 
    # It returns nul character if at some position strings contain the same character
    my $str = $s1 ^ $s2;

    # By matching $str with non-nul characters we can calculate number of differences 
    my $differences = 0;
    while ($str =~ /[^\0]/g) {
        $differences += 1;
    }
    return $differences;
}