matched designed sequence in nanopore reads
1
0
Entering edit mode
4.4 years ago
Richard ▴ 590

Hi folks,

I have reads with a known construct from which I want to extract some subsequence. As an example, this sequence represents what I'm looking at:

aaaaaaaaaaaaaaaabbbbbbbbbbbbbbccccccccccddddddddeeeeeeeeeeeeeeeeeeee

where the 'a' bases are nanopore adapter sequence 'b' bases are another adapter sequence 'c' and 'd' are the bases I want to extract. The 'c' sequences should come from a set of known sequences while the 'd' sequences are of known length, but unkown content. 'e' should be either a polyA or polyT sequnce of unknown length.

I have tried analyzing this by using porechop to trim the nanopore adapter 'a', skewer (with high error tolerance) to trim 'b', then searching for a polyA or polyT sequence to then find c and d and match d to the known sequences.

This whole thing seems very heuristic, and if my polyA track has a base error near the interface with the 'd' sequence is likely to by off by a base or two.

I feel as though my determination of the 'c' and 'd' sequence would be better if I had a way to look for a sequence using a regex to match the known sequence surrounding my sequence of interest, but also with error tolerance.

What sort of approach would you try?

sequence alignment • 1.3k views
ADD COMMENT
0
Entering edit mode

I would recommend that you try bbduk.sh in filter mode with sequence you expect in literal= option. Here is a guide to get you started.

ADD REPLY
0
Entering edit mode
4.4 years ago
zubenel ▴ 120

I assumed that you were able to remove adapter sequences 'a' and 'b'. I used a sample file with sequences in format:

ccccccccccddddddddeeeeeeeeeeeeeeeeeeee

File was called input.txt and it is given below:

CCATTGAAGCAGTCTGATAACAAAAAGAAAATAAAAAA
ATTGTATCTGTGAGCACCAGAACAAAAATAAAAAAAAA

I have written Perl script that removes the longest polyA tract with error rate below some predefined value $allowed_error_rate. Also you need to choose a minimum possible length of polyA tract $min_len and a leftmost position from which polyA tract may start $left_pos.

use strict;
use warnings;

# Here it is written as input.txt file is in the same directory as a script
my $input = 'input.txt';

# Open input file with sequences    
unless(open(INPUT, $input)) {
    die "\nCannot open $input\n";
}

while(my $seq = <INPUT>) {

    chomp $seq; # removes newline character

    # Calculate length of sequence
    my $total_len  = length($seq);

    # Choose a minimum possible length of polyA sequence
    my $min_len = 10;

    # Calculate a rightmost position from which polyA sequence may start
    my $right_pos = $total_len - $min_len;

    # Choose a leftmost position from which polyA sequence may start
    my $left_pos = 15;

    for my $i ( $left_pos .. $right_pos ) {

        # Calculate length of a new subsequence
        my $new_len = $total_len - $i;

        # Extract subsequence starting with position nth (nth position equals seq[n-1]) 
        my $subseq = substr($seq, $i - 1, $new_len);

        # Create polyA sequence of the same length as substring
        my $polya = "A" x $new_len;

        # Choose allowed error rate in polyA sequence
        my $allowed_error_rate = 0.15;

        # Calculate actual error rate by comparing substring with polyA sequence
        my $error_rate = StringDiff( $polya, $subseq)/($new_len);

        # Remove the longest polyA sequence with error rate below allowed
        # Print the sequence that is left after removal of polyA
        if ($error_rate <= $allowed_error_rate) {

            my $match = substr($seq, 0, $i);
            print "$match\n";

            last;
        }
    }

}

close(INPUT);

# Function to calculate differences between strings
sub StringDiff {

    # This function needs two strings $s1 and $s2 of the same length
    my ($s1, $s2) = @_;

    # Strings are compared with ^ operator. 
    # It returns nul character if at some position strings contain the same character
    my $str = $s1 ^ $s2;

    # By matching $str with non-nul characters we can calculate number of differences 
    my $differences = 0;
    while ($str =~ /[^\0]/g) {
        $differences += 1;
    }
    return $differences;
}
ADD COMMENT

Login before adding your answer.

Traffic: 2694 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6