Question: Perl Stript:Break Contigs Into Overlapping Sequences
0
gravatar for biolab
5.5 years ago by
biolab1.1k
biolab1.1k wrote:

Dear All,

I am a perl beginner. I have a fasta file with many contig sequences, and need to break these contigs into 2kb overlapping fragments (with overlap length of 100bp). Could anyone help to write a perl script for me, when you have spare time? I will greatly appreciate your help. MANY THANKS!

Biolab

perl bioinformatics • 2.9k views
ADD COMMENTlink modified 4.0 years ago • written 5.5 years ago by biolab1.1k
2

What have you written so far? You're not likely to get many people willing to just randomly write scripts for you.

ADD REPLYlink written 5.5 years ago by Devon Ryan89k

I am learning perl and am a true starter. Could you give some possible solutions to solving my question? THANKS A LOT! Biolab

ADD REPLYlink written 5.5 years ago by biolab1.1k

Well, how about I just give you a possible work-flow. You'll probably want to use bioperl to make life easier.

for contig in file
    contig_length = length of contig
    start_position = 0 #I assume bioperl is 0-based
    while(start_position < contig_length) {
        stop_position = start_position+1999
        if(stop_position >= contig_length) stop_position = contig_length-1
        sequence = extract subsequence given start/stop_position
        write sequence to file
        start_position += 1900
    }
}
ADD REPLYlink modified 5.5 years ago • written 5.5 years ago by Devon Ryan89k

Thank you very much!!

ADD REPLYlink written 5.5 years ago by biolab1.1k

No problem. Keep in mind that that general workflow might not produce exactly what you want at the end of a contig. You might want to just break out of a loop if the length of the subsequence is very short, since otherwise the last subsection could be contained entirely within the next to last subsection.

ADD REPLYlink written 5.5 years ago by Devon Ryan89k

This question have been asked before because I remember answering it. Search!

ADD REPLYlink written 5.5 years ago by Martin A Hansen3.0k
3
gravatar for brentp
5.3 years ago by
brentp23k
Salt Lake City, UT
brentp23k wrote:

not perl, but see pyfasta: https://pypi.python.org/pypi/pyfasta/#command-line-interface you can use it from the command-line as:

$ pyfasta split -n 1 -k 1000 -o 200 original.fasta
ADD COMMENTlink written 5.3 years ago by brentp23k

this one is really interesting. thanks!

ADD REPLYlink written 5.3 years ago by Pavel Senin1.9k
0
gravatar for biolab
4.0 years ago by
biolab1.1k
biolab1.1k wrote:
#!/usr/bin/perl
use strict;
use warnings;

$/ = ">";
open my $fasta_file, '<', $ARGV[0] or die $!;
my $omitted = <$fasta_file>;
while (<$fasta_file>) {
    s/\r//;
    chomp;
    my ($id, $seq) = /(.+?)\n(.+)/s or next;
    $seq =~ s/\n//g;
    break_into_segments($id, $seq);
}
close $fasta_file;

sub break_into_segments {
    my ($id, $seq) = @_;
    my $seq_len = length $seq;
    my $i;
    while ($seq_len > 2000) {
        $i++;
        my $extr_seq = substr($seq, 0, 2000);
        $seq = substr($seq,1900);
        $seq_len -= 1900;
        print ">$id\_$i\n$extr_seq\n";
    }
    $i++;
    print ">$id\_$i\n$seq\n";
}
ADD COMMENTlink modified 4.0 years ago • written 4.0 years ago by biolab1.1k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 715 users visited in the last hour