Perl Stript:Break Contigs Into Overlapping Sequences
2
0
Entering edit mode
9.1 years ago
biolab ★ 1.4k

Dear All,

I am a perl beginner. I have a fasta file with many contig sequences, and need to break these contigs into 2kb overlapping fragments (with overlap length of 100bp). Could anyone help to write a perl script for me, when you have spare time? I will greatly appreciate your help. MANY THANKS!

Biolab

perl bioinformatics • 4.0k views
ADD COMMENT
2
Entering edit mode

What have you written so far? You're not likely to get many people willing to just randomly write scripts for you.

ADD REPLY
0
Entering edit mode

I am learning perl and am a true starter. Could you give some possible solutions to solving my question? THANKS A LOT! Biolab

ADD REPLY
0
Entering edit mode

Well, how about I just give you a possible work-flow. You'll probably want to use bioperl to make life easier.

for contig in file
    contig_length = length of contig
    start_position = 0 #I assume bioperl is 0-based
    while(start_position < contig_length) {
        stop_position = start_position+1999
        if(stop_position >= contig_length) stop_position = contig_length-1
        sequence = extract subsequence given start/stop_position
        write sequence to file
        start_position += 1900
    }
}
ADD REPLY
0
Entering edit mode

Thank you very much!!

ADD REPLY
0
Entering edit mode

No problem. Keep in mind that that general workflow might not produce exactly what you want at the end of a contig. You might want to just break out of a loop if the length of the subsequence is very short, since otherwise the last subsection could be contained entirely within the next to last subsection.

ADD REPLY
0
Entering edit mode

This question have been asked before because I remember answering it. Search!

ADD REPLY
3
Entering edit mode
8.9 years ago
brentp 24k

not perl, but see pyfasta: https://pypi.python.org/pypi/pyfasta/#command-line-interface you can use it from the command-line as:

$ pyfasta split -n 1 -k 1000 -o 200 original.fasta
ADD COMMENT
0
Entering edit mode

this one is really interesting. thanks!

ADD REPLY
0
Entering edit mode
7.7 years ago
biolab ★ 1.4k
#!/usr/bin/perl
use strict;
use warnings;

$/ = ">";
open my $fasta_file, '<', $ARGV[0] or die $!;
my $omitted = <$fasta_file>;
while (<$fasta_file>) {
    s/\r//;
    chomp;
    my ($id, $seq) = /(.+?)\n(.+)/s or next;
    $seq =~ s/\n//g;
    break_into_segments($id, $seq);
}
close $fasta_file;

sub break_into_segments {
    my ($id, $seq) = @_;
    my $seq_len = length $seq;
    my $i;
    while ($seq_len > 2000) {
        $i++;
        my $extr_seq = substr($seq, 0, 2000);
        $seq = substr($seq,1900);
        $seq_len -= 1900;
        print ">$id\_$i\n$extr_seq\n";
    }
    $i++;
    print ">$id\_$i\n$seq\n";
}
ADD COMMENT

Login before adding your answer.

Traffic: 1914 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6