Question: How To Split A Long DNA Sequence Into Fixed-Length Parts With Perl/Python?
quge856 wrote (6.4 years ago):

Hi there,

I have a long DNA sequence in FASTA format. Could you show me how to split it into fixed-length fragments (100 nt) with a 20 nt overlap between consecutive fragments? For example:

Input:

>E.coli  
ACTG*****************************

Output:

>E.coli(1-100)  
ACTG***********************  
>E.coli(81-180)  
*******************************  
>E.coli(161-260)  
*******************************

Thank you in advance!


Would you like to tell us whether you tried to do this yourself, or if you don't know where to start with the problem?

written 6.4 years ago by Neilfws

I was looking for a script, like the one in JC's answer. Thank you as well.

written 6.4 years ago by quge856
Martin A Hansen (Denmark) wrote (6.4 years ago):

This can be done with Biopieces (www.biopieces.org) like this:

read_fasta -i data_in.fna | split_seq -w 100 -s 20 | write_fasta -o data_out.fna -x

Thank you, martinahansen. After reading the Biopieces introduction, I realized it is a very powerful tool!

written 6.4 years ago by quge856
JC (Mexico) wrote (6.4 years ago):

Perl option:

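#!/usr/bin/perl

use strict;
use warnings;

my $len  = 100;    # fragment length
my $over = 20;     # overlap between consecutive fragments
my ($seq_id, $seq) = ('', '');

# Read a single-record FASTA file: keep the header, join the sequence lines.
while (<>) {
    chomp;
    if (m/^>/) { $seq_id = $_; } else { $seq .= $_; }
}

# Slide a window of $len along the sequence in steps of ($len - $over).
# Coordinates in the output header are 1-based and inclusive.
for (my $i = 1; $i <= length $seq; $i += ($len - $over)) {
    my $s = substr ($seq, $i - 1, $len);
    print "$seq_id ($i-", $i + (length $s) - 1, ")\n$s\n";
}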
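
Since the question also asks about Python, here is a minimal sketch of the same sliding-window idea. It is not taken from any answer in this thread; the function names `read_fasta`, `split_windows` and `shred_fasta` are made up for illustration. Unlike the Perl script, it assumes the input may hold several records and writes the coordinates directly after the sequence name, as in the question. As in the Perl script, the last fragment can be shorter than 100 nt.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def split_windows(seq, length=100, overlap=20):
    """Yield (start, end, fragment) with 1-based, inclusive coordinates."""
    step = length - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than length")
    for offset in range(0, len(seq), step):
        fragment = seq[offset:offset + length]
        yield offset + 1, offset + len(fragment), fragment


def shred_fasta(lines, length=100, overlap=20):
    """Return the output FASTA lines for every window of every record."""
    out = []
    for name, seq in read_fasta(lines):
        for start, end, fragment in split_windows(seq, length, overlap):
            out.append(f">{name}({start}-{end})")
            out.append(fragment)
    return out
```

To run it on a file, open the file and pass the handle, e.g. print("\n".join(shred_fasta(handle))). An open file is iterated line by line, so no extra parsing step is needed.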

Thank you, JC. This is exactly the script I wanted, and it works very well. Thanks again.

written 6.4 years ago by quge856
SES (Vancouver, BC) wrote (6.4 years ago):

This can easily be done with GenomeTools:

gt shredder -minlength 100 -maxlength 100 -overlap 20 ecoli.fasta > ecoli_shredded.fasta

Note that there are also -coverage and -sample options for shredder that will allow you to control how your fragments are generated. Another good option is dwgsim, which is capable of doing sampling with various kinds of mutations, but this may be more than what you need. The Biopieces (mentioned by martinahansen) or genometools solutions are probably more appropriate based on your question.


Thanks for your input. It also looks like a useful tool, in addition to Biopieces.

written 6.4 years ago by quge856
brentp (Salt Lake City, UT) wrote (6.4 years ago):

You can use pyfasta to do this:

pyfasta split -k 100 -o 20 input.fasta -n 1

Thank you! I have learned more skills from all of you. lol

written 6.4 years ago by quge856