Question

How to split sequence in fasta file based on long runs of Ns?

0

Entering edit mode

4.2 years ago

stacy734 ▴ 40

Hi all, I have a fasta file with some long runs of Ns. I would like to split the sequences with strings of 100 Ns or longer by replacing these strings with a header: >some_string It isn't necessary to have the header strings be unique, I can renumber the headers afterwards. One issue: the fasta is normally formatted for 70 nt in a line, so the 100 Ns might be spread over multiple lines. Can anyone suggest a unix script or software package? Thanks for any advice. Stacy

fasta unix • 1.8k views

ADD COMMENT • link updated 2.9 years ago by shriram • 0 • written 4.2 years ago by stacy734 ▴ 40

0

Entering edit mode

Thank you! This is very useful.

With appreciation, Stacy

ADD REPLY • link 4.2 years ago by stacy734 ▴ 40

0

Entering edit mode

Please consider upvoting and accepting answers if your problem was solved.

ADD REPLY • link 4.2 years ago by ATpoint 82k

score 1 · Answer 1 · 2020-03-02

If you install bioperl module Bio::SeqIO you could use the script below. Just save it as split_fasta.pl and run like ./split_fasta input.fasta output.fasta 100. In this way sequences from input.fasta file will be divided by 100 Ns and saved to file output.fasta.

If the sequence called "some_string" is splitted to 3 new sequences these are named as: "some_string_1", "some_string_2", "some_string_3".

#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;

# Quit unless 3 command-line arguments are provided
my $num_args = $#ARGV + 1;
if ($num_args != 3) {
    print "\nUsage: split_fasta.pl input_fasta output_fasta N_length\n";
    exit;
}
my $input_fasta  = $ARGV[0];
my $output_fasta = $ARGV[1];
my $N_length     = $ARGV[2];

# Reading the input fasta file
my $seqio_in = Bio::SeqIO->new(-file => $input_fasta, 
                               -format => "fasta" );

# Creating the output fasta file                             
my $seqio_out = Bio::SeqIO->new(-file => ">$output_fasta", 
                                -format => "fasta" );

while ( my $seq = $seqio_in->next_seq ) {

    # Split fasta sequence by Ns of length at least $N_length
    my @splitted_seqs = split /[Nn]{$N_length,}/, $seq->seq;
    my $count = 0;
    my $old_name = $seq->display_id;

    # Save original sequence to output fasta if it was not splitted
    if ($#splitted_seqs == 0) {
        $seqio_out->write_seq($seq);            
    }

    # Save splitted sequences to output fasta by adding number to sequence name
    else {
        foreach my $new_seq (@splitted_seqs) {
            my $new_name = $old_name . "_$count";
            $seq->display_id($new_name);
            $seq->seq($new_seq);
            $seqio_out->write_seq($seq);
            $count++;       
        }
    }

}

score 0 · Answer 2 · 2020-03-01

0

Entering edit mode

4.2 years ago

boaty ▴ 220

I suggest you to use SeqIO.par in biopython or read.fasta() in sequinr. Those tools will transform fasta 4 lines format into a list of object which concatenate fasta sequence line into a string. Then just use split method to cut them

ADD COMMENT • link 4.2 years ago by boaty ▴ 220

score 0 · Answer 3 · 2020-03-01

0

Entering edit mode

4.2 years ago

GenoMax 141k

Prior answer of interest: A: split fastq sequence by motif (string of Ns)

You can linearize the fasta files by using some code from @Pierre found here:

ADD COMMENT • link 4.2 years ago by GenoMax 141k