Question

Split sequence according the 'N' base

0

Entering edit mode

7.0 years ago

Alex ▴ 50

Hi, I have some sequence and want to split the sequence by the N base using python or biopython, but I have not good idea how to split it. The sequence like this:

>chr1
ATCGCTGATCGATNCTAGCTAGCTAG
CGTAGCTGCTAGNCGTAGCCGTAGCT

what I need like this

>chr1_1

ATCGCTGATCGAT

>chr1_2

CTAGCTAGCTAGCGTAGCTGCTAG

>chr1_3

CGTAGCCGTAGCT

have any idea about it ?

Thanks

Alex

sequence • 3.9k views

ADD COMMENT • link updated 6.9 years ago by Biostar 20 • written 7.0 years ago by Alex ▴ 50

0

Entering edit mode

What do you want to happen when you reach 100 Ns in a row?

ADD REPLY • link 7.0 years ago by d-cameron ★ 2.9k

0

Entering edit mode

Sorry,I have forget it. if there exist many N base , we also want to split it util there have 'A','T','G'or 'C' base

ADD REPLY • link 7.0 years ago by Alex ▴ 50

2

Entering edit mode

7.0 years ago

microfuge ★ 2.0k

Another approach based on UCSC utils (http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/) and bedtools . Only suitable for bigger files.

Convert fasta to twobit

faToTwoBit genome.fa genome.2bit

Get the locations of Ns in bed form

twoBitInfo -nBed genome.2bit N.bed

Use bedtools to get compliment of Ns and bedtools getfasta to get fasta for the compliments (The genome file can be generated with samtools faidx and cut ) samtools faidx genome.fa; cut -f 1,2 genome.fa.fai > genome.genome

bedtools complement -i N.bed -g genome.genome |bedtools getfasta -fi genome.fa -bed -

Advantage being speed and also getting the coordinates The output -

>chr1:0-13
ATCGCTGATCGAT
>chr1:14-38
CTAGCTAGCTAGCGTAGCTGCTAG
>chr1:39-52
CGTAGCCGTAGCT

ADD COMMENT • link 7.0 years ago by microfuge ★ 2.0k

score 4 · Accepted Answer · 2018-06-12

4

Entering edit mode

7.0 years ago

lakhujanivijay 5.9k

A python solution

Dependencies: Biopython

#!/usr/bin/python2.7

from Bio import SeqIO
import getopt,sys,re


def usage():
    print "Usage: split.py -i <input_scaffold_fasta> -o <output_contig_fasta>"

try:
    options, remainder=getopt.getopt(sys.argv[1:], 'i:o:h')

except getopt.GetoptError as err:
    print str(err)
    usage()
    sys.exit()

for opt, arg in options:
    if opt in ('-i'):
        input_file=arg
    if opt in ('-h'):
        usage()
    sys.exit()
    elif opt in ('-o'):
        output_file=arg

out=open(output_file, 'w')

sequence = ''.join([str(record.seq).strip() for record in SeqIO.parse(input_file, "fasta")])

m=re.sub('[nN]+','\n',sequence).split('\n')

for i in range(0,len(m)):
    out.write('>contig_'+str(i)+'\n')
    out.write(m[i]+'\n')

Usage

[comp]$ python split.py -i input_fasta_file.fa -o output_fasta.fa

Output

>contig_0
ATCGCTGATCGAT
>contig_1
CTAGCTAGCTAGCGTAGCTGCTAG
>contig_2
CGTAGCCGTAGCT

ADD COMMENT • link 7.0 years ago by lakhujanivijay 5.9k

0

Entering edit mode

Thanks, looks great

Aelx

ADD REPLY • link 7.0 years ago by Alex ▴ 50

0

Entering edit mode

Alex, you should upvote OR accept a post as answer if any one or several posts helped or answered your question

click here to know how

Remember, you can accept multiple answers, as long as they provide valid solution to your top level problem. Accepting an answer is your single most important contribution to the community, as it helps others that face the same problem you did.

ADD REPLY • link 7.0 years ago by lakhujanivijay 5.9k

score 4 · Accepted Answer · 2018-06-12

just for fun:

#!/usr/bin/perl

use strict;
use warnings;

$/ = "\n>";

while (<>) {
    s/>//g;
    my ($id, @seq) = split (/\n+/, $_);
    my @slices = split (/N+/, join("", @seq));
    my $num = 0;
    foreach my $seq (@slices) {
        $num++;
        print ">$id\_$num\n";
        while ($seq) {
            print substr($seq, 0, 80), "\n";
            substr($seq, 0, 80) = '';
        }
    }
}

Run:

$ perl split_by_Ns.pl < test.fa
>chr1_1
ATCGCTGATCGAT
>chr1_2
CTAGCTAGCTAGCGTAGCTGCTAG
>chr1_3
CGTAGCCGTAGCT

score 4 · Accepted Answer · 2018-06-12

Here's a way that doesn't use slow Python libraries, works with repeated stretches of Ns, and should be agnostic to Python 2 or 3:

#!/usr/bin/env python

import sys
import re

not_first = False

def process_seq(header, seq):
    # reduce repeated Ns                                                                                                                                                                                                                          
    seq = re.sub(r'[^\w\s]|(.)(?=\1)', '', seq)
    # get indices of Ns                                                                                                                                                                                                                           
    match_idx = 1
    posn_idx = 0
    for match in re.finditer(r'N(.*?)', seq):
        sys.stdout.write("%s_%d\n%s\n" % (header, match_idx, seq[posn_idx:match.start()]))
        posn_idx = match.end()
        match_idx += 1
    sys.stdout.write("%s_%d\n%s\n" % (header, match_idx, seq[posn_idx:]))

for line in sys.stdin:
    if line.startswith('>'):
        if not_first:
            process_seq(header, seq)
        header = line.rstrip()
        seq = ''
        not_first = True
    else:
        seq += line.rstrip()

if not_first:
    process_seq(header, seq)

Output:

$ ./split_on_Ns.py < input.fa
>chr1_1
ATCGCTGATCGAT
>chr1_2
CTAGCTAGCTAGCGTAGCTGCTAG
>chr1_3
CGTAGCGTAGC

score 3 · Accepted Answer · 2018-06-12

using awk: can handle multiple Ns, case insensitive for Ns:

$  awk '{gsub("[Nn]+","\n>\n");}1' test.fa | awk '/^>/ {if ($0 == ">") {$0=prev} prev=$0}1' | awk '/^>/ {getline seq} {if(seq!="") {print $0"\n"seq}}' | awk '(/^>/ && a[$0]++) {$0=$0"_"a[$0]}1' 
>chr1
ATCGCTGATCGAT
>chr1_2
CTAGCTAGCTAGCGTAGCTGCTAG
>chr1_3
CGTAGCCGTAGCT
>chr2
CCTGA
>chr2_2
GTAGT
>chr3
ATG
>chr4
C
>chr4_2
T

input: $ cat test.fa

>chr1
ATCGCTGATCGATNCTAGCTAGCTAGCGTAGCTGCTAGNCGTAGCCGTAGCT
>chr2
CCTGANGTAGT
>chr3
ATGNNNNNN
>chr4
NNNNCnT

make sure that sequences are single line. If they are not, use following command instead of above line:

$ seqkit seq -w0 test.fa | awk '{gsub("[Nn]+","\n>\n");}1' | awk '/^>/ {if ($0 == ">") {$0=prev} prev=$0}1' | awk '/^>/ {getline seq} {if(seq!="") {print $0"\n"seq}}' | awk '(/^>/ && a[$0]++) {$0=$0"_"a[$0]}1'

Download seqkit from here.