Question: Split sequence according the 'N' base
0
gravatar for Alex
13 days ago by
Alex20
America
Alex20 wrote:

Hi, I have some sequence and want to split the sequence by the N base using python or biopython, but I have not good idea how to split it. The sequence like this:

>chr1
ATCGCTGATCGATNCTAGCTAGCTAG
CGTAGCTGCTAGNCGTAGCCGTAGCT

what I need like this

>chr1_1

ATCGCTGATCGAT

>chr1_2

CTAGCTAGCTAGCGTAGCTGCTAG

>chr1_3

CGTAGCCGTAGCT

have any idea about it ?

Thanks

Alex

sequence • 180 views
ADD COMMENTlink modified 12 days ago by cpad01126.4k • written 13 days ago by Alex20

What do you want to happen when you reach 100 Ns in a row?

ADD REPLYlink written 13 days ago by d-cameron1.8k

Sorry,I have forget it. if there exist many N base , we also want to split it util there have 'A','T','G'or 'C' base

ADD REPLYlink written 13 days ago by Alex20
3
gravatar for JC
13 days ago by
JC6.7k
Mexico
JC6.7k wrote:

just for fun:

#!/usr/bin/perl

use strict;
use warnings;

$/ = "\n>";

while (<>) {
    s/>//g;
    my ($id, @seq) = split (/\n+/, $_);
    my @slices = split (/N+/, join("", @seq));
    my $num = 0;
    foreach my $seq (@slices) {
        $num++;
        print ">$id\_$num\n";
        while ($seq) {
            print substr($seq, 0, 80), "\n";
            substr($seq, 0, 80) = '';
        }
    }
}

Run:

$ perl split_by_Ns.pl < test.fa
>chr1_1
ATCGCTGATCGAT
>chr1_2
CTAGCTAGCTAGCGTAGCTGCTAG
>chr1_3
CGTAGCCGTAGCT
ADD COMMENTlink written 13 days ago by JC6.7k
2
gravatar for Alex Reynolds
12 days ago by
Alex Reynolds24k
Seattle, WA USA
Alex Reynolds24k wrote:

Here's a way that doesn't use slow Python libraries, works with repeated stretches of Ns, and should be agnostic to Python 2 or 3:

#!/usr/bin/env python

import sys
import re

not_first = False

def process_seq(header, seq):
    # reduce repeated Ns                                                                                                                                                                                                                          
    seq = re.sub(r'[^\w\s]|(.)(?=\1)', '', seq)
    # get indices of Ns                                                                                                                                                                                                                           
    match_idx = 1
    posn_idx = 0
    for match in re.finditer(r'N(.*?)', seq):
        sys.stdout.write("%s_%d\n%s\n" % (header, match_idx, seq[posn_idx:match.start()]))
        posn_idx = match.end()
        match_idx += 1
    sys.stdout.write("%s_%d\n%s\n" % (header, match_idx, seq[posn_idx:]))

for line in sys.stdin:
    if line.startswith('>'):
        if not_first:
            process_seq(header, seq)
        header = line.rstrip()
        seq = ''
        not_first = True
    else:
        seq += line.rstrip()

if not_first:
    process_seq(header, seq)

Output:

$ ./split_on_Ns.py < input.fa
>chr1_1
ATCGCTGATCGAT
>chr1_2
CTAGCTAGCTAGCGTAGCTGCTAG
>chr1_3
CGTAGCGTAGC
ADD COMMENTlink modified 12 days ago • written 12 days ago by Alex Reynolds24k
1
gravatar for Vijay Lakhujani
13 days ago by
Vijay Lakhujani2.5k
India
Vijay Lakhujani2.5k wrote:

A python solution

Dependencies: Biopython

#!/usr/bin/python2.7

from Bio import SeqIO
import getopt,sys,re


def usage():
    print "Usage: split.py -i <input_scaffold_fasta> -o <output_contig_fasta>"

try:
    options, remainder=getopt.getopt(sys.argv[1:], 'i:o:h')

except getopt.GetoptError as err:
    print str(err)
    usage()
    sys.exit()

for opt, arg in options:
    if opt in ('-i'):
        input_file=arg
    if opt in ('-h'):
        usage()
    sys.exit()
    elif opt in ('-o'):
        output_file=arg

out=open(output_file, 'w')

sequence = ''.join([str(record.seq).strip() for record in SeqIO.parse(input_file, "fasta")])

m=re.sub('[nN]+','\n',sequence).split('\n')

for i in range(0,len(m)):
    out.write('>contig_'+str(i)+'\n')
    out.write(m[i]+'\n')

Usage

[comp]$ python split.py -i input_fasta_file.fa -o output_fasta.fa

Output

>contig_0
ATCGCTGATCGAT
>contig_1
CTAGCTAGCTAGCGTAGCTGCTAG
>contig_2
CGTAGCCGTAGCT
ADD COMMENTlink modified 13 days ago • written 13 days ago by Vijay Lakhujani2.5k

Thanks, looks great

Aelx

ADD REPLYlink written 13 days ago by Alex20

Alex, you should upvote OR accept a post as answer if any one or several posts helped or answered your question

click here to know how

Remember, you can accept multiple answers, as long as they provide valid solution to your top level problem. Accepting an answer is your single most important contribution to the community, as it helps others that face the same problem you did.

ADD REPLYlink modified 12 days ago • written 12 days ago by Vijay Lakhujani2.5k
1
gravatar for cpad0112
12 days ago by
cpad01126.4k
cpad01126.4k wrote:

using awk: can handle multiple Ns, case insensitive for Ns:

$  awk '{gsub("[Nn]+","\n>\n");}1' test.fa | awk '/^>/ {if ($0 == ">") {$0=prev} prev=$0}1' | awk '/^>/ {getline seq} {if(seq!="") {print $0"\n"seq}}' | awk '(/^>/ && a[$0]++) {$0=$0"_"a[$0]}1' 
>chr1
ATCGCTGATCGAT
>chr1_2
CTAGCTAGCTAGCGTAGCTGCTAG
>chr1_3
CGTAGCCGTAGCT
>chr2
CCTGA
>chr2_2
GTAGT
>chr3
ATG
>chr4
C
>chr4_2
T

input: $ cat test.fa

>chr1
ATCGCTGATCGATNCTAGCTAGCTAGCGTAGCTGCTAGNCGTAGCCGTAGCT
>chr2
CCTGANGTAGT
>chr3
ATGNNNNNN
>chr4
NNNNCnT

make sure that sequences are single line. If they are not, use following command instead of above line:

$ seqkit seq -w0 test.fa | awk '{gsub("[Nn]+","\n>\n");}1' | awk '/^>/ {if ($0 == ">") {$0=prev} prev=$0}1' | awk '/^>/ {getline seq} {if(seq!="") {print $0"\n"seq}}' | awk '(/^>/ && a[$0]++) {$0=$0"_"a[$0]}1'

Download seqkit from here.

ADD COMMENTlink modified 12 days ago • written 12 days ago by cpad01126.4k
1

Another way producing same result of yours.

seqkit locate -P -p '[^Nn]+' seqs.fa  --gtf > contigs.gtf

seqkit subseq --gtf contigs.gtf seqs.fa | sed -r 's/_.+//' | seqkit rename | seqkit seq -i
ADD REPLYlink modified 12 days ago • written 12 days ago by shenwei3563.8k
1
gravatar for microfuge
12 days ago by
microfuge900
microfuge900 wrote:

Another approach based on UCSC utils (http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/) and bedtools . Only suitable for bigger files.

  • Convert fasta to twobit

faToTwoBit genome.fa genome.2bit

  • Get the locations of Ns in bed form

twoBitInfo -nBed genome.2bit N.bed

  • Use bedtools to get compliment of Ns and bedtools getfasta to get fasta for the compliments (The genome file can be generated with samtools faidx and cut ) samtools faidx genome.fa; cut -f 1,2 genome.fa.fai > genome.genome

bedtools complement -i N.bed -g genome.genome |bedtools getfasta -fi genome.fa -bed -

Advantage being speed and also getting the coordinates The output -

  • >chr1:0-13
  • ATCGCTGATCGAT
  • >chr1:14-38
  • CTAGCTAGCTAGCGTAGCTGCTAG
  • >chr1:39-52
  • CGTAGCCGTAGCT
ADD COMMENTlink modified 12 days ago • written 12 days ago by microfuge900
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 961 users visited in the last hour