Question: Split sequence according the 'N' base
0
gravatar for Alex
6 months ago by
Alex20
America
Alex20 wrote:

Hi, I have some sequence and want to split the sequence by the N base using python or biopython, but I have not good idea how to split it. The sequence like this:

>chr1
ATCGCTGATCGATNCTAGCTAGCTAG
CGTAGCTGCTAGNCGTAGCCGTAGCT

what I need like this

>chr1_1

ATCGCTGATCGAT

>chr1_2

CTAGCTAGCTAGCGTAGCTGCTAG

>chr1_3

CGTAGCCGTAGCT

have any idea about it ?

Thanks

Alex

sequence • 447 views
ADD COMMENTlink modified 4 months ago by Biostar ♦♦ 20 • written 6 months ago by Alex20

What do you want to happen when you reach 100 Ns in a row?

ADD REPLYlink written 6 months ago by d-cameron1.9k

Sorry,I have forget it. if there exist many N base , we also want to split it util there have 'A','T','G'or 'C' base

ADD REPLYlink written 6 months ago by Alex20
4
gravatar for JC
6 months ago by
JC7.0k
Mexico
JC7.0k wrote:

just for fun:

#!/usr/bin/perl

use strict;
use warnings;

$/ = "\n>";

while (<>) {
    s/>//g;
    my ($id, @seq) = split (/\n+/, $_);
    my @slices = split (/N+/, join("", @seq));
    my $num = 0;
    foreach my $seq (@slices) {
        $num++;
        print ">$id\_$num\n";
        while ($seq) {
            print substr($seq, 0, 80), "\n";
            substr($seq, 0, 80) = '';
        }
    }
}

Run:

$ perl split_by_Ns.pl < test.fa
>chr1_1
ATCGCTGATCGAT
>chr1_2
CTAGCTAGCTAGCGTAGCTGCTAG
>chr1_3
CGTAGCCGTAGCT
ADD COMMENTlink written 6 months ago by JC7.0k
3
gravatar for Alex Reynolds
6 months ago by
Alex Reynolds26k
Seattle, WA USA
Alex Reynolds26k wrote:

Here's a way that doesn't use slow Python libraries, works with repeated stretches of Ns, and should be agnostic to Python 2 or 3:

#!/usr/bin/env python

import sys
import re

not_first = False

def process_seq(header, seq):
    # reduce repeated Ns                                                                                                                                                                                                                          
    seq = re.sub(r'[^\w\s]|(.)(?=\1)', '', seq)
    # get indices of Ns                                                                                                                                                                                                                           
    match_idx = 1
    posn_idx = 0
    for match in re.finditer(r'N(.*?)', seq):
        sys.stdout.write("%s_%d\n%s\n" % (header, match_idx, seq[posn_idx:match.start()]))
        posn_idx = match.end()
        match_idx += 1
    sys.stdout.write("%s_%d\n%s\n" % (header, match_idx, seq[posn_idx:]))

for line in sys.stdin:
    if line.startswith('>'):
        if not_first:
            process_seq(header, seq)
        header = line.rstrip()
        seq = ''
        not_first = True
    else:
        seq += line.rstrip()

if not_first:
    process_seq(header, seq)

Output:

$ ./split_on_Ns.py < input.fa
>chr1_1
ATCGCTGATCGAT
>chr1_2
CTAGCTAGCTAGCGTAGCTGCTAG
>chr1_3
CGTAGCGTAGC
ADD COMMENTlink modified 6 months ago • written 6 months ago by Alex Reynolds26k
2
gravatar for Vijay Lakhujani
6 months ago by
Vijay Lakhujani3.4k
India
Vijay Lakhujani3.4k wrote:

A python solution

Dependencies: Biopython

#!/usr/bin/python2.7

from Bio import SeqIO
import getopt,sys,re


def usage():
    print "Usage: split.py -i <input_scaffold_fasta> -o <output_contig_fasta>"

try:
    options, remainder=getopt.getopt(sys.argv[1:], 'i:o:h')

except getopt.GetoptError as err:
    print str(err)
    usage()
    sys.exit()

for opt, arg in options:
    if opt in ('-i'):
        input_file=arg
    if opt in ('-h'):
        usage()
    sys.exit()
    elif opt in ('-o'):
        output_file=arg

out=open(output_file, 'w')

sequence = ''.join([str(record.seq).strip() for record in SeqIO.parse(input_file, "fasta")])

m=re.sub('[nN]+','\n',sequence).split('\n')

for i in range(0,len(m)):
    out.write('>contig_'+str(i)+'\n')
    out.write(m[i]+'\n')

Usage

[comp]$ python split.py -i input_fasta_file.fa -o output_fasta.fa

Output

>contig_0
ATCGCTGATCGAT
>contig_1
CTAGCTAGCTAGCGTAGCTGCTAG
>contig_2
CGTAGCCGTAGCT
ADD COMMENTlink modified 6 months ago • written 6 months ago by Vijay Lakhujani3.4k

Thanks, looks great

Aelx

ADD REPLYlink written 6 months ago by Alex20

Alex, you should upvote OR accept a post as answer if any one or several posts helped or answered your question

click here to know how

Remember, you can accept multiple answers, as long as they provide valid solution to your top level problem. Accepting an answer is your single most important contribution to the community, as it helps others that face the same problem you did.

ADD REPLYlink modified 6 months ago • written 6 months ago by Vijay Lakhujani3.4k
2
gravatar for cpad0112
6 months ago by
cpad011210k
India
cpad011210k wrote:

using awk: can handle multiple Ns, case insensitive for Ns:

$  awk '{gsub("[Nn]+","\n>\n");}1' test.fa | awk '/^>/ {if ($0 == ">") {$0=prev} prev=$0}1' | awk '/^>/ {getline seq} {if(seq!="") {print $0"\n"seq}}' | awk '(/^>/ && a[$0]++) {$0=$0"_"a[$0]}1' 
>chr1
ATCGCTGATCGAT
>chr1_2
CTAGCTAGCTAGCGTAGCTGCTAG
>chr1_3
CGTAGCCGTAGCT
>chr2
CCTGA
>chr2_2
GTAGT
>chr3
ATG
>chr4
C
>chr4_2
T

input: $ cat test.fa

>chr1
ATCGCTGATCGATNCTAGCTAGCTAGCGTAGCTGCTAGNCGTAGCCGTAGCT
>chr2
CCTGANGTAGT
>chr3
ATGNNNNNN
>chr4
NNNNCnT

make sure that sequences are single line. If they are not, use following command instead of above line:

$ seqkit seq -w0 test.fa | awk '{gsub("[Nn]+","\n>\n");}1' | awk '/^>/ {if ($0 == ">") {$0=prev} prev=$0}1' | awk '/^>/ {getline seq} {if(seq!="") {print $0"\n"seq}}' | awk '(/^>/ && a[$0]++) {$0=$0"_"a[$0]}1'

Download seqkit from here.

ADD COMMENTlink modified 6 months ago • written 6 months ago by cpad011210k
1

Another way producing same result of yours.

seqkit locate -P -p '[^Nn]+' seqs.fa  --gtf > contigs.gtf

seqkit subseq --gtf contigs.gtf seqs.fa | sed -r 's/_.+//' | seqkit rename | seqkit seq -i
ADD REPLYlink modified 6 months ago • written 6 months ago by shenwei3564.3k
2
gravatar for microfuge
6 months ago by
microfuge950
microfuge950 wrote:

Another approach based on UCSC utils (http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/) and bedtools . Only suitable for bigger files.

  • Convert fasta to twobit

faToTwoBit genome.fa genome.2bit

  • Get the locations of Ns in bed form

twoBitInfo -nBed genome.2bit N.bed

  • Use bedtools to get compliment of Ns and bedtools getfasta to get fasta for the compliments (The genome file can be generated with samtools faidx and cut ) samtools faidx genome.fa; cut -f 1,2 genome.fa.fai > genome.genome

bedtools complement -i N.bed -g genome.genome |bedtools getfasta -fi genome.fa -bed -

Advantage being speed and also getting the coordinates The output -

  • >chr1:0-13
  • ATCGCTGATCGAT
  • >chr1:14-38
  • CTAGCTAGCTAGCGTAGCTGCTAG
  • >chr1:39-52
  • CGTAGCCGTAGCT
ADD COMMENTlink modified 6 months ago • written 6 months ago by microfuge950
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1594 users visited in the last hour