Split sequence according the 'N' base
5
Hi,
I have some sequence and want to split the sequence by the N base using python or biopython, but I have not good idea how to split it. The sequence like this:
> chr1
ATCGCTGATCGATNCTAGCTAGCTAG
CGTAGCTGCTAGNCGTAGCCGTAGCT
what I need like this
> chr1_1
ATCGCTGATCGAT
> chr1_2
CTAGCTAGCTAGCGTAGCTGCTAG
> chr1_3
CGTAGCCGTAGCT
have any idea about it ?
Thanks
Alex
sequence
• 3.9k views
•
link
updated 6.9 years ago by
Biostar
20
•
written 7.0 years ago by
Alex
▴
50
A python solution
Dependencies: Biopython
from Bio import SeqIO
import getopt,sys,re
def usage( ) :
print "Usage: split.py -i <input_scaffold_fasta> -o <output_contig_fasta>"
try:
options, remainder= getopt.getopt( sys.argv[ 1:] , 'i:o:h' )
except getopt.GetoptError as err:
print str( err)
usage( )
sys.exit( )
for opt, arg in options:
if opt in ( '-i' ) :
input_file= arg
if opt in ( '-h' ) :
usage( )
sys.exit( )
elif opt in ( '-o' ) :
output_file= arg
out= open( output_file, 'w' )
sequence = '' .join( [ str( record.seq) .strip( ) for record in SeqIO.parse( input_file, "fasta" ) ] )
m= re.sub( '[nN]+' ,'\n' ,sequence) .split( '\n' )
for i in range( 0,len( m)) :
out.write( '>contig_' +str( i) +'\n' )
out.write( m[ i] +'\n' )
Usage
[ comp] $ python split.py -i input_fasta_file.fa -o output_fasta.fa
Output
> contig_0
ATCGCTGATCGAT
> contig_1
CTAGCTAGCTAGCGTAGCTGCTAG
> contig_2
CGTAGCCGTAGCT
just for fun:
use strict;
use warnings;
$/ = "\n>" ;
while ( < > ) {
s/> //g;
my ( $id , @seq) = split ( /\n+/, $_ ) ;
my @slices = split ( /N+/, join( "" , @seq)) ;
my $num = 0;
foreach my $seq ( @slices) {
$num ++;
print ">$id \_$num \n" ;
while ( $seq ) {
print substr( $seq , 0, 80) , "\n" ;
substr( $seq , 0, 80) = '' ;
}
}
}
Run:
$ perl split_by_Ns.pl < test.fa
> chr1_1
ATCGCTGATCGAT
> chr1_2
CTAGCTAGCTAGCGTAGCTGCTAG
> chr1_3
CGTAGCCGTAGCT
•
link
7.0 years ago by
JC
13k
Here's a way that doesn't use slow Python libraries, works with repeated stretches of Ns, and should be agnostic to Python 2 or 3:
import sys
import re
not_first = False
def process_seq( header, seq) :
seq = re.sub( r'[^\w\s]|(.)(?=\1)' , '' , seq)
match_idx = 1
posn_idx = 0
for match in re.finditer( r'N(.*?)' , seq) :
sys.stdout.write( "%s_%d\n%s\n" % ( header, match_idx, seq[ posn_idx:match.start( ) ] ))
posn_idx = match.end( )
match_idx += 1
sys.stdout.write( "%s_%d\n%s\n" % ( header, match_idx, seq[ posn_idx:] ))
for line in sys.stdin:
if line.startswith( '>' ) :
if not_first:
process_seq( header, seq)
header = line.rstrip( )
seq = ''
not_first = True
else:
seq += line.rstrip( )
if not_first:
process_seq( header, seq)
Output:
$ ./split_on_Ns.py < input.fa
> chr1_1
ATCGCTGATCGAT
> chr1_2
CTAGCTAGCTAGCGTAGCTGCTAG
> chr1_3
CGTAGCGTAGC
using awk: can handle multiple Ns, case insensitive for Ns:
$ awk '{gsub("[Nn]+","\n>\n");}1' test.fa | awk '/^>/ {if ($0 == ">") {$0 =prev} prev=$0 }1' | awk '/^>/ {getline seq} {if(seq!="") {print $0 "\n"seq}}' | awk '(/^>/ && a[$0 ]++) {$0 =$0 "_"a[$0 ]}1'
> chr1
ATCGCTGATCGAT
> chr1_2
CTAGCTAGCTAGCGTAGCTGCTAG
> chr1_3
CGTAGCCGTAGCT
> chr2
CCTGA
> chr2_2
GTAGT
> chr3
ATG
> chr4
C
> chr4_2
T
input:
$ cat test.fa
> chr1
ATCGCTGATCGATNCTAGCTAGCTAGCGTAGCTGCTAGNCGTAGCCGTAGCT
> chr2
CCTGANGTAGT
> chr3
ATGNNNNNN
> chr4
NNNNCnT
make sure that sequences are single line. If they are not, use following command instead of above line:
$ seqkit seq -w0 test.fa | awk '{gsub("[Nn]+","\n>\n");}1' | awk '/^>/ {if ($0 == ">") {$0 =prev} prev=$0 }1' | awk '/^>/ {getline seq} {if(seq!="") {print $0 "\n"seq}}' | awk '(/^>/ && a[$0 ]++) {$0 =$0 "_"a[$0 ]}1'
Download seqkit from here .
Another approach based on UCSC utils (http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/ ) and bedtools . Only suitable for bigger files.
faToTwoBit genome.fa genome.2bit
Get the locations of Ns in bed form
twoBitInfo -nBed genome.2bit N.bed
Use bedtools to get compliment of Ns and bedtools getfasta to get fasta for the compliments (The genome file can be generated with samtools faidx and cut )
samtools faidx genome.fa; cut -f 1,2 genome.fa.fai > genome.genome
bedtools complement -i N.bed -g genome.genome | bedtools getfasta -fi genome.fa -bed -
Advantage being speed and also getting the coordinates
The output -
>chr1:0-13
ATCGCTGATCGAT
>chr1:14-38
CTAGCTAGCTAGCGTAGCTGCTAG
>chr1:39-52
CGTAGCCGTAGCT
Login before adding your answer.
Traffic: 3851 users visited in the last hour
What do you want to happen when you reach 100 Ns in a row?
Sorry,I have forget it. if there exist many N base , we also want to split it util there have 'A','T','G'or 'C' base