How to get contigs from scaffolds
Hi everyone,
How can I get a contig file from a scaffold file (the scaffolds were generated with CLC)? Is there a converter or program for this? For example, this is my scaffold: ACTGTGCATNNNNNNACGCTGCA, and I want the contigs from it, like: Contig1 - ACTGTGCAT and Contig2 - ACGCTGCA
Assembly
next-gen
sequencing
alignment
genome
Here is one way to do it with the FASTA/Q toolkit SeqKit and shell commands.
Sample sequences:
$ cat seqs.fa
>scaffold1
ACTGTGCATNNNNNNACGCTGCA
>scaffold2
ACGACGACGCGATAGAGnnnnnnAGACGAGAG
Here are the commands:
cat seqs.fa \
| seqkit fx2tab \
| cut -f 2 \
| sed -r 's/n+/\n/gi' \
| cat -n \
| seqkit tab2fx \
| seqkit replace -p "(.+)" -r "Contig{nr}"
Output:
>Contig1
ACTGTGCAT
>Contig2
ACGCTGCA
>Contig3
ACGACGACGCGATAGAG
>Contig4
AGACGAGAG
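Let me explain step by step (do not run this; the comments after the line continuations would break the command):
cat seqs.fa \
| seqkit fx2tab \                            # convert FASTA to tabular format (name, sequence)
| cut -f 2 \                                 # keep only the sequence column
| sed -r 's/n+/\n/gi' \                      # split at every run of N/n (GNU sed, case-insensitive)
| cat -n \                                   # prepend a line number, used as a temporary sequence ID
| seqkit tab2fx \                            # convert tabular format back to FASTA
| seqkit replace -p "(.+)" -r "Contig{nr}"   # rename each record to Contig<record number>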
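For readers who want to see the core of the pipeline without SeqKit, the split step (the `sed` line) can be sketched in plain Python. This is only an illustration, assuming that every run of `N`/`n` marks a gap between contigs; the function name `split_on_gaps` is made up for this example and is not part of SeqKit.

```python
import re


def split_on_gaps(sequence):
    """Split a scaffold sequence into contigs at every run of N/n."""
    # re.split yields empty strings when the sequence starts or ends with N,
    # so empty pieces are filtered out.
    return [piece for piece in re.split(r"[Nn]+", sequence) if piece]


# The two sample scaffolds from seqs.fa above.
scaffolds = ["ACTGTGCATNNNNNNACGCTGCA", "ACGACGACGCGATAGAGnnnnnnAGACGAGAG"]

# Number the contigs consecutively across all scaffolds, like Contig{nr}.
contigs = [contig for scaffold in scaffolds for contig in split_on_gaps(scaffold)]
for number, contig in enumerate(contigs, start=1):
    print(f">Contig{number}")
    print(contig)
```

Unlike the `sed` version, this sketch drops empty pieces, so a scaffold that begins or ends with Ns does not produce an empty record.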
You can use the Python script below, which I wrote some time ago.
Usage:
python contig_from_scaffold.py -i <input_scaffold_fasta> -o <output_contig_fasta>
Here is the script:
import getopt
import re
import sys

from Bio import SeqIO


def usage():
    print("Usage: python contig_from_scaffold.py -i <input_scaffold_fasta> -o <output_contig_fasta>")


try:
    options, remainder = getopt.getopt(sys.argv[1:], 'i:o:h')
except getopt.GetoptError as err:
    print(str(err))
    usage()
    sys.exit(2)

input_file = output_file = None
for opt, arg in options:
    if opt == '-h':
        usage()
        sys.exit()
    elif opt == '-i':
        input_file = arg
    elif opt == '-o':
        output_file = arg

# Both -i and -o are required.
if input_file is None or output_file is None:
    usage()
    sys.exit(2)

contig_number = 0
with open(output_file, 'w') as out:
    # Split each scaffold separately so a contig never spans two scaffolds.
    for record in SeqIO.parse(input_file, "fasta"):
        for contig in re.split('[nN]+', str(record.seq)):
            if not contig:  # empty piece from leading or trailing Ns
                continue
            contig_number += 1
            out.write('>contig_' + str(contig_number) + '\n')
            out.write(contig + '\n')
Input:
>scaffold1
ACTGTGCATNNNNNNACGCTGCANnnNNCTGCAnnnCTGCAnnNNNNCTGCA
>scaffold2
ACGACGACGCGATAGAGnnnnnnAGACGAGAGNNNnnACGACGACG
Output:
>contig_1
ACTGTGCAT
>contig_2
ACGCTGCA
>contig_3
CTGCA
>contig_4
CTGCA
>contig_5
CTGCA
>contig_6
ACGACGACGCGATAGAG
>contig_7
AGACGAGAG
>contig_8
ACGACGACG
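If it matters which scaffold each contig came from, a variant along the following lines may help. It is a rough sketch that uses only the standard library (no Biopython), names each contig after its scaffold, and reports 1-based coordinates. The helper names `read_fasta` and `contigs_with_coords` and the `min_gap` parameter are assumptions made for this example. `min_gap` is there because a single N can be an ambiguous base rather than a scaffolding gap, so you may prefer to split only on longer runs.

```python
import re


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the record name.
            header = line[1:].split()
            name = header[0] if header else ""
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def contigs_with_coords(name, sequence, min_gap=1):
    """Yield (contig_id, start, end, contig) with 1-based inclusive coordinates.

    Only runs of at least `min_gap` N/n characters are treated as gaps.
    """
    gap = re.compile(r"[Nn]{%d,}" % min_gap)
    start = 0
    index = 0
    for match in gap.finditer(sequence):
        if match.start() > start:  # skip empty pieces, e.g. leading Ns
            index += 1
            yield f"{name}_contig{index}", start + 1, match.start(), sequence[start:match.start()]
        start = match.end()
    if start < len(sequence):  # last contig after the final gap
        index += 1
        yield f"{name}_contig{index}", start + 1, len(sequence), sequence[start:]


fasta = """>scaffold1
ACTGTGCATNNNNNNACGCTGCA
>scaffold2
ACGACGACGCGATAGAGnnnnnnAGACGAGAG
""".splitlines()

for name, seq in read_fasta(fasta):
    for contig_id, start, end, contig in contigs_with_coords(name, seq):
        # The header records the source scaffold and the contig's position in it.
        print(f">{contig_id} {name}:{start}-{end}")
        print(contig)
```

To read a real file, pass an open file handle to `read_fasta` instead of the list of lines, and raise `min_gap` (for example to 10) if short N runs should stay inside a contig.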
There is a similar question here: http://seqanswers.com/forums/showthread.php?t=12993