Question: How to get contigs from scaffolds
k.kathirvel93 (India) wrote, 2.8 years ago:

Hi everyone,

How can I get a contig file from a scaffold file (the scaffolds were generated with CLC)? Is there a converter or program for this? For example, this is my scaffold: ACTGTGCATNNNNNNACGCTGCA, and I want a contig file like: Contig1 - ACTGTGCAT and Contig2 - ACGCTGCA

Medhat commented, 2.8 years ago:

There is a similar question here: http://seqanswers.com/forums/showthread.php?t=12993

shenwei356 (China) wrote, 2.8 years ago:

Let me try with FASTA/Q toolkit SeqKit and shell commands.

Sample sequences:

$ cat seqs.fa
>scaffold1
ACTGTGCATNNNNNNACGCTGCA
>scaffold2
ACGACGACGCGATAGAGnnnnnnAGACGAGAG

Here are the commands:

cat seqs.fa                \
    | seqkit fx2tab        \
    | cut -f 2             \
    | sed -r 's/n+/\n/gi'  \
    | cat -n               \
    | seqkit tab2fx        \
    | seqkit replace -p "(.+)" -r "Contig{nr}"

Output:

>Contig1
ACTGTGCAT
>Contig2
ACGCTGCA
>Contig3
ACGACGACGCGATAGAG
>Contig4
AGACGAGAG

Let me explain step by step (do not run this):

cat seqs.fa                \ # read file

    | seqkit fx2tab        \ # convert FASTA to tabular format: scaffold1   ACTGTGCATNNNNNNACGCTGCA

    | cut -f 2             \ # select the 2nd column          : ACTGTGCATNNNNNNACGCTGCA

    | sed -r 's/n+/\n/gi'  \ # replace the Ns with '\n'       : ACTGTGCAT
                             #                                : ACGCTGCA

    | cat -n               \ # output row number              :     1  ACTGTGCAT
                             # I just want a 2-column format  :     2  ACGCTGCA

    | seqkit tab2fx        \ # convert tabular to FASTA       : >1
                             #                                : ACTGTGCAT

    | seqkit replace -p "(.+)" -r "Contig{nr}" # rename sequence header, {nr} means record number
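
For readers who do not have SeqKit installed, the same steps can be sketched in plain Python. This is only an illustration using the standard library; the function names parse_fasta and scaffolds_to_contigs are made up for this example and are not part of SeqKit. As in the pipeline, contigs are numbered across all scaffolds and the scaffold names are dropped.

```python
import re


def parse_fasta(text):
    """Yield (name, sequence) pairs; multi-line sequences are joined."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def scaffolds_to_contigs(text):
    """Return FASTA text with one record per N-free stretch (hypothetical helper)."""
    contigs = []
    for _name, seq in parse_fasta(text):  # like fx2tab + cut -f 2
        # like sed 's/n+/\n/gi': split on runs of N or n, drop empty pieces
        contigs.extend(piece for piece in re.split(r"[nN]+", seq) if piece)
    # like cat -n + tab2fx + replace: number the pieces and write FASTA records
    return "".join(f">Contig{i}\n{c}\n" for i, c in enumerate(contigs, start=1))


example = """>scaffold1
ACTGTGCATNNNNNNACGCTGCA
>scaffold2
ACGACGACGCGATAGAGnnnnnnAGACGAGAG
"""
print(scaffolds_to_contigs(example), end="")
```

On the sample sequences this prints the same four records, Contig1 to Contig4, as the SeqKit pipeline.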

lakhujanivijay (India) wrote, 2.8 years ago:

You can use the Python script below, which I wrote some time back.

Usage:

python contig_from_scaffold.py -i <input_scaffold_fasta> -o <output_contig_fasta>

Here is the script

#!/usr/bin/env python3

from Bio import SeqIO
import getopt, re, sys


def usage():
    print("Usage: python contig_from_scaffold.py -i <input_scaffold_fasta> -o <output_contig_fasta>")

try:
    options, remainder = getopt.getopt(sys.argv[1:], 'i:o:h')
except getopt.GetoptError as err:
    print(str(err))
    usage()
    sys.exit(1)

input_file = output_file = None
for opt, arg in options:
    if opt == '-h':
        usage()
        sys.exit()
    elif opt == '-i':
        input_file = arg
    elif opt == '-o':
        output_file = arg

# Both file names are required
if input_file is None or output_file is None:
    usage()
    sys.exit(1)

contig_number = 0
with open(output_file, 'w') as out:
    # Split each scaffold separately so that no contig spans two scaffolds
    for record in SeqIO.parse(input_file, "fasta"):
        for contig in re.split('[nN]+', str(record.seq)):
            if contig:  # skip empty pieces caused by leading or trailing Ns
                contig_number += 1
                out.write('>contig_' + str(contig_number) + '\n')
                out.write(contig + '\n')

Input:

>scaffold1
ACTGTGCATNNNNNNACGCTGCANnnNNCTGCAnnnCTGCAnnNNNNCTGCA
>scaffold2
ACGACGACGCGATAGAGnnnnnnAGACGAGAGNNNnnACGACGACG

Output:

>contig_1
ACTGTGCAT
>contig_2
ACGCTGCA
>contig_3
CTGCA
>contig_4
CTGCA
>contig_5
CTGCA
>contig_6
ACGACGACGCGATAGAG
>contig_7
AGACGAGAG
>contig_8
ACGACGACG
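
A point worth checking with any regex-based split like this is whether it runs on each scaffold separately. If all scaffold sequences are concatenated first, the last contig of one scaffold gets fused with the first contig of the next. The sketch below, standard library only, shows the difference on the two input scaffolds above; the function names split_joined and split_per_scaffold are made up for this illustration.

```python
import re

# The two scaffolds from the input above
scaffolds = {
    "scaffold1": "ACTGTGCATNNNNNNACGCTGCANnnNNCTGCAnnnCTGCAnnNNNNCTGCA",
    "scaffold2": "ACGACGACGCGATAGAGnnnnnnAGACGAGAGNNNnnACGACGACG",
}

GAP = re.compile(r"[nN]+")  # a gap is any run of N or n


def split_joined(seqs):
    """Concatenate all scaffolds first, then split (fuses neighbouring contigs)."""
    return [piece for piece in GAP.split("".join(seqs)) if piece]


def split_per_scaffold(seqs):
    """Split each scaffold on its own, so no contig crosses a scaffold boundary."""
    return [piece for seq in seqs for piece in GAP.split(seq) if piece]


joined = split_joined(scaffolds.values())
separate = split_per_scaffold(scaffolds.values())

print(len(joined), len(separate))
print("CTGCAACGACGACGCGATAGAG" in joined)    # fused piece spanning both scaffolds
print("CTGCAACGACGACGCGATAGAG" in separate)  # absent when splitting per scaffold
```

Splitting per scaffold gives eight contigs for this input, while splitting the concatenated string gives seven, one of which is a chimera of scaffold1 and scaffold2.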