How to get contigs from scaffolds
8.1 years ago
k.kathirvel93 ▴ 310

Hi everyone,

How can I get a contig file from a scaffold file (the scaffolds were generated with CLC)? Is there a converter or program for this? For example, given the scaffold ACTGTGCATNNNNNNACGCTGCA, I want a contig file like: Contig1 - ACTGTGCAT and Contig2 - ACGCTGCA

Assembly next-gen sequencing alignment genome • 7.5k views
8.0 years ago

Here is one way to do it with SeqKit (a FASTA/Q toolkit) and standard shell commands.

Sample sequences:

$ cat seqs.fa
>scaffold1
ACTGTGCATNNNNNNACGCTGCA
>scaffold2
ACGACGACGCGATAGAGnnnnnnAGACGAGAG

Here are the commands:

cat seqs.fa                \
    | seqkit fx2tab        \
    | cut -f 2             \
    | sed -r 's/n+/\n/gi'  \
    | cat -n               \
    | seqkit tab2fx        \
    | seqkit replace -p "(.+)" -r "Contig{nr}"

Output:

>Contig1
ACTGTGCAT
>Contig2
ACGCTGCA
>Contig3
ACGACGACGCGATAGAG
>Contig4
AGACGAGAG

Let me explain step by step (do not run this version: the comments after the backslashes break the line continuations):

cat seqs.fa                \ # read file

    | seqkit fx2tab        \ # convert FASTA to tabular format: scaffold1   ACTGTGCATNNNNNNACGCTGCA

    | cut -f 2             \ # select the 2nd column          : ACTGTGCATNNNNNNACGCTGCA

    | sed -r 's/n+/\n/gi'  \ # replace the Ns with '\n'       : ACTGTGCAT
                             #                                : ACGCTGCA

    | cat -n               \ # output row number              :     1  ACTGTGCAT
                             # I just want a 2-column format  :     2  ACGCTGCA

    | seqkit tab2fx        \ # convert tabular to FASTA       : >1
                             #                                : ACTGTGCAT

    | seqkit replace -p "(.+)" -r "Contig{nr}" # rename sequence header, {nr} means record number
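
For anyone without SeqKit installed, the same idea (split every sequence at runs of N and renumber the pieces) can be sketched in plain Python. This is only a sketch under assumptions: read_fasta and scaffolds_to_contigs are names I made up for illustration, and it assumes ordinary FASTA input small enough to hold in memory.

```python
import re


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def scaffolds_to_contigs(lines):
    """Split every scaffold at runs of N/n and number the pieces globally."""
    contigs = []
    for _name, seq in read_fasta(lines):
        for piece in re.split(r"[nN]+", seq):
            if piece:  # leading/trailing Ns produce empty pieces, skip them
                contigs.append(piece)
    return [(f"Contig{i}", piece) for i, piece in enumerate(contigs, start=1)]


if __name__ == "__main__":
    sample = [
        ">scaffold1",
        "ACTGTGCATNNNNNNACGCTGCA",
        ">scaffold2",
        "ACGACGACGCGATAGAGnnnnnnAGACGAGAG",
    ]
    for header, seq in scaffolds_to_contigs(sample):
        print(f">{header}\n{seq}")
```

On the sample sequences this gives the same four contigs as the pipeline. Each scaffold is split on its own, so a contig never spans two scaffolds.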
8.0 years ago

You can use the Python script below, which I wrote some time ago. It requires Biopython.

Usage:

python contig_from_scaffold.py -i <input_scaffold_fasta> -o <output_contig_fasta>

Here is the script:

#!/usr/bin/env python3

import getopt
import re
import sys

from Bio import SeqIO


def usage():
    print("Usage: python contig_from_scaffold.py -i <input_scaffold_fasta> -o <output_contig_fasta>")


try:
    options, remainder = getopt.getopt(sys.argv[1:], 'i:o:h')
except getopt.GetoptError as err:
    print(str(err))
    usage()
    sys.exit(1)

input_file = output_file = None
for opt, arg in options:
    if opt == '-h':
        usage()
        sys.exit()
    elif opt == '-i':
        input_file = arg
    elif opt == '-o':
        output_file = arg

# Both file names are required.
if input_file is None or output_file is None:
    usage()
    sys.exit(1)

# Split each scaffold separately at runs of N/n, so that a contig never
# spans two scaffolds, and number the contigs from 1 across the whole file.
count = 0
with open(output_file, 'w') as out:
    for record in SeqIO.parse(input_file, "fasta"):
        for contig in re.split('[nN]+', str(record.seq)):
            if not contig:  # empty piece from leading or trailing Ns
                continue
            count += 1
            out.write('>contig_' + str(count) + '\n')
            out.write(contig + '\n')

Input:

>scaffold1
ACTGTGCATNNNNNNACGCTGCANnnNNCTGCAnnnCTGCAnnNNNNCTGCA
>scaffold2
ACGACGACGCGATAGAGnnnnnnAGACGAGAGNNNnnACGACGACG

Output:

>contig_1
ACTGTGCAT
>contig_2
ACGCTGCA
>contig_3
CTGCA
>contig_4
CTGCA
>contig_5
CTGCA
>contig_6
ACGACGACGCGATAGAG
>contig_7
AGACGAGAG
>contig_8
ACGACGACG
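
If you also need to know where each contig came from, one possible variant records the scaffold name and coordinates in the header. This is a sketch and not part of the script above: the function name contigs_with_coordinates and the name:start-end header layout are my own assumptions, and it treats every character other than N/n as contig sequence.

```python
import re


def contigs_with_coordinates(name, seq):
    """Return (header, contig) pairs with 1-based inclusive coordinates."""
    results = []
    for match in re.finditer(r"[^nN]+", seq):
        # match.start() is 0-based and match.end() is exclusive,
        # so start + 1 .. end is the 1-based inclusive range.
        header = f"{name}:{match.start() + 1}-{match.end()}"
        results.append((header, match.group()))
    return results


if __name__ == "__main__":
    for header, contig in contigs_with_coordinates("scaffold1", "ACTGTGCATNNNNNNACGCTGCA"):
        print(f">{header}\n{contig}")
```

With headers like these, each contig can be traced back to its scaffold, and the gap lengths can be recovered from the coordinates.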