Split fasta to multi fasta
4
0
Entering edit mode
7.1 years ago

I have a scaffold with N`s inside but I want to split it into separated contigs. The first reason is because I have N's and the other is because I have non-IUPAC characters into my sequence. Just trying I split by N's and eliminated the sequences smaller than 1.

Any suggestion using SeqIO or any other tool.

myfile.fasta

>mysequence
agtagatgatgatagatgatgatgaNNNNtgttgcatgctagctagctagtcgatcgatcgatcgtagctagcaNNNtcgatcgatgtagctagctgacaNctagtcgatgca

my temporary output.fasta using this:

sed -i.bak 's/N/\n>N\n/g' myfile.fasta

>N
agtagatgatgatagatgatgatga
>N

>N

>N

>N

>N
tgttgcatgctagctagctagtcgatcgatcgatcgtagctagca

>N

>N

>N
tcgatcgatgtagctagctgacaNctagtcgatgca

Further I eliminate the NULL sequences or filter >500 to obtain a reasonable set of sequences.

>N
agtagatgatgatagatgatgatga
>N
tgttgcatgctagctagctagtcgatcgatcgatcgtagctagca
>N
tcgatcgatgtagctagctgacaNctagtcgatgca

The problem you can imagine. All sequences have the same names.

How to enumerate them in this order?

sequence Assembly • 3.2k views
ADD COMMENT
0
Entering edit mode

's/N/\n>N\n/g' takes effect on the FASTA headers. So you have to discard seq names first. And you should use regular expression [Nn]+' instead ofN` for splitting.

ADD REPLY
1
Entering edit mode
7.1 years ago

linearize and split with awk:

 awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fasta  |\
 awk -F '\t' '{out=($2 ~ /[nN]/?"file1.fa":"file2.fa"); printf("%s\n%s\n",$1,$2) >> out;}'
ADD COMMENT
0
Entering edit mode

When you wrote the lines your intention was:

... {printf("\n");}' input.fasta |\ awk -F '\t' ...

no? Thank you

ADD REPLY
0
Entering edit mode

no .

ADD REPLY
0
Entering edit mode

So there are two separate lines?

ADD REPLY
0
Entering edit mode

no .

ADD REPLY
0
Entering edit mode
7.1 years ago
st.ph.n ★ 2.7k

Since you've updated, and asked for any tool, here's another python script.

#!/usr/bin/env python
import sys

with open(sys.argv[1], 'r') as f:
        with open(sys.argv[2], 'w') as out:
                header = next(f).strip()
                myseq = ''.join([line.strip() for line in f]).split('N')
                for n in range(len(myseq)):
                        if 'N' not in myseq[n]:
                                if len(myseq[n]) > 1:
                                        out.write(header + '_' + str(n), '\n', myseq[n]

This takes the input file with a single sequence, grabs the header and sequence. The sequence is split into a list, by N chars. For each item in that list, N is not in that portion of the sequence, and the length of that portion is > 1, it prints out to the new fasta file, the header plus '_n', where n is the index of the portion in the list.

Save as split_fasta.py, run as python split_fasta.py infile.fasta outfile.fasta

ADD COMMENT
0
Entering edit mode

It did not work. The output is the same input sequence.

ADD REPLY
0
Entering edit mode

I've added a new script, that doesn't use bipython.

ADD REPLY
0
Entering edit mode
7.1 years ago
5heikki 11k

Original (fails with multifasta):

awk 'BEGIN{RS="N"; OFS="\n"; i=1}{if(length($NF)>1){print ">seq_"i,$NF; i++}}' seq.fa

New version (should work with everything):

sed 's/^>.*/N/' seqs.fa | awk 'BEGIN{RS="N";OFS="\n";i=1}{if(length($0)>1){print ">seq_"i,$0; i++}}' | grep .

Doesn't matter if there are linebreaks in sequences..

ADD COMMENT
0
Entering edit mode

This is cool, but there's a little bug?

$ cat seqs.fa 
>mysequence
agtagatgatgatagatgatgatgaNNNNtgttgcatgctagctagctagtcgatcgatc
gatcgtagctagcaNNNtcgatcgatgtagctagctgacaNctagtcgatgca
>another
aaaaaccccccNNNggggggggNttt

$ cat seqs.fa| awk 'BEGIN{RS="N"; OFS="\n"; i=1}{if(length($NF)>1){print ">seq_"i,$NF; i++}}'
>seq_1
agtagatgatgatagatgatgatga
>seq_2
gatcgtagctagca
>seq_3
tcgatcgatgtagctagctgaca
>seq_4
aaaaacccccc
>seq_5
gggggggg
>seq_6
ttt
ADD REPLY
1
Entering edit mode

This could work, but is not elegant at all

sed 's/^>.*/N/' seqs.fa | awk 'BEGIN{RS="N";OFS="\n";i=1}{if(length($0)>1){print ">seq_"i,$0; i++}}' | grep .
ADD REPLY
0
Entering edit mode

Not just a little bug. It doesn't work at all. Lesson of the day: don't post answers while tired.

ADD REPLY
1
Entering edit mode

Or don't post answers that were not tested.

ADD REPLY
0
Entering edit mode

It worked fine with OP's example data though

ADD REPLY
0
Entering edit mode

oh, I see I see.


ADD REPLY
0
Entering edit mode
7.1 years ago

shell + seqkit:

Steps:

  1. format the FASTA in single line: seqkit seq -s -w 0
  2. replace Ns with \n: sed -r 's/[Nn]+/\n/g'
  3. add number for every line, and convert tab-delimited format to FASTA: cat -n | seqkit tab2fx
  4. rename the FASTA header in format of contig_n: seqkit replace -p '.+' -r 'contig_{nr}'

Sample data:

$ cat seqs.fa 
>mysequence
agtagatgatgatagatgatgatgaNNNNtgttgcatgctagctagctagtcgatcgatc
gatcgtagctagcaNNNtcgatcgatgtagctagctgacaNctagtcgatgca
>another
aaaaaccccccNNNggggggggNttt

Commands:

$ cat seqs.fa | seqkit seq -s -w 0 \
    | sed -r 's/[Nn]+/\n/g' \
    | cat -n | seqkit tab2fx \
    | seqkit replace -p '.+' -r 'contig_{nr}'

>contig_1
agtagatgatgatagatgatgatga
>contig_2
tgttgcatgctagctagctagtcgatcgatcgatcgtagctagca
>contig_3
tcgatcgatgtagctagctgaca
>contig_4
ctagtcgatgca
>contig_5
aaaaacccccc
>contig_6
gggggggg
>contig_7
ttt

Ways to filter by length:

seqkit seq --min-len 100 --max-len 1000
ADD COMMENT

Login before adding your answer.

Traffic: 2388 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6