Question: Split fasta to multi fasta
0
gravatar for felipelira3
2.1 years ago by
felipelira30 wrote:

I have a scaffold with N`s inside but I want to split it into separated contigs. The first reason is because I have N's and the other is because I have non-IUPAC characters into my sequence. Just trying I split by N's and eliminated the sequences smaller than 1.

Any suggestion using SeqIO or any other tool.

myfile.fasta

>mysequence
agtagatgatgatagatgatgatgaNNNNtgttgcatgctagctagctagtcgatcgatcgatcgtagctagcaNNNtcgatcgatgtagctagctgacaNctagtcgatgca

my temporary output.fasta using this:

sed -i.bak 's/N/\n>N\n/g' myfile.fasta

>N
agtagatgatgatagatgatgatga
>N

>N

>N

>N

>N
tgttgcatgctagctagctagtcgatcgatcgatcgtagctagca

>N

>N

>N
tcgatcgatgtagctagctgacaNctagtcgatgca

Further I eliminate the NULL sequences or filter >500 to obtain a reasonable set of sequences.

>N
agtagatgatgatagatgatgatga
>N
tgttgcatgctagctagctagtcgatcgatcgatcgtagctagca
>N
tcgatcgatgtagctagctgacaNctagtcgatgca

The problem you can imagine. All sequences have the same names.

How to enumerate them in this order?

sequence assembly • 808 views
ADD COMMENTlink modified 2.1 years ago by shenwei3564.6k • written 2.1 years ago by felipelira30

's/N/\n>N\n/g' takes effect on the FASTA headers. So you have to discard seq names first. And you should use regular expression [Nn]+' instead ofN` for splitting.

ADD REPLYlink written 2.1 years ago by shenwei3564.6k
1
gravatar for Pierre Lindenbaum
2.1 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum120k wrote:

linearize and split with awk:

 awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fasta  |\
 awk -F '\t' '{out=($2 ~ /[nN]/?"file1.fa":"file2.fa"); printf("%s\n%s\n",$1,$2) >> out;}'
ADD COMMENTlink modified 2.1 years ago • written 2.1 years ago by Pierre Lindenbaum120k

When you wrote the lines your intention was:

... {printf("\n");}' input.fasta |\ awk -F '\t' ...

no? Thank you

ADD REPLYlink written 2.1 years ago by felipelira30

no .

ADD REPLYlink written 2.1 years ago by Pierre Lindenbaum120k

So there are two separate lines?

ADD REPLYlink written 2.1 years ago by felipelira30

no .

ADD REPLYlink written 2.1 years ago by Pierre Lindenbaum120k
0
gravatar for st.ph.n
2.1 years ago by
st.ph.n2.4k
Philadelphia, PA
st.ph.n2.4k wrote:

Since you've updated, and asked for any tool, here's another python script.

#!/usr/bin/env python
import sys

with open(sys.argv[1], 'r') as f:
        with open(sys.argv[2], 'w') as out:
                header = next(f).strip()
                myseq = ''.join([line.strip() for line in f]).split('N')
                for n in range(len(myseq)):
                        if 'N' not in myseq[n]:
                                if len(myseq[n]) > 1:
                                        out.write(header + '_' + str(n), '\n', myseq[n]

This takes the input file with a single sequence, grabs the header and sequence. The sequence is split into a list, by N chars. For each item in that list, N is not in that portion of the sequence, and the length of that portion is > 1, it prints out to the new fasta file, the header plus '_n', where n is the index of the portion in the list.

Save as split_fasta.py, run as python split_fasta.py infile.fasta outfile.fasta

ADD COMMENTlink modified 2.1 years ago • written 2.1 years ago by st.ph.n2.4k

It did not work. The output is the same input sequence.

ADD REPLYlink written 2.1 years ago by felipelira30

I've added a new script, that doesn't use bipython.

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by st.ph.n2.4k
0
gravatar for 5heikki
2.1 years ago by
5heikki8.4k
Finland
5heikki8.4k wrote:

Original (fails with multifasta):

awk 'BEGIN{RS="N"; OFS="\n"; i=1}{if(length($NF)>1){print ">seq_"i,$NF; i++}}' seq.fa

New version (should work with everything):

sed 's/^>.*/N/' seqs.fa | awk 'BEGIN{RS="N";OFS="\n";i=1}{if(length($0)>1){print ">seq_"i,$0; i++}}' | grep .

Doesn't matter if there are linebreaks in sequences..

ADD COMMENTlink modified 2.1 years ago • written 2.1 years ago by 5heikki8.4k

This is cool, but there's a little bug?

$ cat seqs.fa 
>mysequence
agtagatgatgatagatgatgatgaNNNNtgttgcatgctagctagctagtcgatcgatc
gatcgtagctagcaNNNtcgatcgatgtagctagctgacaNctagtcgatgca
>another
aaaaaccccccNNNggggggggNttt

$ cat seqs.fa| awk 'BEGIN{RS="N"; OFS="\n"; i=1}{if(length($NF)>1){print ">seq_"i,$NF; i++}}'
>seq_1
agtagatgatgatagatgatgatga
>seq_2
gatcgtagctagca
>seq_3
tcgatcgatgtagctagctgaca
>seq_4
aaaaacccccc
>seq_5
gggggggg
>seq_6
ttt
ADD REPLYlink written 2.1 years ago by shenwei3564.6k
1

This could work, but is not elegant at all

sed 's/^>.*/N/' seqs.fa | awk 'BEGIN{RS="N";OFS="\n";i=1}{if(length($0)>1){print ">seq_"i,$0; i++}}' | grep .
ADD REPLYlink written 2.1 years ago by 5heikki8.4k

Not just a little bug. It doesn't work at all. Lesson of the day: don't post answers while tired.

ADD REPLYlink written 2.1 years ago by 5heikki8.4k
1

Or don't post answers that were not tested.

ADD REPLYlink written 2.1 years ago by shenwei3564.6k

It worked fine with OP's example data though

ADD REPLYlink written 2.1 years ago by 5heikki8.4k

oh, I see I see.


ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by shenwei3564.6k
0
gravatar for shenwei356
2.1 years ago by
shenwei3564.6k
China
shenwei3564.6k wrote:

shell + seqkit:

Steps:

  1. format the FASTA in single line: seqkit seq -s -w 0
  2. replace Ns with \n: sed -r 's/[Nn]+/\n/g'
  3. add number for every line, and convert tab-delimited format to FASTA: cat -n | seqkit tab2fx
  4. rename the FASTA header in format of contig_n: seqkit replace -p '.+' -r 'contig_{nr}'

Sample data:

$ cat seqs.fa 
>mysequence
agtagatgatgatagatgatgatgaNNNNtgttgcatgctagctagctagtcgatcgatc
gatcgtagctagcaNNNtcgatcgatgtagctagctgacaNctagtcgatgca
>another
aaaaaccccccNNNggggggggNttt

Commands:

$ cat seqs.fa | seqkit seq -s -w 0 \
    | sed -r 's/[Nn]+/\n/g' \
    | cat -n | seqkit tab2fx \
    | seqkit replace -p '.+' -r 'contig_{nr}'

>contig_1
agtagatgatgatagatgatgatga
>contig_2
tgttgcatgctagctagctagtcgatcgatcgatcgtagctagca
>contig_3
tcgatcgatgtagctagctgaca
>contig_4
ctagtcgatgca
>contig_5
aaaaacccccc
>contig_6
gggggggg
>contig_7
ttt

Ways to filter by length:

seqkit seq --min-len 100 --max-len 1000
ADD COMMENTlink modified 2.1 years ago • written 2.1 years ago by shenwei3564.6k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 587 users visited in the last hour