Question

trying to remove empty fasta sequence then creating a new file

0

Entering edit mode

2.4 years ago

rt2421 • 0

I am working in python

from Bio import *
    for current_seq in SeqIO.parse("mysequences.fasta", "fasta"):
        if current_seq.seq.tostring() != '':
            print( ">", current_seq.id, "\n", current_seq.seq.tostring())

I have this so far but there is an error. Also how would I create a file of the cleaned up version?

python fasta • 1.6k views

ADD COMMENT • link updated 2.4 years ago by cpad0112 21k • written 2.4 years ago by rt2421 • 0

0

Entering edit mode

Please clarify if you need assistance with python code or other solutions will work.

Use 101010 button to format code portion of your post properly. I have done it for you this time.

ADD REPLY • link 2.4 years ago by GenoMax 142k

0

Entering edit mode

First you want to collect the good records as a list and then write to file. Biopython writing FASTA files is quite well covered in the the documentation here. The first example there under 'Examples: Input/Output Example - Filtering by sequence length' is fairly similar to your needs and so it should be straightforward to adapt.

ADD REPLY • link 2.4 years ago by Wayne ★ 2.0k

score 0 · Answer 1 · 2021-12-20

0

Entering edit mode

2.4 years ago

ATpoint 82k

Using bioawk, assuming the file is sequences.fa:

bioawk -c fastx '{ print $name, $seq, length($seq) }' < sequences.fa | awk 'OFS="\n" {if($3>0) print $1, $2}'

ADD COMMENT • link 2.4 years ago by ATpoint 82k

0

Entering edit mode

Thank you for the help. The files I am using are fasta files. Also, would this command create the edited file?

ADD REPLY • link 2.4 years ago by rt2421 • 0

0

Entering edit mode

.fa is a common suffix for fasta, and yes, the output is the cleaned file.

Example:

>chr1
>chr2
TAGCTAGCT

$ bioawk -c fastx '{ print $name, $seq, length($seq) }' < sequences.fa | awk 'OFS="\n" {if($3>0) print ">"$1, $2}'
chr2
TAGCTAGCT

ADD REPLY • link 2.4 years ago by ATpoint 82k

1

Entering edit mode

$ bioawk -c fastx '{ print $name, $seq, length($seq) }' < test.fa | awk 'OFS="\n" {if($3>0) print ">"$1, $2}'

Output would be in fasta format.

ADD REPLY • link 2.4 years ago by cpad0112 21k

0

Entering edit mode

Ah, forgot the damn >, good catch. Edited.

ADD REPLY • link 2.4 years ago by ATpoint 82k

score 0 · Answer 2 · 2021-12-21

outstide python:

$ cutadapt -m 1 test.fa
$ awk -v RS='>' '{print ($2!="")?">"$1"\n"$2:""}' test.fa | awk NF
$ seqkit -w 0 seq -gm 1 test.fa

 >chr2
 TAGCTAGCT

For python:

from Bio import SeqIO
with open("new.fa", "w") as o:
    for current_seq in SeqIO.parse("test.fa", "fasta"):
        if len(current_seq.seq) != 0 :
            SeqIO.write(current_seq, o, "fasta")

test.fa is input fasta and new.fa is output fasta. Output fasta will be written in the same directory. If you are familiar with list comprehension in python, following is the way:

from Bio import SeqIO
old_seq=SeqIO.parse("test.fa", "fasta")
new_seq=[seq for seq in old_seq if len(seq.seq) != 0]
SeqIO.write(new_seq, "new.fa", "fasta")

Please read biopython documentation/tutorial. https://biopython.org/wiki/SeqIO. There is an example for filtering sequences by length.