trying to remove empty fasta sequence then creating a new file
2
0
Entering edit mode
2.4 years ago
rt2421 • 0

I am working in python

from Bio import *
    for current_seq in SeqIO.parse("mysequences.fasta", "fasta"):
        if current_seq.seq.tostring() != '':
            print( ">", current_seq.id, "\n", current_seq.seq.tostring())

I have this so far but there is an error. Also how would I create a file of the cleaned up version?

python fasta • 1.6k views
ADD COMMENT
0
Entering edit mode

Please clarify if you need assistance with python code or other solutions will work.

Use 101010 button to format code portion of your post properly. I have done it for you this time.

ADD REPLY
0
Entering edit mode

First you want to collect the good records as a list and then write to file. Biopython writing FASTA files is quite well covered in the the documentation here. The first example there under 'Examples: Input/Output Example - Filtering by sequence length' is fairly similar to your needs and so it should be straightforward to adapt.

ADD REPLY
0
Entering edit mode
2.4 years ago
ATpoint 82k

Using bioawk, assuming the file is sequences.fa:

bioawk -c fastx '{ print $name, $seq, length($seq) }' < sequences.fa | awk 'OFS="\n" {if($3>0) print $1, $2}'
ADD COMMENT
0
Entering edit mode

Thank you for the help. The files I am using are fasta files. Also, would this command create the edited file?

ADD REPLY
0
Entering edit mode

.fa is a common suffix for fasta, and yes, the output is the cleaned file.

Example:

>chr1
>chr2
TAGCTAGCT

$ bioawk -c fastx '{ print $name, $seq, length($seq) }' < sequences.fa | awk 'OFS="\n" {if($3>0) print ">"$1, $2}'
chr2
TAGCTAGCT
ADD REPLY
1
Entering edit mode
$ bioawk -c fastx '{ print $name, $seq, length($seq) }' < test.fa | awk 'OFS="\n" {if($3>0) print ">"$1, $2}'

Output would be in fasta format.

ADD REPLY
0
Entering edit mode

Ah, forgot the damn >, good catch. Edited.

ADD REPLY
0
Entering edit mode
2.4 years ago

outstide python:

$ cutadapt -m 1 test.fa
$ awk -v RS='>' '{print ($2!="")?">"$1"\n"$2:""}' test.fa | awk NF
$ seqkit -w 0 seq -gm 1 test.fa

 >chr2
 TAGCTAGCT

For python:

from Bio import SeqIO
with open("new.fa", "w") as o:
    for current_seq in SeqIO.parse("test.fa", "fasta"):
        if len(current_seq.seq) != 0 :
            SeqIO.write(current_seq, o, "fasta")

test.fa is input fasta and new.fa is output fasta. Output fasta will be written in the same directory. If you are familiar with list comprehension in python, following is the way:

from Bio import SeqIO
old_seq=SeqIO.parse("test.fa", "fasta")
new_seq=[seq for seq in old_seq if len(seq.seq) != 0]
SeqIO.write(new_seq, "new.fa", "fasta")

Please read biopython documentation/tutorial. https://biopython.org/wiki/SeqIO. There is an example for filtering sequences by length.

ADD COMMENT

Login before adding your answer.

Traffic: 1365 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6