Extract fasta files that are not empty from a directory and subdirectories
5.1 years ago

Dear biostars,

I have a directory containing a few hundred subdirectories, each containing 7 fasta files. A significant fraction of these appear to be empty fasta files. Does anyone have an idea how to extract only those fasta sequences that are not empty?

Cheers,

Sam

fasta recursively • 1.6k views

Will they be truly empty, or will they have a header but no sequence? Wouter's approach may not work if that's the case.


You can use the Unix find command to find empty (zero-byte) files; see for example https://www.cyberciti.biz/faq/unix-linux-find-all-empty-files/
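
As a rough sketch of that idea, the helper below lists files that are not zero bytes. The function name list_nonempty_fasta is made up for illustration, the .fa extension is an assumption about how the files are named, and the -empty test is available in GNU and BSD find but is not part of POSIX. Note that a file holding only a header line still counts as non-empty here.

```shell
# Hypothetical helper: list *.fa files under a directory (default: current)
# that contain at least one byte. Assumes GNU or BSD find for the -empty test.
list_nonempty_fasta() {
  find "${1:-.}" -type f -name "*.fa" ! -empty
}
```

Calling `list_nonempty_fasta . > not_empty.txt` from the top-level directory would write the list to a file.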


Can you try this, samlambrechts299? It assumes single-line FASTA, i.e. the sequence sits on every second line.

 find . -type f -name "*.fa" -exec awk 'NR % 2 == 0 && length >= 1 {print FILENAME; exit}' {} \;
5.1 years ago
ATpoint 81k

A naive solution would be to use find to list all fasta files in the current directory and its subdirectories, and then simply check whether they have any content beyond the headers:

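function FindNonEmpty {
  # Drop header lines, look at the first 10 remaining lines,
  # and count those that are not blank.
  if [[ $(grep -v '^>' "$1" | head | awk NF | wc -l) -gt 0 ]]
    then
      # At least one sequence line: record the absolute path as non-empty.
      realpath "$1" >> not_empty.txt
    else
      # Only headers and/or blank lines: record the path as empty.
      realpath "$1" >> empty.txt
    fi
}; export -f FindNonEmpty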

find . -type f -name "*.fa" | parallel FindNonEmpty {}

not_empty.txt will then list the absolute paths of all files that contain sequence, and empty.txt those of all files that contain only headers or nothing at all.
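
If GNU parallel is not available, the same check can be sketched with find and a single awk call per batch of files. This is only an illustration: the function name list_fasta_with_sequence is made up, the .fa extension is assumed, and every file is read to the end, so it may be slower on very large files.

```shell
# Hypothetical helper: print each *.fa file under a directory (default: current)
# that has at least one non-blank line that is not a header.
list_fasta_with_sequence() {
  find "${1:-.}" -type f -name "*.fa" -exec awk '
    # A non-blank line not starting with ">" counts as sequence;
    # the seen array makes sure each file is reported only once.
    !/^>/ && NF && !(FILENAME in seen) { seen[FILENAME] = 1; print FILENAME }
  ' {} +
}
```

For example, `list_fasta_with_sequence . > not_empty.txt` would collect the paths (relative to the starting directory) of all files with sequence; zero-byte and header-only files are skipped.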
