Question

copy files with >50 sequences to a folder

1

7.9 years ago

natasha ▴ 110

Hi

I have a folder with 4000 files in it. I would like to make a new folder and copy into it all of the files that contain >50 fasta sequences. How do I do this?

I know that I need to create a simple loop, then grep '>' | wc -l and select only those with >50. But I am new to programming and am unsure how to write this properly.

fasta loop Bash • 3.5k views

ADD COMMENT • link updated 7.9 years ago by Prakki Rama ★ 2.7k • written 7.9 years ago by natasha ▴ 110

0

There are a number of solutions posted below. Please mark as many as you like as "acceptable" (use the check mark against the answers below). All of them should work for your question.

It is a good practice to accept "solutions" as it shows your appreciation for the effort people put in to bring solutions for your questions.

ADD REPLY • link 7.9 years ago by GenoMax 141k

2

7.9 years ago

pld 5.1k

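from=$1     # directory containing the fasta files
to=$2       # existing directory to copy matching files into
minlen=$3   # copy files with more than this many sequences
for f in "$from"/*
do
    # count the header lines (one per sequence) in the current file
    if [ "$(grep ">" "$f" | wc -l)" -gt "$minlen" ]
    then
        cp "$f" "$to"
    fi
done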

Suppose you save this as move.sh; you would then run it as:

bash move.sh {directory with files} {directory you want them copied to} 50

ADD COMMENT • link 7.9 years ago by pld 5.1k

1

grep has a -c option that counts matching lines, so grep ">" | wc -l can be written as grep -ce ">"
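A minimal sketch of the equivalence, assuming a throwaway demo file created with mktemp (the file name and its contents are made up for illustration). Both forms count matching lines rather than individual '>' characters; the pattern is anchored with ^ here so that only header lines are counted.

```shell
# Hypothetical demo file with three FASTA records, in a temporary directory.
demo_dir=$(mktemp -d)
printf '>seq1\nACGT\n>seq2\nGGCC\n>seq3\nTTAA\n' > "$demo_dir/demo.fasta"

# Count header lines by piping the matches to wc -l ...
count_wc=$(grep '^>' "$demo_dir/demo.fasta" | wc -l)
# ... or let grep count the matching lines itself.
count_c=$(grep -c '^>' "$demo_dir/demo.fasta")

echo "grep | wc -l: $count_wc"
echo "grep -c:      $count_c"
```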

ADD REPLY • link 7.9 years ago by Michael 54k

0

I don't know why I always forget grep can count.

ADD REPLY • link 7.9 years ago by pld 5.1k

0

Because grep being able to count is the UNIX equivalent of feature creep :P

ADD REPLY • link 7.9 years ago by John 13k

0

@joe: "mv" should be replaced with a "cp" to match the original request.

ADD REPLY • link 7.9 years ago by GenoMax 141k

0

Yep, fixed it. Really need to not post first thing in the morning.

ADD REPLY • link 7.9 years ago by pld 5.1k

1

7.9 years ago

5heikki 11k

mkdir newDir; for f in *.fna; do seqCount=$(grep -c "^>" "$f"); if [ "$seqCount" -gt 50 ]; then cp "$f" newDir; fi; done

ADD COMMENT • link 7.9 years ago by 5heikki 11k

1

7.9 years ago

Prakki Rama ★ 2.7k

Copying files with more than 50 sequences into another directory

 mkdir target_folder
 cp -t target_folder $(LC_ALL=C fgrep -c '>' *.fasta | sed 's/:/\t/g' | awk '$2 > 50 {print $0}' | cut -f1)

1st line: mkdir creates a target folder

2nd line: fgrep counts the sequences in each file, AWK will choose those files with more than 50 sequences, cp -t command copies the files into target directory

In the case of moving files into a folder instead of copying, replace cp -t with mv -t
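The same idea can be sketched a little more compactly by letting awk split the "filename:count" output of grep -c on the colon, which removes the need for sed and cut. This is only an illustration under assumptions: the two sample files and the temporary directory are made up, file names are assumed to contain no colons or newlines, and a plain cp in a loop is used in place of cp -t, which is a GNU extension.

```shell
# Hypothetical test data: one file with 2 records, one with 51.
work_dir=$(mktemp -d)
mkdir -p "$work_dir/target_folder"
printf '>a\nACGT\n>b\nACGT\n' > "$work_dir/small.fasta"
awk 'BEGIN { for (i = 1; i <= 51; i++) printf ">s%d\nACGT\n", i }' > "$work_dir/big.fasta"

(
    cd "$work_dir" || exit 1
    # With several input files, grep -c prints one "name:count" line per file.
    # awk splits on ':' and keeps the names whose count is strictly above 50.
    grep -c '^>' *.fasta | awk -F: '$2 > 50 {print $1}' |
    while IFS= read -r f; do
        cp "$f" target_folder/
    done
)

ls "$work_dir/target_folder"
```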

ADD COMMENT • link 7.9 years ago by Prakki Rama ★ 2.7k

0

Why is everyone posting a mv solution when the original request is for copy?

ADD REPLY • link 7.9 years ago by GenoMax 141k

0

exactly my point. It seems not everyone is reading the post properly.

ADD REPLY • link 7.9 years ago by ivivek_ngs ★ 5.2k

0

Original request is for more than 50 ("contain >50 fasta sequences"). A few solutions go with ≥ 50.
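To make the difference concrete, here is a small sketch comparing the two tests at the boundary. The helper names decide_gt and decide_ge are hypothetical and exist only for this illustration; -gt means "strictly greater than", -ge means "greater than or equal to".

```shell
threshold=50

# Hypothetical helpers: report what each comparison would do for a given count.
decide_gt() { if [ "$1" -gt "$threshold" ]; then echo copy; else echo skip; fi; }
decide_ge() { if [ "$1" -ge "$threshold" ]; then echo copy; else echo skip; fi; }

# A file with exactly 50 sequences is the only case where the two disagree.
for count in 49 50 51; do
    echo "count=$count  -gt: $(decide_gt "$count")  -ge: $(decide_ge "$count")"
done
```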

ADD REPLY • link 7.9 years ago by 5heikki 11k

0

@genomax2, @vchris_ngs : Sorry for overlooking! I updated it now. Thanks.

ADD REPLY • link 7.9 years ago by Prakki Rama ★ 2.7k

score 7 · Accepted Answer · 2016-06-13

7

7.9 years ago

venu 7.1k

Here is how

grep -c '^>' *.fasta | sed 's/:/ /' | awk '$2>50 {print $1}' | xargs mv -t new_folder/

Update

grep -c '^>' *.fasta | sed 's/:/ /' | awk '$2>50 {print $1}' | xargs cp -t new_folder/

P.S: Updated as @genomax2 pointed out.

ADD COMMENT • link 7.9 years ago by venu 7.1k

0

Since @natasha asked for "copy", xargs cp -t new_folder/ may be the better choice.

ADD REPLY • link 7.9 years ago by GenoMax 141k

0

Oops! Thanks for pointing out.

ADD REPLY • link 7.9 years ago by venu 7.1k

0

@venu it would be better if you just edit the command as @genomax pointed out.

ADD REPLY • link 7.9 years ago by ivivek_ngs ★ 5.2k

score 5 · Accepted Answer · 2016-06-13

5

7.9 years ago

Giovanni M Dall'Olio 28k

This is a very interesting question, but you should implement it without a for loop; for loops are a bad thing.

For example you can do it with find:

mkdir -p smallfiles_folder/
find . -type f -print0 | xargs -0 grep -m 51 -H -c '^>' | gawk -F":" '$2 > 50 {print $1}' | grep -v 'smallfiles_folder' | xargs -I {} cp {} smallfiles_folder/

Explanation:

# find all files in the current folder. Add -iname "*fasta" to restrict to fasta files only
find . -type f -print0 |

 # count the fasta headers in each file. The -m 51 option stops grepping after 51 matches, which is enough to know a file has more than 50 and saves some calculation time
 xargs -0 grep -m 51 -H -c '^>' |

 # use gawk to select files with more than 50 matches
 gawk -F":" '$2 > 50 {print $1}' |

 # remove the target folder (avoid potential recursive loops)
 grep -v 'smallfiles_folder' |

 # copy / move files to the target folder
 xargs -I {} cp {} smallfiles_folder/
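A hedged variant of the same find/xargs approach that avoids parsing "filename:count" text, so file names containing spaces or colons are handled. It is a sketch under assumptions: the input and selected directories and the sample files are made up, and -print0, -0 and -P are taken to be available (they are in GNU and BSD find/xargs, but are not POSIX). The -P 4 flag asks xargs to run up to four checks at once; whether that helps on an IO-bound task like this one is not something this sketch demonstrates.

```shell
# Hypothetical test data: one file with 51 records (name contains a space), one with 1.
work_dir=$(mktemp -d)
mkdir -p "$work_dir/input" "$work_dir/selected"
awk 'BEGIN { for (i = 1; i <= 51; i++) printf ">s%d\nACGT\n", i }' > "$work_dir/input/many seqs.fasta"
printf '>a\nACGT\n' > "$work_dir/input/few.fasta"

# For each file, count header lines (stopping after 51 matches) and copy the
# file if the count exceeds 50. Inside the sh -c script, $1 is the target
# directory and $2 is the file name that xargs appends.
find "$work_dir/input" -type f -name '*.fasta' -print0 |
xargs -0 -n 1 -P 4 sh -c '
    n=$(grep -m 51 -c "^>" "$2")
    if [ "$n" -gt 50 ]; then cp "$2" "$1"; fi
' sh "$work_dir/selected"

ls "$work_dir/selected"
```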

ADD COMMENT • link 7.9 years ago by Giovanni M Dall'Olio 28k

1

Small nitpick. Even though the explanation part has it right the -m option for grep in actual command is set to 5 instead of 50. Interesting use of -m, makes sense.

Why would a for loop be bad?

ADD REPLY • link 7.9 years ago by GenoMax 141k

0

If I understand correctly, the concern is computing time: if you intend to parallelize, the whole pipeline should be built with xargs rather than a while or for loop, which processes one sample at a time. That way not only the copying into the target folder but also the counting of the fasta headers runs in parallel. This is my understanding; correct me if I am wrong.

ADD REPLY • link 7.9 years ago by ivivek_ngs ★ 5.2k

0

Yes, parallelization is the answer. In this case IO may still be a limiting factor, but as a general rule it is better to get used to avoiding for loops.

ADD REPLY • link 7.9 years ago by Giovanni M Dall'Olio 28k

0

Would be interesting to see if there's any speed up with how IO heavy something like this is.

ADD REPLY • link 7.9 years ago by pld 5.1k

0

Thanks for spotting the typo, I've fixed it now.

ADD REPLY • link 7.9 years ago by Giovanni M Dall'Olio 28k

score 4 · Accepted Answer · 2016-06-13

4

7.9 years ago

Pierre Lindenbaum 161k

loop over the files in DIR1, use awk to print the files having more than 50 fasta sequences, read and copy those files

$ ls DIR1/*.fa | while read F; do  awk '/^>/ {N++;} END {if(N>50) printf("%s\n",FILENAME);}' "${F}" | while read F2; do cp "$F2" DIR2 ; done ; done
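As a possible variation on the command above, a single awk process can tally the headers of all files at once, keyed by FILENAME, instead of starting one awk per file. This is a sketch only: the sample files and the temporary directory are made up, DIR1 and DIR2 are placed inside it for the demo, and file names are assumed to contain no newlines. With several thousand files the expanded glob could in principle exceed the system's argument-length limit, in which case the per-file loop is the safer choice.

```shell
# Hypothetical test data: big.fa has 51 records, small.fa has 2.
work_dir=$(mktemp -d)
mkdir -p "$work_dir/DIR1" "$work_dir/DIR2"
awk 'BEGIN { for (i = 1; i <= 51; i++) printf ">s%d\nACGT\n", i }' > "$work_dir/DIR1/big.fa"
printf '>a\nACGT\n>b\nACGT\n' > "$work_dir/DIR1/small.fa"

# Count header lines per input file, then print the names with more than 50.
awk '/^>/ { n[FILENAME]++ } END { for (f in n) if (n[f] > 50) print f }' "$work_dir"/DIR1/*.fa |
while IFS= read -r f; do
    cp "$f" "$work_dir/DIR2/"
done

ls "$work_dir/DIR2"
```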

ADD COMMENT • link 7.9 years ago by Pierre Lindenbaum 161k