Question

How to run different fasta files in a same process with its output respective

0

Entering edit mode

3.2 years ago

diego1530 ▴ 80

Dear Biostars,

I have this script based on bash to count the ambiguous nucleotides

input="arabidopsis.fasta"

output="ara.fasta"

while IFS= read line
do
echo $line | grep -v '>' | grep -o "[WSKMYRVHDBN]" | sort | uniq -c
echo $line | grep '>'
done < $input1 >> $output

However, as you can see, my script only read one input and in the same folder I have two more multifasta files. My desire is that all three files read at the same time, run the same process and as a result are three different outputs that correspond to each input. I would greatly appreciate any help in modifying my script to achieve my goal

bash sequence linux • 821 views

ADD COMMENT • link 3.2 years ago by diego1530 ▴ 80

0

Entering edit mode

This is because you have supplied only one input file in your code. Try Something like this in the directory where fasta files are located:

find . -type f -name "*.fa" | while read line; do  grep \> $line > ${line%.fa}.counts.txt ; grep -v \> $line | grep -o "[WSKMYRVHDBN]" | sort | uniq -c>>${line%.fa}.counts.txt; done

This would create new file for each fasta file and new extension ".counts.txt". (for eg. test.fa would generate test.counts.txt". Please format the output.

Suggestions for this loop:

catch the input files in array, then loop over the array or list the files and loop over the list
Do not hard code input or output unless necessary.

There is another way with seqkit, but it needs a little home work:

$ seqkit fx2tab -H -nl -B W -B S -B K -B M -B Y -B R -B V -B H -B D -B B -B N *.fa 

#name   length  W   S   K   M   Y   R   V   H   D   B   N
seq10   4   0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
seq20   4   25.00   0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
seq1    4   0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
seq2    8   12.50   12.50   0.00    0.00    12.50   12.50   0.00    0.00    0.00    0.00    0.00

All the values are expressed as % against total length of the sequence. You need to convert each value into numbers from percentage and length of the sequence, for each base

ADD REPLY • link 3.2 years ago by cpad0112 21k

0

Entering edit mode

Hi, I really appreciate your help and I inform you that your script works very well. Although the results of the ambiguous nucleotides in each file are shown at the bottom without mentioning which sequence had that nucleotide. Is it possible to show me the ambiguous nucleotides in each sequence? I used the first method. Thanks!

ADD REPLY • link 3.2 years ago by diego1530 ▴ 80