How to find the fasta file with the maximum number of amino acids?
3
1
13 months ago

Hi,

I'm learning to work with Unix for bioinformatics and I have a Word file with some 60 protein sequences pasted in it. Is there a way to find which protein has the maximum number of amino acids and which has the minimum?

Thank you

protein fasta sequence • 1.1k views
0

What have you tried?

0

Sorry for replying late. I recently installed Bioconductor in R and was trying to do it there instead of in the terminal on my Mac. My file is saved in .fa format.

1
13 months ago
Mensur Dlakic ★ 20k

If you are asking whether it can be done in the Word application, the answer is no. Most commonly used bioinformatics programs don't read Word files, so it would be difficult to do it in other programs as well using the format you have. However, if you convert your file into plain text, there are tools that can do the task. The exact way of doing it will depend on your comfort in using these tools since most of them are not "point and click" applications. If you tell us what you have tried and what resources are available to you, it would be easier to give advice.

It may help to go through the results of a simple Google search for shortest and longest sequence in fasta file.

1
13 months ago

Pretty sure I'm going to be downvoted for this post.

You can paste your protein sequences into Excel and use the LEN() function, which returns the length of a string. You can then sort the sequences by length, or use the MIN() and MAX() functions to get the minimum and maximum sequence lengths.

0

Nah, there is no downvote option here.

Here is my question: is there a simple way of pasting protein sequences, presumably in FASTA format, from Word to Excel? All the ways I can think of involve more work than doing it outside of Word/Excel. However, this is a way of doing it using only Microsoft applications, so it may be preferable to those who like them.

0

I can't think of an easy way to put sequences into an Excel sheet. It would probably require multiple uses of the "Find and replace" option. To be clear, my answer was a bit satirical.

0

OP didn't bother to say if the files were fasta format or not. There might be a way to make a macro to convert a fasta into single-line entries like that, but if one is going to learn how to program to do this, learning how to make macros might not be the best choice.

1
13 months ago
ATpoint 64k

In response to your comment that the file is now in fasta format: you can use either R with Biostrings (which you seem to be learning now) or awk.

Example data:

cat test.fa
>chr1
ATGCTAGCTAGCATCG
>chr2
TAGC
>chr3
GATCGATCGATCG
>chr4
TGACTGATCGACTAGCTAGCTACGTACGTACGATGCA
>chr5
GATCGATCGTACGATCG


1) Solution in R:

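library(Biostrings)

#/ read the fasta from disk (the example data is DNA;
#/ use readAAStringSet() instead for protein sequences)
fa <- readDNAStringSet("test.fa")

#/ get shortest and longest via width()
w <- width(fa)
fa_final <- fa[c(which(w==min(w)), which(w==max(w)))]

#/ save back to disk:
writeXStringSet(fa_final, "test2.fa")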


2) Solution with awk (people much better at awk than me can for sure squeeze this into a single command):

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < test.fa \
| awk 'OFS="\t" {print $1,$2, length($2) | "sort -k3,3n"}' \
| awk '{ if(NR ==1){print$1"\n"$2 }}END {print$1"\n"$2}'
>chr2
TAGC
>chr4
TGACTGATCGACTAGCTAGCTACGTACGTACGATGCA


First linearize the fasta (two columns tab separated), then print an additional column with the seq length, sort by length so shortest is the first and longest the last entry, then select first and last entry, and write back to fasta format.
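
As a rough illustration of how the same steps could be folded into a single awk call, here is a minimal sketch. It is not the code from the answer above: the function name `minmax_fasta` is made up for this example, and it assumes a well-formed fasta in which every sequence follows a `>` header line. Instead of sorting, it keeps a running shortest and longest record while reading the file once. On ties it keeps the first record seen, whereas the R version returns all ties.

```shell
# Recreate the example data used in this answer.
cat > test.fa <<'EOF'
>chr1
ATGCTAGCTAGCATCG
>chr2
TAGC
>chr3
GATCGATCGATCG
>chr4
TGACTGATCGACTAGCTAGCTACGTACGTACGATGCA
>chr5
GATCGATCGTACGATCG
EOF

# minmax_fasta is a hypothetical helper name (an assumption of this sketch).
# It prints the shortest record, then the longest, in fasta format.
minmax_fasta() {
  awk '
    # Compare the record just finished against the running min and max.
    function finish(   len) {
      if (name == "") return
      len = length(seq)
      if (n == 0 || len < minlen) { minlen = len; minname = name; minseq = seq }
      if (n == 0 || len > maxlen) { maxlen = len; maxname = name; maxseq = seq }
      n++
    }
    # Header line: close the previous record and start a new one.
    /^>/ { finish(); name = $0; seq = ""; next }
    # Sequence line: strip stray whitespace and append (handles wrapped sequences).
    { gsub(/[ \t\r]/, ""); seq = seq $0 }
    END {
      finish()
      if (n > 0) { print minname; print minseq; print maxname; print maxseq }
    }
  ' "$1"
}

minmax_fasta test.fa
```

Because the whole header line is stored as the record name, headers that contain spaces are carried through unchanged. The header and sequence are never split into whitespace-separated columns, so there is no column-based sort to be confused by them.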

0

Thank you, let me see if there are any tutorials on awk first for me to read this code better. Thanks again

0

I added one in R as well if you are interested.

0

Oh yes, that one seems less complex. Thank you so much.