Question

sort FASTA file into table


6.2 years ago

a.bolbukova.12 • 0

My fasta file looks like this

>seq0
FQTWEEFSRAAEKLYLADPMKVRVVLKYRHVDGNLCIKVTDDLVCLVYRTDQAQDVKKIEKF
>seq1
KYRTWEEFTRAAEKLYQADPMKVRVVLKYRHCDGNLCIKVTDDVVCLLYRTDQAQDVKKIEKFHSQLMRLME LKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
>seq2
EEYQTWEEFARAAEKLYLTDPMKVRVVLKYRHCDGNLCMKVTDDAVCLQYKTDQAQDVKKVEKLHGK
>seq3
MYQVWEEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVCLQYKTDQAQDVK
>seq4
EEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVVSYEMRLFGVQKDNFALEHSLL
>seq5
SWEEFAKAAEVLYLEDPMKCRMCTKYRHVDHKLVVKLTDNHTVLKYVTDMAQDVKKIEKLTTLLMR
>seq6
FTNWEEFAKAAERLHSANPEKCRFVTKYNHTKGELVLKLTDDVVCLQYSTNQLQDVKKLEKLSSTLLRSI
>seq7
SWEEFVERSVQLFRGDPNATRYVMKYRHCEGKLVLKVTDDRECLKFKTDQAQDAKKMEKLNNIFF
>seq8
SWDEFVDRSVQLFRADPESTRYVMKYRHCDGKLVLKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
>seq9
KNWEDFEIAAENMYMANPQNCRYTMKYVHSKGHILLKMSDNVKCVQYRAENMPDLKK
>seq10
FDSWDEFVSKSVELFRNHPDTTRYVVKYRHCEGKLVLKVTDNHECLKFKTDQAQDAKKMEK

How do I make it look like this (table)?

Seq name    amino acid    occurrence
seq0        F             12%
            A             26%
            K             60%

seq1        T             70%
            L             50%
            W             12%
etc.

FASTA bash • 1.7k views

ADD COMMENT • link updated 12 months ago by Ram 43k • written 6.2 years ago by a.bolbukova.12 • 0


what have you tried ?

ADD REPLY • link 6.2 years ago by Pierre Lindenbaum 161k


I added markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post, that button is in your toolbar; see the image below:

[Image: 101010 button]

ADD REPLY • link 6.2 years ago by WouterDeCoster 47k


Hello, Thanks for that!

So far I have this script

esearch_outfile=esearch.txt   # file for esearch results
unaligned=seqs.fa             # file for sequences downloaded with efetch
aligned=seqs_ALIGNED.fa       # file for sequences aligned by Clustal-Omega

echo RUNNING WGET FOR ESEARCH
# write the esearch JSON to $esearch_outfile (the variable defined above)
wget "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=protein&term=((txid6231[Organism:exp]) AND COX1[Gene Name]) NOT partial&usehistory=y&retmode=json" -O "$esearch_outfile"

# pull the WebEnv value out of the JSON response
webenv_line=$(grep webenv "$esearch_outfile")
webenv=$(echo "$webenv_line" | cut -f 4 -d '"')

echo RUNNING WGET FOR EFETCH
wget "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?WebEnv=${webenv}&query_key=1&db=protein&rettype=fasta" -O "$unaligned"

echo RUNNING CLUSTALO
clustalo -i "$unaligned" > "$aligned"

echo LAUNCHING CLUSTALX2
clustalx2 "$aligned" &

echo LAUNCHING PLOTCON
plotcon -sequences "$aligned" -winsize 100 -graph x11 &

echo DONE

I have no idea how to make it into a table, though. The -outfmt 7 command doesn't work...

ADD REPLY • link 6.2 years ago by a.bolbukova.12 • 0


As far as I understand, this is the script you used to get the data. How have you tried to answer the current question, "sort (sort?) FASTA file into table"?

ADD REPLY • link 6.2 years ago by Pierre Lindenbaum 161k


#start sorting the file from here

grep -c "^>" file.fa #counts proteins in fasta file
echo ls |wc -l
echo 'NUMBER OF PROTEINS' 

sed 's/>.*/&protein/' file.fa > outfile.fa # adds word protein to end of all headers

awk '{print $2}' file.fa > output.fa #only second column of the header is outputted

sed -e '/^>/s/$/@/' -e 's/^>/#/' file.fasta | tr -d '\n' | tr "#" "\n" | tr "@" "\t" | sort -u -t $'\t' -f -k 2,2  | sed -e 's/^/>/' -e 's/\t/\n/' #removes all duplicated sequences

while read line;
    do if [ "${line:0:1}" == ">" ]; 
    then echo -e "\n"$line; else echo $line | tr -d '\n' ; fi; 
done < input.fasta > output.fasta

#Once linearized, to pick the sequence for the id 'x' you can use grep -A1 'x' output.fasta

-outfmt 7 gaps | pident |score #make table with column headings

column -t seq_ALIGNED.fa | less -S

grep -w "^++1" * | cut -f7 | awk '{sum += $1} END {print sum}'
grep -w "^++1" * | cut -f8 | awk '{sum += $1} END {print sum}'
grep -w "^++1" * | cut -f15 | awk '{sum += $1} END {print sum}'


awk 'BEGIN{RS=">"}NR>1{sub("\n","\t"); 
gsub("\n",""); 
print RS$0}' file #make table

sort -k2b,2 -k1,1 <sorted.txt #sorts by column 2, then column 1 (ascending by default)

This is the list of things I have tried so far.

ADD REPLY • link 6.2 years ago by a.bolbukova.12 • 0

score 0 · Answer 1 · 2018-03-18


6.2 years ago

Pierre Lindenbaum 161k

Assuming two lines per sequence record, using awk:

awk '/^>/ {print; next;} {delete H; L=length($0);for(i=1;i<=L;i++) H[substr($0,i,1)]++; for(x in H) print x,int(H[x]/L*100.0);}' input.fasta
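If the records are not strictly two lines each (sequences wrapped over several lines, or containing stray spaces as in seq1 above), a variant along the following lines might help. It is only a sketch under some assumptions: the function name `fasta_composition` is made up for illustration, the sequence name is taken to be the first word of the header, residues are listed in order of first appearance, and the output is tab-separated with the name shown only on the first row of each sequence, mirroring the layout asked for in the question.

```shell
#!/usr/bin/env bash
# Sketch: per-sequence amino-acid composition as a tab-separated table.
# fasta_composition is a hypothetical helper name, not something from the thread.
fasta_composition() {
    awk '
    # Print one row per distinct residue of the record collected so far.
    # Names after the extra spaces are awk-style local variables.
    function flush_record(    n, i, c, k, label, count, order) {
        n = length(seq)
        if (name == "" || n == 0) return
        k = 0
        for (i = 1; i <= n; i++) {
            c = substr(seq, i, 1)
            if (!(c in count)) order[++k] = c   # remember first-appearance order
            count[c]++
        }
        label = name
        for (i = 1; i <= k; i++) {
            printf "%s\t%s\t%.1f%%\n", label, order[i], 100 * count[order[i]] / n
            label = ""                          # name only on the first row
        }
    }
    BEGIN { print "Seq name\tamino acid\toccurrence" }
    /^>/  { flush_record(); name = substr($1, 2); seq = ""; next }
          { gsub(/[ \t\r]/, ""); seq = seq $0 } # join wrapped lines, drop whitespace
    END   { flush_record() }
    ' "$@"
}

# Usage (input.fasta is a placeholder file name):
#   fasta_composition input.fasta
```

Keeping the name on every row instead (drop the `label = ""` line) is probably more convenient if the table is going to be sorted or filtered afterwards.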

ADD COMMENT • link 6.2 years ago by Pierre Lindenbaum 161k