Question: sort FASTA file into table
13 months ago, a.bolbukova.12 wrote:

My fasta file looks like this

>seq0
FQTWEEFSRAAEKLYLADPMKVRVVLKYRHVDGNLCIKVTDDLVCLVYRTDQAQDVKKIEKF
>seq1
KYRTWEEFTRAAEKLYQADPMKVRVVLKYRHCDGNLCIKVTDDVVCLLYRTDQAQDVKKIEKFHSQLMRLME LKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
>seq2
EEYQTWEEFARAAEKLYLTDPMKVRVVLKYRHCDGNLCMKVTDDAVCLQYKTDQAQDVKKVEKLHGK
>seq3
MYQVWEEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVCLQYKTDQAQDVK
>seq4
EEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVVSYEMRLFGVQKDNFALEHSLL
>seq5
SWEEFAKAAEVLYLEDPMKCRMCTKYRHVDHKLVVKLTDNHTVLKYVTDMAQDVKKIEKLTTLLMR
>seq6
FTNWEEFAKAAERLHSANPEKCRFVTKYNHTKGELVLKLTDDVVCLQYSTNQLQDVKKLEKLSSTLLRSI
>seq7
SWEEFVERSVQLFRGDPNATRYVMKYRHCEGKLVLKVTDDRECLKFKTDQAQDAKKMEKLNNIFF
>seq8
SWDEFVDRSVQLFRADPESTRYVMKYRHCDGKLVLKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
>seq9
KNWEDFEIAAENMYMANPQNCRYTMKYVHSKGHILLKMSDNVKCVQYRAENMPDLKK
>seq10
FDSWDEFVSKSVELFRNHPDTTRYVVKYRHCEGKLVLKVTDNHECLKFKTDQAQDAKKMEK

How do I make it look like this (table)?

Seq name    amino acid    occurrence
seq0        F             12%
            A             26%
            K             60%

seq1        T             70%
            L             50%
            W             12%, etc.
bash bioinformatics fasta

What have you tried?

written 13 months ago by Pierre Lindenbaum

I added markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button, which is in the toolbar when you compose or edit a post.

written 13 months ago by WouterDeCoster

Hello, Thanks for that!

So far I have this script

esearch_outfile=esearch.txt # file for esearch results

unaligned=seqs.fa # file for sequences downloaded with efetch

aligned=seqs_ALIGNED.fa # file for sequences aligned by Clustal-Omega

echo RUNNING WGET FOR ESEARCH

wget "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=protein&term=((txid6231[Organism:exp]) AND COX1[Gene Name]) NOT partial&usehistory=y&retmode=json" -O $esearch_outfile

webenv_line=`grep webenv ${esearch_outfile}`

webenv=`echo $webenv_line | cut -f 4 -d '"'`

echo RUNNING WGET FOR EFETCH

wget "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?WebEnv=${webenv}&query_key=1&db=protein&rettype=fasta" -O $unaligned

echo RUNNING CLUSTALO

clustalo -i $unaligned > $aligned

echo LAUNCHING CLUSTALX2

clustalx2 $aligned &

echo LAUNCHING PLOTCON

plotcon -sequences $aligned -winsize 100 -graph x11 &

echo DONE

I have no idea how to make it into a table though. The -outfmt 7 command doesn't work...

written 13 months ago by a.bolbukova.12

As far as I understand, this is the script you used to get the data. How have you tried to answer the current question, "sort (sort?) FASTA file into table"?

written 13 months ago by Pierre Lindenbaum
#start sorting the file from here

grep -c "^>" file.fa #counts proteins in fasta file
echo ls |wc -l
echo 'NUMBER OF PROTEINS' 

sed 's/>.*/&protein/' file.fa > outfile.fa # adds word protein to end of all headers

awk '{print $2}' file.fa > output.fa #only second column of the header is outputted

sed -e '/^>/s/$/@/' -e 's/^>/#/' file.fasta | tr -d '\n' | tr "#" "\n" | tr "@" "\t" | sort -u -t $'\t' -f -k 2,2  | sed -e 's/^/>/' -e 's/\t/\n/' #removes all duplicated sequences

while read line;
    do if [ "${line:0:1}" == ">" ]; 
    then echo -e "\n"$line; else echo $line | tr -d '\n' ; fi; 
done < input.fasta > output.fasta

#Once linearized, to pick the sequence for the id 'x' you can use grep -A1 'x' output.fasta

-outfmt 7 gaps | pident |score #make table with column headings

column -t seq_ALIGNED.fa | less -S

grep -w "^++1" * | cut -f7 | awk '{sum += $1} END {print sum}'
grep -w "^++1" * | cut -f8 | awk '{sum += $1} END {print sum}'
grep -w "^++1" * | cut -f15 | awk '{sum += $1} END {print sum}'


awk 'BEGIN{RS=">"}NR>1{sub("\n","\t"); 
gsub("\n",""); 
print RS$0}' file #make table

sort -k2b,2 -k1,1 <sorted.txt #sorts by column 2, then column 1 (ascending order)

This is the list of things I have tried so far.

written 13 months ago by a.bolbukova.12
13 months ago, Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote:

Assuming two lines per sequence record, using awk:

awk '/^>/ {print; next;} {delete H; L=length($0);for(i=1;i<=L;i++) H[substr($0,i,1)]++; for(x in H) print x,int(H[x]/L*100.0);}' input.fasta
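
The one-liner above relies on each sequence sitting on a single line, and it prints the header on its own row. As a rough sketch only, the same counting idea can be extended to records wrapped over several lines and to the three-column layout shown in the question. The function name fasta_aa_table and the tab-separated output are assumptions made for illustration; whitespace inside a sequence (as in seq1) is dropped before counting, which is also an assumption about what is wanted.

```shell
#!/usr/bin/env bash
# Hypothetical helper (the name is illustrative): print one row per residue
# for each FASTA record as "seq name<TAB>amino acid<TAB>occurrence".
fasta_aa_table() {
    awk '
    # Emit the rows for the record collected so far.
    # Names after the extra spaces are local variables/arrays.
    function emit(    i, n, c, k, count, order) {
        n = length(seq)
        if (name == "" || n == 0) return
        k = 0
        for (i = 1; i <= n; i++) {
            c = substr(seq, i, 1)
            if (!(c in count)) order[++k] = c   # remember first-seen order
            count[c]++
        }
        for (i = 1; i <= k; i++)
            printf "%s\t%s\t%d%%\n", name, order[i], int(count[order[i]] * 100 / n)
    }
    /^>/ { emit(); name = substr($0, 2); seq = ""; next }   # new record starts
    { gsub(/[ \t\r]/, ""); seq = seq $0 }                   # join wrapped lines
    END { emit() }
    ' "$@"
}
```

Something like `fasta_aa_table input.fasta | column -t -s $'\t'` should then align the columns for viewing. Percentages are truncated to whole numbers, as in the one-liner.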