Frequency of a unique sequence in a fasta file
4
0
Entering edit mode
6.9 years ago

Hi,

I have a FASTA file containing sequences with id as header . The header for all sequence are unique but some sequences are similar to each other. (all.fasta)

To get a FASTA file of unique sequences, i used  gt-sequniq commend . (unique.fasta)

Now i want to find which unique sequence occurs how many times including the header name of similar sequences (Frequency of each unique sequence from all.fasta file ).

Thank you .

sequencing FASTA • 6.1k views
3
Entering edit mode
6.9 years ago

My solution :

> Main file : all.fasta

> Unique sequence file, generated from main file : unique.fasta

 gt sequniq -o unique.fasta all.fasta

> R function to find the similar sequence

funiq<-function(all.fa,uniq.fa)
{
all<-unlist(all.fa)
sequ.names <- names(uniq.fa)
sequ <- NULL
sequ[sequ.names] <- list(NULL)
for(i in 1: length(uniq.fa))
{
sequ[[i]]<-names(all[which(all%in%(uniq.fa[[i]][1]))])
}
return(sequ)
}

> Read FASTA file in R

library(seqinr)
uniq<-read.fasta("unique.fasta",as.string = TRUE,seqtype="AA")

Name of similar sequences :

nam<- funiq(all,uniq)

Result :

$ADC37925 [1] "ADC37925" "EYO75956" "EVE92773"$AFR73793
[1] "AFR73793" "EPZ11191" "EPZ09632" "EQM92926" "EOR47863" "CAQ50233"

$EFB95474 [1] "EFB95474" Frequency count : fcount<-lapply(nam,length) Result :$ADC37925
[1] 3

$AFR73793 [1] 6$EFB95474
[1] 1

0
Entering edit mode

Well, that was rude! Anyway, glad you solved the problem.

0
Entering edit mode

1
Entering edit mode

0
Entering edit mode

Definitely i highly expect and prefer peoples suggestions and answers. Thats why we share our problem and thinking.

But the the meaning of "Accepted " answer is most relevant solution of the problem/ question.

is it ok to accept suggestion as solution or not to share my own solution here ?

0
Entering edit mode

Your interpretation of "Accepted" is one of many possible interpretations, but the point is, as long as answers such as 5heikki's give you an optimal solution to your problem, and have been added in a relevant time frame, adding your own answer and accepting only your answer is equivalent to hosting a party, holding a contest and kinda rigging it to win it yourself. It is a crude anomaly, but you get the gist. It is not bad, just rude.

Consider accepting all answers that get you from your question to the desired answer and do not need much effort (in this case, 5heikki's). This is good for people that may have the same problem in the future, so they have choices when it comes to solutions.

3
Entering edit mode
6.9 years ago
5heikki 10k

So, you're looking for a clustering algorithm. I would recommend either Usearch or cd-hit..

0
Entering edit mode

I think its not a clustering algorithm !! its more about frequency distribution of each unique sequence .

0
Entering edit mode

i.e. cluster size at 100% identity

0
Entering edit mode

This is a MUCH better option.

3
Entering edit mode
6.9 years ago

Step 1: Linearize your FASTA file

$awk 'NR==1 {print ; next} {printf /^>/ ? "\n"$0"\n" : $1} END {print}' yourfastafile.fasta > yourfastafile_linearized.fasta Step 2: Dereplicate your FASTA file and keep an account of unique sequences $ grep -v "^>" yourfastafile_linearized.fasta | grep -v [^ACGTacgt] | sort -d | uniq -c | while read abundance sequence ; do hash=$(printf "${sequence}" | sha1sum); hash=${hash:0:40};printf ">%s;size=%d;\n%s\n" "${hash}" "${abundance}" "${sequence}"; done >  yourfastafile_linearized_dereplicated.fasta

The output will be in USEARCH format with size giving the frequency of unique reads:

\$ head yourfastafile_linearized_dereplicated.fasta
>dff2e3c82439e27949c51a5e45acae06b4a2cafe;size=1;
AAAAAAAGTTTGAATTATGGCGAGAAATAAAAGTCTGAAACATGATTAAACTCCTAAGCAGAAAACCTACCGCGCTTCGCTTGGTCAACCCCTCAGCGGCAAAAATTAAAATTTTTACCGCTTCGGCGTTATAACCTCACACTCAATCTTTTATCACGAAGTCATGATTGAATCGCGAGTGGTCGGCAGATTGCGATAAACGGTCACATTAAATTTAAACTGACTATTCCACTGGAACCACTGAACGGAATCATGGTGGCGAATAAGTACGCGTTCTTGCAAATCACCAGAAGGCGGTTCCTGAATGAATGGGAAGCCTTCAAGAAGGTGATAAGCAGGAGAAACATACGAAGGCGCATAACGATACCACTGACCCTCAGCAATCTTAAACTTCTTAGACGAATCACCAGAACGGAAAACAGCCTTCATAGAAATTTCACGCGGCGG
>a6387c43c45ee62484eafb6c3301a8c0ed0fd546;size=30;
AAAAAAAGTTTGAATTATGGCGAGAAATAAAAGTCTGAAACATGATTAAACTCCTAAGCAGAAAACCTACCGCGCTTCGCTTGGTCAACCCCTCAGCGGCAAAAATTAAAATTTTTACCGCTTCGGCGTTATAACCTCACACTCAATCTTTTATCACGAAGTCATGATTGAATCGCGAGTGGTCGGCAGATTGCGATAAACGGTCACATTAAATTTAACCTGACTATTCCACTGCAACAACTGAACGGACTGGAAACACTGGTCATAATCATGGTGGCGAATAAGTACGCGTTCTTGCAAATCACCAGAAG
>66a3448cf6f606e3f843ea5f81d5a77d461bbf5e;size=13;
AAAAAAAGTTTGAATTATGGCGAGAAATAAAAGTCTGAAACATGATTAAACTCCTAAGCAGAAAACCTACCGCGCTTCGCTTGGTCAACCCCTCAGCGGCAAAAATTAAAATTTTTACCGCTTCGGCGTTATAACCTCACACTCAATCTTTTATCACGAAGTCATGATTGAATCGCGAGTGGTCGGCAGATTGCGATAAACGGTCACATTAAATTTAACCTGACTATTCCACTGCAACAACTGAACGGACTGGAAACACTGGTCATAATCATGGTGGCGAATAAGTACGCGTTCTTGCAAATCACCAGAAGGCGGTTCCTGAATGAATGGGAAGCCTTCA
>0746a116007d044ac40d6ead5d8fa029e034c9a6;size=1;

Best Wishes,

Umer

0
Entering edit mode

Thanks.

But i am more interested about the members (name or ID) of each group.

0
Entering edit mode

Step #2 doesn't work

0
Entering edit mode

When you say something doesn't work, it always helps to give details. A sample of your input, the exact command you ran, and any errors you faced, plus, most importantly, why you think it doesn't "work". Remember, it may seem like it doesn't work while it works perfectly fine.

0
Entering edit mode
6.9 years ago
Ram 35k

I wrote a script that deals with the problem that gt-sequniq deals with. You can change it a bit to suit your case: https://github.com/RamRS/myPerlScripts/blob/master/mulFaDeDup.pl

Modify the existing hash to hold sequence:comma-separated list of headers as key:value

Then, add a new hash with header sequence as key and a counter as value, increase counter to number of times the sequence is found. Filter the final hash by value > 1 and write the keys out with their values map the remaining keys back to the previous hash.

This way, you can retain header info as well as get a count.

EDIT; Updated with a HUGE bug fix. .