Question: Frequency of a unique sequence in a fasta file
0
gravatar for Tanvir Ahamed
4.2 years ago by
Sweden
Tanvir Ahamed 270 wrote:

Hi, 

I have a FASTA file containing sequences with id as header . The header for all sequence are unique but some sequences are similar to each other. (all.fasta)   

To get a FASTA file of unique sequences, i used  gt-sequniq commend . (unique.fasta) 

Now i want to find which unique sequence occurs how many times including the header name of similar sequences (Frequency of each unique sequence from all.fasta file ).

Thank you . 

sequencing fasta • 4.1k views
ADD COMMENTlink modified 4.2 years ago • written 4.2 years ago by Tanvir Ahamed 270
3
gravatar for Tanvir Ahamed
4.2 years ago by
Sweden
Tanvir Ahamed 270 wrote:

My solution : 

> Main file : all.fasta

> Unique sequence file, generated from main file : unique.fasta

 gt sequniq -o unique.fasta all.fasta 

> R function to find the similar sequence 

funiq<-function(all.fa,uniq.fa)
        {
        all<-unlist(all.fa)
        sequ.names <- names(uniq.fa)
        sequ <- NULL
        sequ[sequ.names] <- list(NULL)
        for(i in 1: length(uniq.fa))
            {
            sequ[[i]]<-names(all[which(all%in%(uniq.fa[[i]][1]))])
            }
        return(sequ)
        }

> Read FASTA file in R 

library(seqinr)
all<-read.fasta("all.fasta",as.string = TRUE,seqtype="AA")
uniq<-read.fasta("unique.fasta",as.string = TRUE,seqtype="AA")

Name of similar sequences : 

nam<- funiq(all,uniq)

Result : 

$ADC37925
[1] "ADC37925" "EYO75956" "EVE92773"

$AFR73793
[1] "AFR73793" "EPZ11191" "EPZ09632" "EQM92926" "EOR47863" "CAQ50233"

$EFB95474
[1] "EFB95474"

Frequency count : 

fcount<-lapply(nam,length)

Result : 

$ADC37925
[1] 3

$AFR73793
[1] 6

$EFB95474
[1] 1

ADD COMMENTlink written 4.2 years ago by Tanvir Ahamed 270

Well, that was rude! Anyway, glad you solved the problem.

ADD REPLYlink written 4.2 years ago by RamRS21k

Thanks for your help!!

ADD REPLYlink written 4.2 years ago by Tanvir Ahamed 270
1

You're welcome, but just so you know, adding your own answer as the accepted answer after people have given you suggestions and answers is rude.

ADD REPLYlink written 4.2 years ago by RamRS21k

Thanks for your suggestion. 

Definitely i highly expect and prefer peoples suggestions and answers. Thats why we share our problem and thinking.

But the the meaning of "Accepted " answer is most relevant solution of the problem/ question. 

is it ok to accept suggestion as solution or not to share my own solution here ? 

ADD REPLYlink written 4.2 years ago by Tanvir Ahamed 270

Your interpretation of "Accepted" is one of many possible interpretations, but the point is, as long as answers such as 5heikki's give you an optimal solution to your problem, and have been added in a relevant time frame, adding your own answer and accepting only your answer is equivalent to hosting a party, holding a contest and kinda rigging it to win it yourself. It is a crude anomaly, but you get the gist. It is not bad, just rude.

Consider accepting all answers that get you from your question to the desired answer and do not need much effort (in this case, 5heikki's). This is good for people that may have the same problem in the future, so they have choices when it comes to solutions.

ADD REPLYlink written 4.2 years ago by RamRS21k
3
gravatar for 5heikki
4.2 years ago by
5heikki8.4k
Finland
5heikki8.4k wrote:

So, you're looking for a clustering algorithm. I would recommend either Usearch or cd-hit..

ADD COMMENTlink written 4.2 years ago by 5heikki8.4k

I think its not a clustering algorithm !! its more about frequency distribution of each unique sequence . 

ADD REPLYlink written 4.2 years ago by Tanvir Ahamed 270

i.e. cluster size at 100% identity

ADD REPLYlink written 4.2 years ago by 5heikki8.4k

This is a MUCH better option.

ADD REPLYlink written 4.2 years ago by RamRS21k
3
gravatar for umer.zeeshan.ijaz
4.2 years ago by
Glasgow, UK
umer.zeeshan.ijaz1.7k wrote:

Step 1: Linearize your FASTA file

$ awk 'NR==1 {print ; next} {printf /^>/ ? "\n"$0"\n" : $1} END {print}' yourfastafile.fasta > yourfastafile_linearized.fasta

Step 2: Dereplicate your FASTA file and keep an account of unique sequences

$ grep -v "^>" yourfastafile_linearized.fasta | grep -v [^ACGTacgt] | sort -d | uniq -c | while read abundance sequence ; do hash=$(printf "${sequence}" | sha1sum); hash=${hash:0:40};printf ">%s;size=%d;\n%s\n" "${hash}" "${abundance}" "${sequence}"; done >  yourfastafile_linearized_dereplicated.fasta

The output will be in USEARCH format with size giving the frequency of unique reads:

$ head yourfastafile_linearized_dereplicated.fasta
>dff2e3c82439e27949c51a5e45acae06b4a2cafe;size=1;
AAAAAAAGTTTGAATTATGGCGAGAAATAAAAGTCTGAAACATGATTAAACTCCTAAGCAGAAAACCTACCGCGCTTCGCTTGGTCAACCCCTCAGCGGCAAAAATTAAAATTTTTACCGCTTCGGCGTTATAACCTCACACTCAATCTTTTATCACGAAGTCATGATTGAATCGCGAGTGGTCGGCAGATTGCGATAAACGGTCACATTAAATTTAAACTGACTATTCCACTGGAACCACTGAACGGAATCATGGTGGCGAATAAGTACGCGTTCTTGCAAATCACCAGAAGGCGGTTCCTGAATGAATGGGAAGCCTTCAAGAAGGTGATAAGCAGGAGAAACATACGAAGGCGCATAACGATACCACTGACCCTCAGCAATCTTAAACTTCTTAGACGAATCACCAGAACGGAAAACAGCCTTCATAGAAATTTCACGCGGCGG
>a6387c43c45ee62484eafb6c3301a8c0ed0fd546;size=30;
AAAAAAAGTTTGAATTATGGCGAGAAATAAAAGTCTGAAACATGATTAAACTCCTAAGCAGAAAACCTACCGCGCTTCGCTTGGTCAACCCCTCAGCGGCAAAAATTAAAATTTTTACCGCTTCGGCGTTATAACCTCACACTCAATCTTTTATCACGAAGTCATGATTGAATCGCGAGTGGTCGGCAGATTGCGATAAACGGTCACATTAAATTTAACCTGACTATTCCACTGCAACAACTGAACGGACTGGAAACACTGGTCATAATCATGGTGGCGAATAAGTACGCGTTCTTGCAAATCACCAGAAG
>66a3448cf6f606e3f843ea5f81d5a77d461bbf5e;size=13;
AAAAAAAGTTTGAATTATGGCGAGAAATAAAAGTCTGAAACATGATTAAACTCCTAAGCAGAAAACCTACCGCGCTTCGCTTGGTCAACCCCTCAGCGGCAAAAATTAAAATTTTTACCGCTTCGGCGTTATAACCTCACACTCAATCTTTTATCACGAAGTCATGATTGAATCGCGAGTGGTCGGCAGATTGCGATAAACGGTCACATTAAATTTAACCTGACTATTCCACTGCAACAACTGAACGGACTGGAAACACTGGTCATAATCATGGTGGCGAATAAGTACGCGTTCTTGCAAATCACCAGAAGGCGGTTCCTGAATGAATGGGAAGCCTTCA
>0746a116007d044ac40d6ead5d8fa029e034c9a6;size=1;

Best Wishes,

Umer
 

ADD COMMENTlink written 4.2 years ago by umer.zeeshan.ijaz1.7k

Thanks. 

But i am more interested about the members (name or ID) of each group.

ADD REPLYlink written 4.2 years ago by Tanvir Ahamed 270

Step #2 doesn't work

ADD REPLYlink written 2.2 years ago by Israel Barrantes740

When you say something doesn't work, it always helps to give details. A sample of your input, the exact command you ran, and any errors you faced, plus, most importantly, why you think it doesn't "work". Remember, it may seem like it doesn't work while it works perfectly fine.

ADD REPLYlink written 2.2 years ago by RamRS21k
0
gravatar for RamRS
4.2 years ago by
RamRS21k
Houston, TX
RamRS21k wrote:

I wrote a script that deals with the problem that gt-sequniq deals with. You can change it a bit to suit your case: https://github.com/RamRS/myPerlScripts/blob/master/mulFaDeDup.pl

Modify the existing hash to hold sequence:comma-separated list of headers as key:value

Then, add a new hash with header sequence as key and a counter as value, increase counter to number of times the sequence is found. Filter the final hash by value > 1 and write the keys out with their values map the remaining keys back to the previous hash.

This way, you can retain header info as well as get a count.

EDIT; Updated with a HUGE bug fix. .

ADD COMMENTlink modified 4.2 years ago • written 4.2 years ago by RamRS21k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1479 users visited in the last hour