Question: Unique sequences in FASTA file (with count and IDs)
1
gravatar for Explorer
20 days ago by
Explorer10
Canada
Explorer10 wrote:

I am trying to find unique sequences along with count and IDs from a FASTA file in R using Biostring. For exmaple

>random sequence 1
tatgtgcgag
>random sequence 2
agggtgttat
>random sequence 3
tatgtgcgag
>random sequence 4
gactcgcggt
>random sequence 5
tatgtgcgag
>random sequence 6
gcagccatcg
>random sequence 7
gactcgcggt
>random sequence 8
tatgtgcgag
>random sequence 9
tatgtgcgag
>random sequence 10
tatgtgcgag

The following code gives me a list of unique sequences

library(Biostrings)
random <- readDNAStringSet("random.fasta")
unique(random) 

DNAStringSet object of length 4:
    width seq                                                                  names               
[1]    10 TATGTGCGAG                                                           random sequence 1
[2]    10 AGGGTGTTAT                                                           random sequence 2
[3]    10 GACTCGCGGT                                                           random sequence 4
[4]    10 GCAGCCATCG                                                           random sequence 6

But I am not sure how to return “count” and “IDs” for each unique sequence and how to remove sequences with ambiguous characters. Can anyone help please? Thanks

alignment sequence R • 133 views
ADD COMMENTlink modified 20 days ago by zx87549.9k • written 20 days ago by Explorer10

This operation might be a lot easier in bioawk. Do you absolutely need to use R? If so, I'd recommend using dplyr to group_by and summarise

ADD REPLYlink written 20 days ago by _r_am32k

I am trying to learn R but if there is a simpler command in awk, I would really appreciate it if you may share.

ADD REPLYlink written 12 days ago by Explorer10

Did zx8754's solution work? Like I said, you could use awk but it will be more complicated. Even bioawk may not help if your identifiers have white spaces in them.

ADD REPLYlink modified 12 days ago • written 12 days ago by _r_am32k
1
gravatar for zx8754
20 days ago by
zx87549.9k
London
zx87549.9k wrote:

Here is using base R:

# here reading example input, in your case you would use readLines("myFastaFile.txt")
x <- readLines(textConnection(">random sequence 1
tatgtgcgag
>random sequence 2
agggtgttat
>random sequence 3
tatgtgcgag
>random sequence 4
gactcgcggt
>random sequence 5
tatgtgcgag
>random sequence 6
gcagccatcg
>random sequence 7
gactcgcggt
>random sequence 8
tatgtgcgag
>random sequence 9
tatgtgcgag
>random sequence 10
tatgtgcgag"))

d <- data.frame(names = x[c(TRUE, FALSE)], seq = x[c(FALSE, TRUE)])

aggregate(names ~ seq, d, FUN = function(i)c(cnt = length(i), all = toString(i)))
#          seq names.cnt                                                                                                               names.all
# 1 agggtgttat         1                                                                                                      >random sequence 2
# 2 gactcgcggt         2                                                                                  >random sequence 4, >random sequence 7
# 3 gcagccatcg         1                                                                                                      >random sequence 6
# 4 tatgtgcgag         6 >random sequence 1, >random sequence 3, >random sequence 5, >random sequence 8, >random sequence 9, >random sequence 10
ADD COMMENTlink written 20 days ago by zx87549.9k

Thank you. Changing data from DNAstring to dataframe works perfectly. I am wondering is there a way to determine frequency along with count in the code?

ADD REPLYlink written 19 days ago by Explorer10
1

Yes, just divide by all number of rows:

aggregate(names ~ seq, d, FUN = function(i) c(cnt = length(i), freq = length(i)/nrow(d), all = toString(i)))
#          seq names.cnt names.freq                                                                                                               names.all
# 1 agggtgttat         1        0.1                                                                                                      >random sequence 2
# 2 gactcgcggt         2        0.2                                                                                  >random sequence 4, >random sequence 7
# 3 gcagccatcg         1        0.1                                                                                                      >random sequence 6
# 4 tatgtgcgag         6        0.6 >random sequence 1, >random sequence 3, >random sequence 5, >random sequence 8, >random sequence 9, >random sequence 10
ADD REPLYlink written 19 days ago by zx87549.9k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1585 users visited in the last hour
_