Question

Unique sequences in FASTA file (with count and IDs)

1

Entering edit mode

3.3 years ago

Explorer ▴ 10

I am trying to find unique sequences along with count and IDs from a FASTA file in R using Biostring. For exmaple

>random sequence 1
tatgtgcgag
>random sequence 2
agggtgttat
>random sequence 3
tatgtgcgag
>random sequence 4
gactcgcggt
>random sequence 5
tatgtgcgag
>random sequence 6
gcagccatcg
>random sequence 7
gactcgcggt
>random sequence 8
tatgtgcgag
>random sequence 9
tatgtgcgag
>random sequence 10
tatgtgcgag

The following code gives me a list of unique sequences

library(Biostrings)
random <- readDNAStringSet("random.fasta")
unique(random) 

DNAStringSet object of length 4:
    width seq                                                                  names               
[1]    10 TATGTGCGAG                                                           random sequence 1
[2]    10 AGGGTGTTAT                                                           random sequence 2
[3]    10 GACTCGCGGT                                                           random sequence 4
[4]    10 GCAGCCATCG                                                           random sequence 6

But I am not sure how to return “count” and “IDs” for each unique sequence and how to remove sequences with ambiguous characters. Can anyone help please? Thanks

R alignment sequence • 2.3k views

ADD COMMENT • link updated 3.3 years ago by zx8754 11k • written 3.3 years ago by Explorer ▴ 10

0

Entering edit mode

This operation might be a lot easier in bioawk. Do you absolutely need to use R? If so, I'd recommend using dplyr to group_by and summarise

ADD REPLY • link 3.3 years ago by Ram 43k

0

Entering edit mode

I am trying to learn R but if there is a simpler command in awk, I would really appreciate it if you may share.

ADD REPLY • link 3.3 years ago by Explorer ▴ 10

0

Entering edit mode

Did zx8754's solution work? Like I said, you could use awk but it will be more complicated. Even bioawk may not help if your identifiers have white spaces in them.

ADD REPLY • link 3.3 years ago by Ram 43k

score 1 · Answer 1 · 2021-01-06

Here is using base R:

# here reading example input, in your case you would use readLines("myFastaFile.txt")
x <- readLines(textConnection(">random sequence 1
tatgtgcgag
>random sequence 2
agggtgttat
>random sequence 3
tatgtgcgag
>random sequence 4
gactcgcggt
>random sequence 5
tatgtgcgag
>random sequence 6
gcagccatcg
>random sequence 7
gactcgcggt
>random sequence 8
tatgtgcgag
>random sequence 9
tatgtgcgag
>random sequence 10
tatgtgcgag"))

d <- data.frame(names = x[c(TRUE, FALSE)], seq = x[c(FALSE, TRUE)])

aggregate(names ~ seq, d, FUN = function(i)c(cnt = length(i), all = toString(i)))
#          seq names.cnt                                                                                                               names.all
# 1 agggtgttat         1                                                                                                      >random sequence 2
# 2 gactcgcggt         2                                                                                  >random sequence 4, >random sequence 7
# 3 gcagccatcg         1                                                                                                      >random sequence 6
# 4 tatgtgcgag         6 >random sequence 1, >random sequence 3, >random sequence 5, >random sequence 8, >random sequence 9, >random sequence 10