Unique sequences in FASTA file (with count and IDs)
1
1
Entering edit mode
3.3 years ago
Explorer ▴ 10

I am trying to find unique sequences along with count and IDs from a FASTA file in R using Biostring. For exmaple

>random sequence 1
tatgtgcgag
>random sequence 2
agggtgttat
>random sequence 3
tatgtgcgag
>random sequence 4
gactcgcggt
>random sequence 5
tatgtgcgag
>random sequence 6
gcagccatcg
>random sequence 7
gactcgcggt
>random sequence 8
tatgtgcgag
>random sequence 9
tatgtgcgag
>random sequence 10
tatgtgcgag

The following code gives me a list of unique sequences

library(Biostrings)
random <- readDNAStringSet("random.fasta")
unique(random) 

DNAStringSet object of length 4:
    width seq                                                                  names               
[1]    10 TATGTGCGAG                                                           random sequence 1
[2]    10 AGGGTGTTAT                                                           random sequence 2
[3]    10 GACTCGCGGT                                                           random sequence 4
[4]    10 GCAGCCATCG                                                           random sequence 6

But I am not sure how to return “count” and “IDs” for each unique sequence and how to remove sequences with ambiguous characters. Can anyone help please? Thanks

R alignment sequence • 2.3k views
ADD COMMENT
0
Entering edit mode

This operation might be a lot easier in bioawk. Do you absolutely need to use R? If so, I'd recommend using dplyr to group_by and summarise

ADD REPLY
0
Entering edit mode

I am trying to learn R but if there is a simpler command in awk, I would really appreciate it if you may share.

ADD REPLY
0
Entering edit mode

Did zx8754's solution work? Like I said, you could use awk but it will be more complicated. Even bioawk may not help if your identifiers have white spaces in them.

ADD REPLY
1
Entering edit mode
3.3 years ago
zx8754 11k

Here is using base R:

# here reading example input, in your case you would use readLines("myFastaFile.txt")
x <- readLines(textConnection(">random sequence 1
tatgtgcgag
>random sequence 2
agggtgttat
>random sequence 3
tatgtgcgag
>random sequence 4
gactcgcggt
>random sequence 5
tatgtgcgag
>random sequence 6
gcagccatcg
>random sequence 7
gactcgcggt
>random sequence 8
tatgtgcgag
>random sequence 9
tatgtgcgag
>random sequence 10
tatgtgcgag"))

d <- data.frame(names = x[c(TRUE, FALSE)], seq = x[c(FALSE, TRUE)])

aggregate(names ~ seq, d, FUN = function(i)c(cnt = length(i), all = toString(i)))
#          seq names.cnt                                                                                                               names.all
# 1 agggtgttat         1                                                                                                      >random sequence 2
# 2 gactcgcggt         2                                                                                  >random sequence 4, >random sequence 7
# 3 gcagccatcg         1                                                                                                      >random sequence 6
# 4 tatgtgcgag         6 >random sequence 1, >random sequence 3, >random sequence 5, >random sequence 8, >random sequence 9, >random sequence 10
ADD COMMENT
0
Entering edit mode

Thank you. Changing data from DNAstring to dataframe works perfectly. I am wondering is there a way to determine frequency along with count in the code?

ADD REPLY
1
Entering edit mode

Yes, just divide by all number of rows:

aggregate(names ~ seq, d, FUN = function(i) c(cnt = length(i), freq = length(i)/nrow(d), all = toString(i)))
#          seq names.cnt names.freq                                                                                                               names.all
# 1 agggtgttat         1        0.1                                                                                                      >random sequence 2
# 2 gactcgcggt         2        0.2                                                                                  >random sequence 4, >random sequence 7
# 3 gcagccatcg         1        0.1                                                                                                      >random sequence 6
# 4 tatgtgcgag         6        0.6 >random sequence 1, >random sequence 3, >random sequence 5, >random sequence 8, >random sequence 9, >random sequence 10
ADD REPLY

Login before adding your answer.

Traffic: 2405 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6