Question: Unique sequences in FASTA file (with count and IDs)

I am trying to find the unique sequences in a FASTA file, along with the count and IDs for each, in R using Biostrings. For example:

```
>random sequence 1
tatgtgcgag
>random sequence 2
agggtgttat
>random sequence 3
tatgtgcgag
>random sequence 4
gactcgcggt
>random sequence 5
tatgtgcgag
>random sequence 6
gcagccatcg
>random sequence 7
gactcgcggt
>random sequence 8
tatgtgcgag
>random sequence 9
tatgtgcgag
>random sequence 10
tatgtgcgag
```

The following code gives me the list of unique sequences:

```
library(Biostrings)

# Read the FASTA file and keep the first occurrence of each distinct sequence
random <- readDNAStringSet("random.fasta")
unique(random)
# DNAStringSet object of length 4:
#     width seq          names
# [1]    10 TATGTGCGAG   random sequence 1
# [2]    10 AGGGTGTTAT   random sequence 2
# [3]    10 GACTCGCGGT   random sequence 4
# [4]    10 GCAGCCATCG   random sequence 6
```

But I am not sure how to return the count and the IDs for each unique sequence, or how to remove sequences with ambiguous characters. Can anyone help, please? Thanks.
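One possible way to get both pieces, sketched here under a few assumptions: the file is the `random.fasta` from the question, "ambiguous" means any character other than A, C, G or T, and the object names `base_counts`, `clean`, `ids_by_seq` and `result` are only illustrative. It relies on `alphabetFrequency()` from Biostrings plus base R's `split()`; it is not necessarily the only or the best approach.

```r
library(Biostrings)

# Assumption: "random.fasta" is the file from the question
random <- readDNAStringSet("random.fasta")

# With baseOnly = TRUE, alphabetFrequency() returns one row per sequence
# with columns A, C, G, T and "other"; "other" counts everything else
# (N, IUPAC ambiguity codes, gaps)
base_counts <- alphabetFrequency(random, baseOnly = TRUE)
clean <- random[base_counts[, "other"] == 0]

# Group the record names by the sequence they belong to
ids_by_seq <- split(names(clean), as.character(clean))

# One row per unique sequence: the sequence, how often it occurs, and its IDs
result <- data.frame(
  sequence = names(ids_by_seq),
  count = lengths(ids_by_seq),
  ids = vapply(ids_by_seq, paste, character(1), collapse = ";"),
  row.names = NULL,
  stringsAsFactors = FALSE
)

# Most frequent sequences first
result <- result[order(-result$count), ]
result
```

In the ten example records, TATGTGCGAG appears six times (records 1, 3, 5, 8, 9 and 10) and GACTCGCGGT twice (records 4 and 7), so those are the counts to check the result against.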

This operation might be a lot easier in bioawk. Do you absolutely need to use R? If so, I'd recommend using dplyr to `group_by` and `summarise`.
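A minimal sketch of what the dplyr route could look like, assuming the sequences are first read with Biostrings as in the question and then flattened into an ordinary data frame. The column names `id`, `sequence`, `count` and `ids` are made up for illustration.

```r
library(Biostrings)
library(dplyr)

# Assumption: "random.fasta" is the file from the question
random <- readDNAStringSet("random.fasta")

# One row per FASTA record: its header and its sequence as plain text
records <- data.frame(
  id = names(random),
  sequence = unname(as.character(random)),
  stringsAsFactors = FALSE
)

# Collapse identical sequences, counting them and collecting their IDs
records %>%
  group_by(sequence) %>%
  summarise(count = n(), ids = paste(id, collapse = ";")) %>%
  arrange(desc(count))
```

Joining the IDs with `;` rather than a space keeps them readable when the headers themselves contain spaces, as they do in the example file.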

I am trying to learn R, but if there is a simpler command in awk, I would really appreciate it if you could share it.

Did zx8754's solution work? Like I said, you could use awk, but it will be more complicated. Even bioawk may not help if your identifiers have whitespace in them.
