Question: Replace list of sequence names with sequences from a fasta file using R
3 months ago by
shelley.w.peterson10 wrote:

I have a list of sequence names (A1, A2, A3, A1, A1, A2 etc) and a fasta file with the names and sequences, and I am trying to find a way to replace each item on the list with the corresponding sequence from the fasta file.

I've used:

test <- sequences[names(sequences) %in% list]

which extracts A1, A2 and A3 only once each and drops the repeated entries from my list. What am I missing?

list of sequence names:

A1
A2
A3
A1
A1
A2

fasta file:

>A1
ATCATC
>A2
CCCGGG
>A3
GTGTGT
>A4
TCTATC
>A5
ATCTAC

output:

>A1
ATCATC
>A2
CCCGGG
>A3
GTGTGT

Desired output:

ATCATC
CCCGGG
GTGTGT
ATCATC
ATCATC
CCCGGG

modified 3 months ago by RamRS26k • written 3 months ago by shelley.w.peterson10

Please give representative in/output.

written 3 months ago by ATpoint31k

  1. Please format your post better. I've done it for you this time.
  2. This shows what you have and what you need, but not what you've tried. Your single line of R code does not show how you read or write files, so we don't know the packages or functions you're using.
written 3 months ago by RamRS26k

Is there an instruction segment for how to properly format a post? I was proud enough of myself for thinking to put in "< br >" when pressing the enter button didn't work. I'm a biologist not a programmer, so I don't know these things.

modified 3 months ago • written 3 months ago by shelley.w.peterson10

Apologies, we do not have a manual for the formatting bar yet. You did a great job with the <br> tags, but the formatting bar is your toolbelt for most tasks.

written 3 months ago by RamRS26k

@shelley.w.peterson, try dedicated FASTA/FASTQ manipulation tools such as seqtk or seqkit. If you want to stay in R, the code is as follows:

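> library(Biostrings)
> # Read the FASTA file; record names are taken from the header lines
> test <- readDNAStringSet("test.fa", format = "fasta")
> # Read the list of names, one per line ('ids' avoids masking base::names)
> ids <- read.csv("file.txt", header = FALSE, stringsAsFactors = FALSE, strip.white = TRUE)
> ids
  V1
1 A1
2 A2
3 A3
4 A1
5 A1
6 A2
> # Index by name: the order and the duplicates in ids$V1 are preserved
> data.frame("sequences"=test[ids$V1])
  sequences
1    ATCATC
2    CCCGGG
3    GTGTGT
4    ATCATC
5    ATCATC
6    CCCGGG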
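
For anyone wondering why the original `%in%` attempt returned only three records, here is a minimal base-R sketch of the difference. It uses a plain named character vector as a stand-in for the DNAStringSet (an assumption for illustration; the object names `sequences`, `ids`, `filtered`, `by_name` and `by_match` are made up here), but the idea is the same as indexing the sequence set by the column of names above.

```r
# Toy stand-ins for the FASTA records and the ID list from the question
# (hypothetical objects; a real workflow would read these from files).
sequences <- c(A1 = "ATCATC", A2 = "CCCGGG", A3 = "GTGTGT",
               A4 = "TCTATC", A5 = "ATCTAC")
ids <- c("A1", "A2", "A3", "A1", "A1", "A2")

# %in% asks "is this record's name in ids?" once per record, so every
# record is kept at most once and the result follows the file order.
filtered <- sequences[names(sequences) %in% ids]

# Indexing by name looks up each element of ids in turn, so the order
# and the duplicates of ids are preserved.
by_name <- sequences[ids]

# match() performs the same lookup through integer positions.
by_match <- sequences[match(ids, names(sequences))]

print(length(filtered))  # number of records kept by the %in% filter
print(unname(by_name))   # sequences in the order given by ids
```

In short, `%in%` is a filter on the records, whereas indexing by name (or through `match()`) is a lookup driven by the list, which is what the desired output needs.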
modified 3 months ago • written 3 months ago by cpad011212k

Thanks so much!!!! As someone who is new to coding, sometimes it's hard to figure out if I'm using the wrong tool/command or if I'm using the correct one the wrong way -_-'

written 3 months ago by shelley.w.peterson10

I started the same way and learnt along the way. Keep visiting Biostars, @shelley.w.peterson.

written 3 months ago by cpad011212k