Question: Extract sequences from fastA files based on csv chart in R
0
gravatar for shelley.w.peterson
11 months ago by
shelley.w.peterson10 wrote:

I have 2 fasta files - we'll call them A and B, each with several thousand sequences labelled A1, A2, A3 or B1, B2, B3 etc.

I also have a .csv file that has 3 columns:

ST1   A1   B1
ST2   A1   B2
ST3   A3   B1

I need to combine these all into a single fasta file that looks like this:

ST1
"A1DNASEQUENCE""B1DNASEQUENCE"
ST2
"A1DNASEQUENCE""B2DNASEQUENCE"
ST3
"A3DNASEQUENCE""B1DNASEQUENCE"

I'm not even sure where to start with this one... anyone have any advice?

Thanks in advance!

sequencing R • 325 views
ADD COMMENTlink modified 11 months ago • written 11 months ago by shelley.w.peterson10

I guess that should be solvable by writing a script in biopython utilising the seqIO library. Alternatively, you can try it with command line tools like awk, cut, fasgrep, and join to loop over you CSV file.

ADD REPLYlink written 11 months ago by michael.ante3.6k

I realized I forgot to state that I specifically need this for R. I've actually already written this for perl and it works great, but I'm trying to rewrite it for R for some of my labmates who have trouble with command line programs. The ultimate goal is to rewrite all of my perl scripts to be used with R SHINY, but I'm not familiar enough with R yet to know all of the available methods of data manipulation and what I did in perl won't work in R.

ADD REPLYlink modified 11 months ago • written 11 months ago by shelley.w.peterson10

Apparently, there is an R-package for reading and accessing fasta files seqinr.

Once having the fasta as object, you it should be easy to apply your Perl routines.

ADD REPLYlink written 11 months ago by michael.ante3.6k
1
gravatar for yztxwd
11 months ago by
yztxwd380
Southern Medical University
yztxwd380 wrote:

A way by shell command, the idea is first get the ST index for each AX and BX sequence, then concatenate them according to the ST index.

test1.fa:

>A1
A1SEQUENCE
>A2
A2SEQUENCE
>A3
A3SEQUENCE

test2.fa:

>B1
B1SEQUENCE
>B2
B2SEQUENCE
>B3
B3SEQUENCE

test.csv (tab-delimited):

ST1 A1  B2
ST2 A2  B1
ST3 A2  B2
ST4 A3  B3

script:

# Join each fasta file with csv file to get the "ST" index of each sequence
join -t $'\t' -1 1 -2 2 -o 2.1,1.2 <(awk 'NR%2  {sub(/>/, "", $0); printf "%s\t", $0; next}1' test1.fa | sort -k1) <(sort -k2 test.csv) | sort -k3 > join.intermediate1
join -t $'\t' -1 1 -2 3  -o 2.1,1.2 <(awk 'NR%2  {sub(/>/, "", $0); printf "%s\t", $0; next}1' test2.fa | sort -k1) <(sort -k3 test.csv)  | sort -k3 > join.intermediate2
# Join the sequence with the same "ST" index
join -t $'\t' -j 1 join.intermediate1  join.intermediate2 | sed 's/^/>/' |sed 's/\t/\n/' | sed 's/\t//' > output.fa
# rm intermediate file
rm join.intermediate1 join.intermediate2

output.fa:

>ST1
A1SEQUENCEB2SEQUENCE
>ST2
A2SEQUENCEB1SEQUENCE
>ST3
A2SEQUENCEB2SEQUENCE
>ST4
A3SEQUENCEB3SEQUENCE

In addition, I agree with @michael.ante that you can also do this by python, for example, you can generate a dictionary with name and sequence as key and value, then concatenate the corresponding AX BX sequence for each STX. And it will be more readable, you will quickly forget what you did if you use shell like the above

ADD COMMENTlink modified 11 months ago • written 11 months ago by yztxwd380
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1006 users visited in the last hour