Append sequence based on sequence ID
9.2 years ago
venu 7.1k

Hello,

I have two directories, each containing more than 8,000 protein FASTA files. Some file names occur in both directories (only the IDs are shared, not the sequences inside the files).

Example:

dir2 - Q9Y0E7_LEIDO.fasta
       Q9Y0A8_CRYPV.fasta
dir1 - Q9Y0E7_LEIDO.fasta
       Q9Y0A8_CRYPV.fasta

Now I want to append the sequence in each dir2 file to the file with the same name in dir1.

My effort was going nowhere. Here I am expecting a solution for this.

One important point: when appending to the first sequence, the header line of the second one should be left out.

python unix script perl • 2.2k views

"My effort was going nowhere. Here I am expecting a solution for this."

Always post what you have tried, so that people can quickly help you fix your command or code instead of writing it from scratch for you.

9.2 years ago
Ram 43k

Here's the approach I would take:

  1. Loop through the files in dir2
  2. Run tail -n +2 on each file and append the output to the same-named file in dir1

This should be easy enough to implement. You may have to check whether a same-named file exists in dir1 for each file you pick up in the loop, but that should be easy too.

I'd much prefer it if you tried coding this yourself. You won't need Perl or Python; this can be achieved using bash alone.

As a fail-safe, before you begin, chmod 400 all the files involved and print the output to stdout instead of appending. Only when you see everything working perfectly should you change the files back to the permissions they had earlier (or any convenient permission set) and run the version that actually changes the file content.
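A minimal sketch of such a dry run, assuming the files end in .fasta as in the question's example. preview_append is a hypothetical helper name, not something from the thread; it only prints what would be appended and never writes to any file.

```shell
# Dry run: show what would be appended, without modifying anything.
# preview_append is a hypothetical name; $1 is the source directory (dir2),
# $2 is the destination directory (dir1).
# The body is wrapped in ( ... ) so the variables stay local to the function.
preview_append() (
    src=$1
    dst=$2
    for f in "$src"/*.fasta; do
        [ -e "$f" ] || continue          # the glob matched nothing
        name=${f##*/}                    # strip the directory part
        # Only report files that also exist in the destination directory
        if [ -f "$dst/$name" ]; then
            printf '== %s\n' "$dst/$name"
            tail -n +2 "$f"              # everything except the header line
        fi
    done
)

# Usage: preview_append dir2 dir1
```

Each "==" line names the dir1 file that would receive the lines printed below it.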

9.2 years ago

Extending Ram's approach:

First keep a backup of your files and try:

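# List the file names present in both directories (Python 3), then loop over them.
# Assumes file names contain no whitespace.
for fasta in $(python3 -c 'import os; print(" ".join(set(os.listdir("dir1")) & set(os.listdir("dir2"))))')
do
    # Skip the header (line 1) of the dir2 file and append the rest to the dir1 file
    tail -n +2 "dir2/$fasta" >> "dir1/$fasta"
done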

dir1 and dir2 should be replaced with your own directory names.

If you also want the sequence IDs (header lines) to be appended, replace tail -n +2 with cat.
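The same append can be sketched in plain shell without the Python call. This is only an illustration under the assumption that the files end in .fasta; append_bodies is a hypothetical helper name, and it modifies the dir1 files in place, so keep the backup.

```shell
# append_bodies is a hypothetical name; $1 is the source directory (dir2),
# $2 is the destination directory (dir1). Files in $2 are modified in place.
# The body is wrapped in ( ... ) so the variables stay local to the function.
append_bodies() (
    src=$1
    dst=$2
    for f in "$src"/*.fasta; do
        [ -e "$f" ] || continue          # the glob matched nothing
        name=${f##*/}                    # strip the directory part
        [ -f "$dst/$name" ] || continue  # skip files with no same-named partner
        # Append everything except the header line; use cat "$f" to keep it
        tail -n +2 "$f" >> "$dst/$name"
    done
)

# Usage: append_bodies dir2 dir1
```

Quoting the file names keeps the loop safe even if a name contains spaces, which the word-splitting loop over the Python output does not handle.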


I did not want to give him code for obvious reasons.


It's working. Thank you.


But please try to do it in unix yourself. Otherwise you won't learn.
