Rename entries of file_1 using their corresponding ids in file_2
9.9 years ago
hosseinv ▴ 20

Hi,

I have two files as follows:

$ cat file_1.fas
>CHROM-g19-B-0001-66906-67533
ATTTGATTTCTCATGCTAAACATTTATTGGTG
>CHROM-g19-B-0010-143637-144790
TCTGTCGACGGCAACTGTGAAACTTATCAGTG
>CHROM-g19-B-0010-147754-150523
GCACCCTGAGCCGAACTGAATTCCTTGTGAT

$ cat file_2.txt
A00120 CHROM-g19-B-0001-66906-67533
A00122 CHROM-g19-B-0010-143637-144790
A00124 CHROM-g19-B-0010-145875-146742
A00125 CHROM-g19-B-0010-147754-150523

I need to rename the entries in file_1.fas with their corresponding ids in file_2.txt, to get the following:

$ cat file_3.fas
>A00120
ATTTGATTTCTCATGCTAAACATTTATTGGTG
>A00122
TCTGTCGACGGCAACTGTGAAACTTATCAGTG
>A00125
GCACCCTGAGCCGAACTGAATTCCTTGTGAT

NOTES:

In my real data, file_2.txt contains additional ids that cannot be found in file_1.fas. I don't need them, because there are no matching entries in file_1.fas to rename. An example is A00124 CHROM-g19-B-0010-145875-146742 in file_2.txt.

Thank you for helping me on this post.

Hossein

unix rename • 3.7k views

What have you tried? What programming language(s) do you know?


I'm still a beginner at scripting. I know a bit of shell and Perl.


If you're doing this with Perl or Python you'll want to look at reading the contents of `file_2` into a "hash" or "dictionary" data structure. Then as you loop through the `file_1` contents you can identify the header lines and then use them as "keys" to return the associated "value".
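The dictionary approach described above could look roughly like this in Python. It is a minimal sketch that assumes file_2.txt is whitespace-separated with the new id in column 1 and the old header in column 2, as in the question. The function names `load_id_map` and `rename_fasta` are made up for illustration, and leaving unmapped headers unchanged is an assumption, not something the question specifies.

```python
def load_id_map(lines):
    """Build {old_header: new_id} from lines such as 'A00120 CHROM-g19-...'."""
    id_map = {}
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:  # skip blank or malformed lines
            new_id, old_header = fields[0], fields[1]
            id_map[old_header] = new_id
    return id_map


def rename_fasta(fasta_lines, id_map):
    """Yield FASTA lines, replacing each header that has an entry in id_map."""
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            old_header = line[1:].strip()
            # Assumption: headers without a mapping are kept unchanged.
            yield ">" + id_map.get(old_header, old_header)
        else:
            yield line


# Small in-memory demo using two records from the question.
mapping_lines = [
    "A00120 CHROM-g19-B-0001-66906-67533",
    "A00124 CHROM-g19-B-0010-145875-146742",
    "A00125 CHROM-g19-B-0010-147754-150523",
]
fasta_lines = [
    ">CHROM-g19-B-0001-66906-67533",
    "ATTTGATTTCTCATGCTAAACATTTATTGGTG",
    ">CHROM-g19-B-0010-147754-150523",
    "GCACCCTGAGCCGAACTGAATTCCTTGTGAT",
]
for out_line in rename_fasta(fasta_lines, load_id_map(mapping_lines)):
    print(out_line)
```

With real files, the same two functions can be fed open file handles for file_2.txt and file_1.fas, writing each yielded line to file_3.fas. Extra ids in file_2.txt, such as A00124, are simply never looked up.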

9.9 years ago

Sorry, I am addicted to R, but you could do this faster and more efficiently using Perl/Python/Ruby/Shell etc.

Output

>A00120
ATTTGATTTCTCATGCTAAACATTTATTGGTG
>A00122
TCTGTCGACGGCAACTGTGAAACTTATCAGTG
>A00125
GCACCCTGAGCCGAACTGAATTCCTTGTGAT

Thanks Sukhdeep Singh,

I get an error at line 8, might be because I have an older version of R.

I ended up doing it roughly the way Pierre suggested.

Best


What's the error? The `match` function might be missing, or it might be a syntax error, but it should work fine. :)

ADD REPLY
0
Entering edit mode

At line 9, it gives me the following warning:

Warning message:
In read.table(file = file, header = header, sep = sep, quote = quote,  :
  incomplete final line found by readTableHeader on 'file_2.txt'

At line 12, I have this error below:

Error in match(paste(">", b$V2, sep = ""), sub, nomatch = 0) : 
  'match' requires vector arguments

Solving the first problem might solve the second. Just open file_2.txt in a text editor, go to the end of the last line, press ENTER, save it and run the script again. It will work :)


I edited the second file in a text editor, and the warning message is gone.

But line 12 still gives me the error:

Error in match(paste(">", b$V2, sep = ""), sub, nomatch = 0) : 
  'match' requires vector arguments

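Sorry, my bad, I forgot to add one line:

sub=a$V1[seq(1,nrow(a),by=2)]

This line subsets only the CHROM identifiers (the header lines). Without it, `sub` was never defined as a vector, so `match` was handed R's built-in `sub` function instead, hence the "requires vector arguments" error.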

I will update my answer :)


Thank you for modifying the script. This time the code ran without errors, yet the output is slightly different from what it should be. Here is the output produced by the code (please note the third entry):

>A00120
ATTTGATTTCTCATGCTAAACATTTATTGGTG
>A00122
TCTGTCGACGGCAACTGTGAAACTTATCAGTG
>A00124
GCACCCTGAGCCGAACTGAATTCCTTGTGAT

whereas it should be like this:

>A00120
ATTTGATTTCTCATGCTAAACATTTATTGGTG
>A00122
TCTGTCGACGGCAACTGTGAAACTTATCAGTG
>A00125
GCACCCTGAGCCGAACTGAATTCCTTGTGAT

The issue is coming from line 15:

b=b[match(paste('>',b$V2,sep=''),sub,nomatch=0),]

Thanks again for your help.


You are right, I updated it!!
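The updated R script is not shown in this thread, so the following is only a guess at why the quoted line picked A00124. In `match(x, table)`, the returned positions refer to `table` (here the FASTA headers), yet the quoted line uses them to index the rows of `b` (the file_2.txt table). The Python sketch below emulates that behaviour; `r_match` and `r_index` are hypothetical helpers that mimic R's 1-based `match(..., nomatch=0)` and R's dropping of zero indices, and swapping the arguments is an assumed fix, not necessarily the one the answer's author applied.

```python
def r_match(x, table):
    """Mimic R's match(x, table, nomatch=0): 1-based position in table, 0 if absent."""
    return [table.index(value) + 1 if value in table else 0 for value in x]


def r_index(rows, indices):
    """Mimic R indexing with non-negative indices: zeros are silently dropped."""
    return [rows[i - 1] for i in indices if i != 0]


# Headers taken from file_1.fas (the `sub` vector in the R comment).
fasta_headers = [
    ">CHROM-g19-B-0001-66906-67533",
    ">CHROM-g19-B-0010-143637-144790",
    ">CHROM-g19-B-0010-147754-150523",
]
# The two columns of file_2.txt, with '>' pasted onto the second one.
new_ids = ["A00120", "A00122", "A00124", "A00125"]
old_headers = [
    ">CHROM-g19-B-0001-66906-67533",
    ">CHROM-g19-B-0010-143637-144790",
    ">CHROM-g19-B-0010-145875-146742",
    ">CHROM-g19-B-0010-147754-150523",
]

# Quoted line: positions within fasta_headers are used to index file_2 rows.
buggy = r_index(new_ids, r_match(old_headers, fasta_headers))

# Assumed fix: look up each FASTA header in file_2 instead.
fixed = r_index(new_ids, r_match(fasta_headers, old_headers))

print(buggy)  # ['A00120', 'A00122', 'A00124']
print(fixed)  # ['A00120', 'A00122', 'A00125']
```

The buggy lookup yields positions 1, 2, 0, 3, and after the zero is dropped those select the first three rows of file_2.txt, which matches the wrong output reported above.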


THANK YOU, it works well now!

Best,
H

9.9 years ago

hints:

linearize the fasta file and sort on the sequence name:

awk '/^>/ { printf("%s%s\t%s\t", (n++ ? "\n" : ""), $0, substr($1, 2)); next } { printf("%s", $0) } END { printf("\n") }' file_1.fas | sort -t "$(printf '\t')" -k2,2

sort "file_2.txt" on the 2nd column, then use unix join to join both outputs.

convert the output of join back to fasta using awk.
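For readers who prefer a single script, the same linearize, sort, join and convert-back steps can be sketched in Python. This is not the awk/sort/join pipeline itself, only an illustration of the same idea. The function names are hypothetical, and the sketch assumes each old name appears at most once in either file.

```python
def linearize(fasta_lines):
    """Collapse each FASTA record into one (name, sequence) pair."""
    records, name, chunks = [], None, []
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            parts = line[1:].split()
            name, chunks = (parts[0] if parts else ""), []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def merge_join(records, mapping):
    """Sort both inputs on the old name and walk them in step, like sort + join.

    records: list of (old_name, sequence); mapping: list of (new_id, old_name).
    Assumes old names are unique on both sides.
    """
    records = sorted(records, key=lambda rec: rec[0])
    mapping = sorted(mapping, key=lambda pair: pair[1])
    joined, i, j = [], 0, 0
    while i < len(records) and j < len(mapping):
        name, seq = records[i]
        new_id, old_name = mapping[j]
        if name == old_name:
            joined.append((new_id, seq))
            i += 1
            j += 1
        elif name < old_name:
            i += 1  # record with no mapping: skipped, as join would do
        else:
            j += 1  # mapping with no record (e.g. A00124): skipped
    return joined


def to_fasta(joined):
    """Convert (new_id, sequence) pairs back to FASTA text."""
    return "".join(f">{new_id}\n{seq}\n" for new_id, seq in joined)
```

As with unix join, the output comes out in sorted key order rather than the original order of file_1.fas, and entries without a partner on the other side are dropped.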


Thank you Pierre,

I simply used the paste command and it's done.

Regards
