Rename entries of file_1 using their corresponding ids in file_2
8.7 years ago
hosseinv ▴ 20

Hi,

I have two files as follows:

$ cat file_1.fas
>CHROM-g19-B-0001-66906-67533
ATTTGATTTCTCATGCTAAACATTTATTGGTG
>CHROM-g19-B-0010-143637-144790
TCTGTCGACGGCAACTGTGAAACTTATCAGTG
>CHROM-g19-B-0010-147754-150523
GCACCCTGAGCCGAACTGAATTCCTTGTGAT

$ cat file_2.txt
A00120 CHROM-g19-B-0001-66906-67533
A00122 CHROM-g19-B-0010-143637-144790
A00124 CHROM-g19-B-0010-145875-146742
A00125 CHROM-g19-B-0010-147754-150523


I need to rename the entries in file_1.fas with their corresponding ids in file_2.txt, to get the following:

$ cat file_3.fas
>A00120
ATTTGATTTCTCATGCTAAACATTTATTGGTG
>A00122
TCTGTCGACGGCAACTGTGAAACTTATCAGTG
>A00125
GCACCCTGAGCCGAACTGAATTCCTTGTGAT

NOTES: In my real data, file_2.txt has some more ids that cannot be found in file_1.fas, and I don't need them either, because there are no entries in file_1.fas to be replaced. An example is A00124 CHROM-g19-B-0010-145875-146742 in file_2.txt.

Thank you for helping me on this post.

Hossein

What have you tried? What programming language(s) do you know?

I'm still at the beginning of scripting. I know a bit of shell and Perl.

If you're doing this with Perl or Python you'll want to look at reading the contents of file_2 into a "hash" or "dictionary" data structure. Then as you loop through the file_1 contents you can identify the header lines and use them as "keys" to return the associated "value".

8.7 years ago

Sorry, I am addicted to R, but you could do this faster and more efficiently using Perl/Python/Ruby/Shell etc.

Output

>A00120
ATTTGATTTCTCATGCTAAACATTTATTGGTG
>A00122
TCTGTCGACGGCAACTGTGAAACTTATCAGTG
>A00125
GCACCCTGAGCCGAACTGAATTCCTTGTGAT

Thanks Sukhdeep Singh, I get an error at line 8, which might be because I have an older version of R. I've done it somehow like the way Pierre wrote. Best

What's the error? The match function might be missing, or it might be a syntax error, but it should work fine. :)

At line 9, it gives me the following warning:

Warning message: In read.table(file = file, header = header, sep = sep, quote = quote, : incomplete final line found by readTableHeader on 'file_2.txt'

At line 12, I have this error below:

Error in match(paste(">", b$V2, sep = ""), sub, nomatch = 0) : 'match' requires vector arguments

Solving the first error might solve the second. Just open file_2.txt in a text editor, go to the end of the last line and press ENTER, save it and run the script again; it will work :)


I edited the second file in a text editor, and the warning message is gone.

But line 12 still gives me the error:

Error in match(paste(">", b$V2, sep = ""), sub, nomatch = 0) : 'match' requires vector arguments

Sorry, my bad, I forgot to add one line:

sub=a$V1[seq(1,nrow(a),by=2)]


The line above subsets only the chrom identifiers (every other row of a, starting with the first). Without it, match couldn't find sub.

I will update my answer :)


Thank you for modifying the script. This time the code ran with no error, yet the output is slightly different from what it should be. Here is the output of the code (please note the third entry):

>A00120
ATTTGATTTCTCATGCTAAACATTTATTGGTG
>A00122
TCTGTCGACGGCAACTGTGAAACTTATCAGTG
>A00124
GCACCCTGAGCCGAACTGAATTCCTTGTGAT


whereas it should be like this:

>A00120
ATTTGATTTCTCATGCTAAACATTTATTGGTG
>A00122
TCTGTCGACGGCAACTGTGAAACTTATCAGTG
>A00125
GCACCCTGAGCCGAACTGAATTCCTTGTGAT


The issue is coming from line 15:

b=b[match(paste('>',b$V2,sep=''),sub,nomatch=0),]

Thanks again for help.

You are right, I updated it!!

THANK YOU, it works well now! Best, H

8.7 years ago

hints:

linearize the fasta file, sort on the sequence:

awk -F '\t' '/^>/ { printf("\n%s\t%s",$0,$1);next;} { printf("%s",$0);} END { printf("\n");}' | sort -t $'\t' -k2,2

sort "file_2.txt" on the 2nd column

use unix join to join both outputs

convert the output of join back to fasta using awk.
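The hints above (linearize, join on the identifier, convert back to FASTA) can also be sketched in Python, using the dictionary approach suggested in the comments. This is only a minimal sketch, not the code from any of the answers: the function names `load_id_map` and `rename_fasta` are hypothetical, the sample data is copied from the question, and keeping unmatched headers unchanged is an assumption about the desired behaviour.

```python
def load_id_map(mapping_lines):
    """Build {old_name: new_id} from lines like 'A00120 CHROM-g19-B-0001-66906-67533'."""
    id_map = {}
    for line in mapping_lines:
        fields = line.split()
        if len(fields) >= 2:  # skip blank or malformed lines
            new_id, old_name = fields[0], fields[1]
            id_map[old_name] = new_id
    return id_map


def rename_fasta(fasta_lines, id_map):
    """Return FASTA lines with each header replaced by its new id, if one exists."""
    renamed = []
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            old_name = line[1:].strip()
            # Assumption: headers with no entry in the mapping are kept as they are.
            renamed.append(">" + id_map.get(old_name, old_name))
        else:
            renamed.append(line)
    return renamed


# Sample data copied from the question (file_2.txt and file_1.fas).
file_2 = [
    "A00120 CHROM-g19-B-0001-66906-67533\n",
    "A00122 CHROM-g19-B-0010-143637-144790\n",
    "A00124 CHROM-g19-B-0010-145875-146742\n",
    "A00125 CHROM-g19-B-0010-147754-150523\n",
]
file_1 = [
    ">CHROM-g19-B-0001-66906-67533\n",
    "ATTTGATTTCTCATGCTAAACATTTATTGGTG\n",
    ">CHROM-g19-B-0010-143637-144790\n",
    "TCTGTCGACGGCAACTGTGAAACTTATCAGTG\n",
    ">CHROM-g19-B-0010-147754-150523\n",
    "GCACCCTGAGCCGAACTGAATTCCTTGTGAT\n",
]

id_map = load_id_map(file_2)
renamed = rename_fasta(file_1, id_map)
print("\n".join(renamed))
```

With real files, the two lists could be replaced by open file handles, for example `load_id_map(open("file_2.txt"))`. Because the lookup goes from each FASTA header to its id, extra ids in file_2.txt such as A00124 are simply never used, and neither file needs to be sorted.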


Thank you Pierre,

I simply used the paste command and it's done.

Regards