Question

Rename entries of file_1 using their corresponding ids in file_2

0

10.8 years ago

hosseinv ▴ 20

Hi,

I have two files as follows:

$ cat file_1.fas
>CHROM-g19-B-0001-66906-67533
ATTTGATTTCTCATGCTAAACATTTATTGGTG
>CHROM-g19-B-0010-143637-144790
TCTGTCGACGGCAACTGTGAAACTTATCAGTG
>CHROM-g19-B-0010-147754-150523
GCACCCTGAGCCGAACTGAATTCCTTGTGAT

$ cat file_2.txt
A00120 CHROM-g19-B-0001-66906-67533
A00122 CHROM-g19-B-0010-143637-144790
A00124 CHROM-g19-B-0010-145875-146742
A00125 CHROM-g19-B-0010-147754-150523

I need to rename the entries in file_1.fas with their corresponding ids from file_2.txt, to get the following:

$ cat file_3.fas
>A00120
ATTTGATTTCTCATGCTAAACATTTATTGGTG
>A00122
TCTGTCGACGGCAACTGTGAAACTTATCAGTG
>A00125
GCACCCTGAGCCGAACTGAATTCCTTGTGAT

NOTES:

In my real data, file_2.txt contains some extra ids that cannot be found in file_1.fas. I don't need them, because there are no entries in file_1.fas for them to replace. An example is A00124 CHROM-g19-B-0010-145875-146742 in file_2.txt.

Thank you for helping me on this post.

Hossein

unix rename • 4.3k views

ADD COMMENT • link updated 3.4 years ago by Ram 44k • written 10.8 years ago by hosseinv ▴ 20

1

What have you tried? What programming language(s) do you know?

ADD REPLY • link 10.8 years ago by Emily 24k

0

I'm still a beginner at scripting. I know a bit of shell and Perl.

ADD REPLY • link 10.8 years ago by hosseinv ▴ 20

0

If you're doing this with Perl or Python you'll want to look at reading the contents of `file_2` into a "hash" or "dictionary" data structure. Then as you loop through the `file_1` contents you can identify the header lines and then use them as "keys" to return the associated "value".
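
A minimal Python sketch of the dictionary approach described above. The function name `rename_fasta` is hypothetical, and leaving unmatched headers unchanged is an assumption made for illustration; neither is specified in the thread.

```python
def rename_fasta(fasta_lines, id_lines):
    """Yield FASTA lines with each header replaced by its id from the mapping."""
    # Build the lookup from file_2: old header name (column 2) -> new id (column 1).
    new_id = {}
    for line in id_lines:
        fields = line.split()
        if len(fields) >= 2:
            new_id[fields[1]] = fields[0]

    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            old = line[1:].strip()
            # Assumption: headers with no mapping are passed through unchanged.
            yield ">" + new_id.get(old, old)
        else:
            yield line


FASTA = """\
>CHROM-g19-B-0001-66906-67533
ATTTGATTTCTCATGCTAAACATTTATTGGTG
>CHROM-g19-B-0010-143637-144790
TCTGTCGACGGCAACTGTGAAACTTATCAGTG
>CHROM-g19-B-0010-147754-150523
GCACCCTGAGCCGAACTGAATTCCTTGTGAT
"""

IDS = """\
A00120 CHROM-g19-B-0001-66906-67533
A00122 CHROM-g19-B-0010-143637-144790
A00124 CHROM-g19-B-0010-145875-146742
A00125 CHROM-g19-B-0010-147754-150523
"""

for out_line in rename_fasta(FASTA.splitlines(), IDS.splitlines()):
    print(out_line)
```

Ids in `file_2` that have no entry in `file_1` (such as A00124) are simply never looked up, so they do not appear in the output.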

ADD REPLY • link 10.8 years ago by Matt Shirley 10k

0

10.8 years ago

Pierre Lindenbaum 165k

hints:

linearize the FASTA file (one record per line: name, tab, sequence) and sort on the sequence name:

awk '/^>/ { printf("%s%s\t", (NR > 1 ? "\n" : ""), substr($0, 2)); next } { printf("%s", $0) } END { printf("\n") }' file_1.fas | sort -t "$(printf '\t')" -k1,1

sort "file_2.txt" on the 2nd column, then use unix join to join both outputs.

convert the output of join back to FASTA using awk.
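
For readers who prefer a script, the linearize / sort / join / convert steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the awk/join pipeline itself: the helper names `linearize`, `merge_join` and `to_fasta` are hypothetical, and names are assumed to be unique in both inputs.

```python
def linearize(fasta_lines):
    """Collapse each FASTA record into a (name, sequence) pair."""
    records = []
    name, chunks = None, []
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif line and name is not None:
            chunks.append(line)  # multi-line sequences are concatenated
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def merge_join(records, id_pairs):
    """Sort both inputs on the old name and join them, like sort + join.

    records:  (old_name, sequence) pairs
    id_pairs: (new_id, old_name) pairs, i.e. the column order of file_2.txt
    Assumes each old_name occurs at most once in each input.
    """
    left = sorted(records)
    right = sorted((old, new) for new, old in id_pairs)
    joined, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][0] == right[j][0]:
            joined.append((right[j][1], left[i][1]))
            i += 1
            j += 1
        elif left[i][0] < right[j][0]:
            i += 1  # sequence with no id: skipped, as join does by default
        else:
            j += 1  # id with no sequence: skipped
    return joined


def to_fasta(joined):
    """Turn (new_id, sequence) pairs back into FASTA text."""
    return "".join(f">{new_id}\n{seq}\n" for new_id, seq in joined)


records = linearize([">b", "AC", "GT", ">a", "TT"])
print(to_fasta(merge_join(records, [("X1", "a"), ("X2", "c"), ("X3", "b")])), end="")
```

As with sort and join, the records come out ordered by the old name rather than in their original file order.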

ADD COMMENT • link updated 5.1 years ago by Ram 44k • written 10.8 years ago by Pierre Lindenbaum 165k

0

Thank you Pierre,

I simply used the paste command and it's done.

Regards

ADD REPLY • link 10.8 years ago by hosseinv ▴ 20

Ram · Accepted Answer · 2014-05-13

2

10.8 years ago

Sukhi Singh 11k

Sorry, I am addicted to R, but you could do this faster and more efficiently using Perl/Python/Ruby/Shell etc.

Output

>A00120
ATTTGATTTCTCATGCTAAACATTTATTGGTG
>A00122
TCTGTCGACGGCAACTGTGAAACTTATCAGTG
>A00125
GCACCCTGAGCCGAACTGAATTCCTTGTGAT

ADD COMMENT • link updated 5.1 years ago by Ram 44k • written 10.8 years ago by Sukhi Singh 11k

0

Thanks Sukhdeep Singh,

I get an error at line 8, which might be because I have an older version of R.

In the end I did it more or less the way Pierre described.

Best

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 10.8 years ago by hosseinv ▴ 20

0

What's the error? The `match` function might be missing, or it might be a syntax error, but it should work fine. :)

ADD REPLY • link 10.8 years ago by Sukhi Singh 11k

0

At line 9, it gives me the following warning:

Warning message:
In read.table(file = file, header = header, sep = sep, quote = quote,  :
  incomplete final line found by readTableHeader on 'file_2.txt'

At line 12, I have this error below:

Error in match(paste(">", b$V2, sep = ""), sub, nomatch = 0) : 
  'match' requires vector arguments

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 10.8 years ago by hosseinv ▴ 20

0

Solving the first error might solve the second. Just open file_2.txt in a text editor, go to the end of the last line, press ENTER, save it, and run the script again. It will work :)
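
The same fix (making sure the file ends with a newline) can also be scripted. A small Python sketch, where the function name `ensure_trailing_newline` is hypothetical:

```python
import os


def ensure_trailing_newline(path):
    """Append a newline to the file at `path` if its last byte is not one.

    Returns True if the file was modified, False otherwise.
    """
    with open(path, "rb+") as handle:
        handle.seek(0, os.SEEK_END)
        if handle.tell() == 0:
            return False  # empty file: nothing to fix
        handle.seek(-1, os.SEEK_END)
        if handle.read(1) == b"\n":
            return False  # already ends with a newline
        handle.seek(0, os.SEEK_END)
        handle.write(b"\n")
        return True
```

Calling `ensure_trailing_newline("file_2.txt")` before reading the file in R should have the same effect as the manual edit, since the `incomplete final line` warning is about the missing newline at the end of the file.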

ADD REPLY • link 10.8 years ago by Sukhi Singh 11k

0

I edited the second file in a text editor, and the warning message is gone.

But line 12 still gives me the error:

Error in match(paste(">", b$V2, sep = ""), sub, nomatch = 0) : 
  'match' requires vector arguments

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 10.8 years ago by hosseinv ▴ 20

0

Sorry, my bad, I forgot to add one line:

sub=a$V1[seq(1,nrow(a),by=2)]

The line above subsets only the CHROM identifiers (the header lines). `match` failed because the variable `sub` had not been defined.

I will update my answer :)

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 10.8 years ago by Sukhi Singh 11k

0

Thank you for modifying the script. This time the code ran with no errors, yet the output is slightly different from what it should be. Here is the output produced by the code (please note the third entry):

>A00120
ATTTGATTTCTCATGCTAAACATTTATTGGTG
>A00122
TCTGTCGACGGCAACTGTGAAACTTATCAGTG
>A00124
GCACCCTGAGCCGAACTGAATTCCTTGTGAT

whereas it should be like this:

>A00120
ATTTGATTTCTCATGCTAAACATTTATTGGTG
>A00122
TCTGTCGACGGCAACTGTGAAACTTATCAGTG
>A00125
GCACCCTGAGCCGAACTGAATTCCTTGTGAT

The issue is coming from line 15:

b=b[match(paste('>',b$V2,sep=''),sub,nomatch=0),]
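
One plausible reading of why this line picks A00124, sketched in Python. The helpers `match` and `take_rows` are hypothetical stand-ins for R's `match(x, table, nomatch=0)` and row indexing, and the swapped-argument variant is an assumption about a possible fix, since the updated R answer is not shown in the thread.

```python
def match(x, table):
    """Rough analogue of R's match(x, table, nomatch=0): for each element of x,
    the 1-based position of its first occurrence in table, or 0 if absent."""
    first = {}
    for pos, value in enumerate(table, start=1):
        first.setdefault(value, pos)
    return [first.get(value, 0) for value in x]


def take_rows(rows, positions):
    """Mimic R's rows[positions, ]: 1-based indexing where 0 is dropped."""
    return [rows[p - 1] for p in positions if p != 0]


ids = ["A00120", "A00122", "A00124", "A00125"]            # b$V1
pasted = [                                                # paste('>', b$V2, sep='')
    ">CHROM-g19-B-0001-66906-67533",
    ">CHROM-g19-B-0010-143637-144790",
    ">CHROM-g19-B-0010-145875-146742",
    ">CHROM-g19-B-0010-147754-150523",
]
sub = [                                                   # FASTA header lines
    ">CHROM-g19-B-0001-66906-67533",
    ">CHROM-g19-B-0010-143637-144790",
    ">CHROM-g19-B-0010-147754-150523",
]

# As written: the positions refer to `sub`, but they are used to index `b`.
wrong = take_rows(ids, match(pasted, sub))
# Arguments swapped: the positions now refer to rows of `b`.
right = take_rows(ids, match(sub, pasted))
print(wrong)
print(right)
```

With the original argument order the positions are `[1, 2, 0, 3]`, which index into `sub` rather than `b`, so the first three rows of `b` are kept regardless of which ids actually matched.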

Thanks again for help.

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 10.8 years ago by hosseinv ▴ 20

1

You are right, I updated it!!

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 10.8 years ago by Sukhi Singh 11k

0

THANK YOU, it works well now!

Best,
H

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 10.8 years ago by hosseinv ▴ 20