Question: Rename entries of file_1 using their corresponding ids in file_2
hosseinv (Australia) wrote, 5.2 years ago:

Hi,

I have two files as follows:

$ cat file_1.fas
>CHROM-g19-B-0001-66906-67533
ATTTGATTTCTCATGCTAAACATTTATTGGTG
>CHROM-g19-B-0010-143637-144790
TCTGTCGACGGCAACTGTGAAACTTATCAGTG
>CHROM-g19-B-0010-147754-150523
GCACCCTGAGCCGAACTGAATTCCTTGTGAT

$ cat file_2.txt
A00120 CHROM-g19-B-0001-66906-67533
A00122 CHROM-g19-B-0010-143637-144790
A00124 CHROM-g19-B-0010-145875-146742
A00125 CHROM-g19-B-0010-147754-150523

I need to rename the entries in file_1.fas with their corresponding ids in file_2.txt, to get the following:

$ cat file_3.fas
>A00120
ATTTGATTTCTCATGCTAAACATTTATTGGTG
>A00122
TCTGTCGACGGCAACTGTGAAACTTATCAGTG
>A00125
GCACCCTGAGCCGAACTGAATTCCTTGTGAT

 

NOTES:

In my real data, file_2.txt has some extra ids that cannot be found in file_1.fas. I don't need them, because there are no entries in file_1.fas for them to replace. An example is "A00124 CHROM-g19-B-0010-145875-146742" in file_2.txt.

Thank you for helping me on this post.

Hossein

unix rename • 2.1k views
modified 5.2 years ago by Pierre Lindenbaum • written 5.2 years ago by hosseinv

What have you tried? What programming language(s) do you know?

written 5.2 years ago by Emily_Ensembl

I'm still a beginner at scripting. I know a bit of shell and Perl.

written 5.2 years ago by hosseinv

If you're doing this with Perl or Python you'll want to look at reading the contents of `file_2` into a "hash" or "dictionary" data structure. Then as you loop through the `file_1` contents you can identify the header lines and then use them as "keys" to return the associated "value".

written 5.2 years ago by Matt Shirley
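
For readers who want to see the hash/dictionary idea spelled out, here is a minimal Python sketch. It is not code posted in the thread: the function names `load_id_map` and `rename_fasta` are made up for illustration, and leaving unmatched headers unchanged is an assumption (the question only covers headers that do have an id).

```python
def load_id_map(mapping_lines):
    """Build {old_name: new_id} from lines such as 'A00120 CHROM-...'."""
    id_map = {}
    for line in mapping_lines:
        fields = line.split()
        if len(fields) >= 2:  # skip blank or malformed lines
            new_id, old_name = fields[0], fields[1]
            id_map[old_name] = new_id
    return id_map


def rename_fasta(fasta_lines, id_map):
    """Yield FASTA lines, replacing each header by its id when one is known."""
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            old_name = line[1:].strip()
            # Assumption: headers without a mapping are kept as they are.
            yield ">" + id_map.get(old_name, old_name)
        else:
            yield line


# Example data copied from the question.
mapping = [
    "A00120 CHROM-g19-B-0001-66906-67533",
    "A00122 CHROM-g19-B-0010-143637-144790",
    "A00124 CHROM-g19-B-0010-145875-146742",
    "A00125 CHROM-g19-B-0010-147754-150523",
]
fasta = [
    ">CHROM-g19-B-0001-66906-67533",
    "ATTTGATTTCTCATGCTAAACATTTATTGGTG",
    ">CHROM-g19-B-0010-143637-144790",
    "TCTGTCGACGGCAACTGTGAAACTTATCAGTG",
    ">CHROM-g19-B-0010-147754-150523",
    "GCACCCTGAGCCGAACTGAATTCCTTGTGAT",
]

id_map = load_id_map(mapping)
renamed = list(rename_fasta(fasta, id_map))
print("\n".join(renamed))

# With the real files from the question, the same functions accept file handles:
# with open("file_2.txt") as m, open("file_1.fas") as f, open("file_3.fas", "w") as out:
#     out.writelines(line + "\n" for line in rename_fasta(f, load_id_map(m)))
```

Because the lookup is by name, the extra A00124 line in file_2.txt is simply never used, and the order of lines in file_2.txt does not matter.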
Sukhdeep Singh (Netherlands) wrote, 5.2 years ago:

Sorry, I am addicted to R, but you could do this faster and more efficiently using Perl/Python/Ruby/Shell etc.

Output

>A00120
ATTTGATTTCTCATGCTAAACATTTATTGGTG
>A00122
TCTGTCGACGGCAACTGTGAAACTTATCAGTG
>A00125
GCACCCTGAGCCGAACTGAATTCCTTGTGAT

modified 5.2 years ago • written 5.2 years ago by Sukhdeep Singh

Thanks Sukhdeep Singh,

I get an error at line 8, which might be because I have an older version of R.

In the end I did it more or less the way Pierre described.

Best

written 5.2 years ago by hosseinv

What's the error? The `match` function might be missing, or it might be a syntax error, but it should work fine. :)

written 5.2 years ago by Sukhdeep Singh

At line 9, it gives me the following warning:

Warning message:
In read.table(file = file, header = header, sep = sep, quote = quote,  :
  incomplete final line found by readTableHeader on 'file_2.txt'

At line 12, I have this error below:

Error in match(paste(">", b$V2, sep = ""), sub, nomatch = 0) : 
  'match' requires vector arguments

 

written 5.2 years ago by hosseinv

Solving the first problem might also solve the second. Just open file_2.txt in a text editor, go to the end of the last line, press ENTER, save the file and run the script again. It will work :)

written 5.2 years ago by Sukhdeep Singh

I edited the second file in a text editor, and the warning message is gone.

But line 12 still gives me the error:

Error in match(paste(">", b$V2, sep = ""), sub, nomatch = 0) : 
  'match' requires vector arguments
modified 5.2 years ago • written 5.2 years ago by hosseinv

Sorry, my bad, I forgot to add one line:

sub=a$V1[seq(1,nrow(a),by=2)]

This line subsets only the CHROM identifiers (the header lines). Without it, match could not find sub.
I will update my answer :)
written 5.2 years ago by Sukhdeep Singh

Thank you for modifying the script. This time the code ran with no error, yet the output is slightly different from what it should be. Here is the output produced by the code:

>A00120
ATTTGATTTCTCATGCTAAACATTTATTGGTG
>A00122
TCTGTCGACGGCAACTGTGAAACTTATCAGTG
>A00124
GCACCCTGAGCCGAACTGAATTCCTTGTGAT

whereas it should be like this:

>A00120
ATTTGATTTCTCATGCTAAACATTTATTGGTG
>A00122
TCTGTCGACGGCAACTGTGAAACTTATCAGTG
>A00125
GCACCCTGAGCCGAACTGAATTCCTTGTGAT

The issue is coming from line 15:

b=b[match(paste('>',b$V2,sep=''),sub,nomatch=0),]

Thanks again for help.

modified 5.2 years ago • written 5.2 years ago by hosseinv

You are right, I updated it!!

written 5.2 years ago by Sukhdeep Singh
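
The updated script is not shown in the thread, so the following is only a likely explanation of why A00124 appeared instead of A00125. R's `match(x, table)` returns, for each element of `x`, its position in `table`. In the line quoted above, the positions therefore refer to `sub`, yet they are used to index the rows of `b`. The Python sketch below reproduces that index arithmetic; `r_match` and `r_index` are hypothetical helpers that imitate R's 1-based `match` and row indexing, and swapping the two arguments is one possible fix, not necessarily the one that was applied.

```python
def r_match(x, table, nomatch=0):
    """Imitate R's match(x, table): 1-based position of each x in table."""
    first_pos = {}
    for pos, value in enumerate(table, start=1):
        first_pos.setdefault(value, pos)  # R reports the first match only
    return [first_pos.get(value, nomatch) for value in x]


def r_index(rows, positions):
    """Imitate R's rows[positions, ]: 1-based, zeros are silently dropped."""
    return [rows[p - 1] for p in positions if p != 0]


# b$V1 and b$V2 from file_2.txt
ids = ["A00120", "A00122", "A00124", "A00125"]
names = [
    "CHROM-g19-B-0001-66906-67533",
    "CHROM-g19-B-0010-143637-144790",
    "CHROM-g19-B-0010-145875-146742",
    "CHROM-g19-B-0010-147754-150523",
]
# sub: the header lines of file_1.fas
sub = [
    ">CHROM-g19-B-0001-66906-67533",
    ">CHROM-g19-B-0010-143637-144790",
    ">CHROM-g19-B-0010-147754-150523",
]
headers_b = [">" + name for name in names]  # paste('>', b$V2, sep='')

# As quoted: positions within `sub`, applied to the rows of b.
wrong = r_index(ids, r_match(headers_b, sub))  # positions [1, 2, 0, 3]
# Arguments swapped: positions within b, applied to the rows of b.
right = r_index(ids, r_match(sub, headers_b))  # positions [1, 2, 4]

print(wrong)  # ['A00120', 'A00122', 'A00124']
print(right)  # ['A00120', 'A00122', 'A00125']
```

With the arguments swapped, the result also follows the order of the headers in file_1.fas, which is what is needed when writing the ids back next to their sequences.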

THANK YOU, it works well now!

Best,

H

 

written 5.2 years ago by hosseinv

Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote, 5.2 years ago:

hints:

linearize the fasta file (one record per line: name, tab, sequence) and sort on the sequence name:

awk '/^>/ { printf("%s%s\t", (NR>1 ? "\n" : ""), substr($0,2)); next } { printf("%s",$0) } END { printf("\n") }' file_1.fas | sort -t $'\t' -k1,1

sort "file_2.txt" on the 2nd column, then use unix join to join both outputs

convert the output of join back to fasta using awk.

written 5.2 years ago by Pierre Lindenbaum
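
The linearize, sort and join steps can also be followed in a single language. The Python sketch below mirrors them under a few assumptions: the helper names `linearize` and `merge_join` are invented for illustration, sequence names are assumed to be unique in both files, and the first sequence is wrapped over two lines only to show what linearizing does. It imitates the logic of `sort` and `join`; it does not call those tools.

```python
def linearize(fasta_lines):
    """Collapse each FASTA record into one (name, sequence) pair."""
    records, name, chunks = [], None, []
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def merge_join(left, right):
    """Inner join of two lists of (key, value) pairs sorted on unique keys."""
    joined, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][0] == right[j][0]:
            joined.append((left[i][0], left[i][1], right[j][1]))
            i += 1
            j += 1
        elif left[i][0] < right[j][0]:
            i += 1  # name present only in the FASTA file
        else:
            j += 1  # name present only in the id table (e.g. A00124)
    return joined


fasta = [
    ">CHROM-g19-B-0001-66906-67533",
    "ATTTGATTTCTCATGC",  # wrapped for illustration
    "TAAACATTTATTGGTG",
    ">CHROM-g19-B-0010-143637-144790",
    "TCTGTCGACGGCAACTGTGAAACTTATCAGTG",
    ">CHROM-g19-B-0010-147754-150523",
    "GCACCCTGAGCCGAACTGAATTCCTTGTGAT",
]
mapping = [
    "A00120 CHROM-g19-B-0001-66906-67533",
    "A00122 CHROM-g19-B-0010-143637-144790",
    "A00124 CHROM-g19-B-0010-145875-146742",
    "A00125 CHROM-g19-B-0010-147754-150523",
]

sequences = sorted(linearize(fasta))                       # sort on name
table = sorted((old, new) for new, old in (m.split() for m in mapping))

output_lines = []
for _name, sequence, new_id in merge_join(sequences, table):
    output_lines.append(">" + new_id)                      # back to FASTA
    output_lines.append(sequence)
print("\n".join(output_lines))
```

One consequence of this approach is that the records come out in sorted name order rather than in the original file order. For the example data the two orders happen to be the same.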

Thank you Pierre,

I simply used the paste command and it's done.

Regards

written 5.2 years ago by hosseinv