a hash with a pair of sequences as a value
2
0
Entering edit mode
6.0 years ago
natasha.sernova ★ 3.9k

Dear ALL!

It’s a trivial question, but I have not found anything similar in biostars.org

Probably my search was not good enough, sorry!

I have several hundreds of bacterial genomes. Let’s look at one of them.

All its chromosomal protein sequences are stored in NC_\d+.faa – fasta-file.

All its “pre-translation”sequences, CDS, are stored in *.ffn  fasta-file with the same

order of corresponding sequences and the same name of file, NC_\d+.ffn. (I consider NCBI now.)

Only file extensions are different. I need a tool to reach nucleotide sequence quickly, if I take some protein sequence. E-utilities didn’t help me. The best idea I have so far is to count entries in both fasta-files, make a hash, call a protein sequence a hash-key (in Perl) and the respective nucleotide, “pre-translation” sequence of the protein - the hash value. But I will lose the file-name in this case.

Another way is to take the file-name without extension as a hash-key and a pair of the sequences

of the same count-number as a hash-value. But I will have to put each pair of sequences

into the array, and make a reference to this array become the hash-value. It’s too complicated in my opinion. Please, give me a hint how to make the procedure shorter and easier? Python - advice will also be fine, but bio-python does not work with my genomes, I've already tried it. Many thanks!

Sincerely yours, Natasha

Perl hash python dictionary • 1.8k views
1
Entering edit mode

It is not clear whether you want to retrieve the nucleotide sequence using a protein sequence query, or a name based query. However, to directly answer your question, you can also have a hash of arrays, or a dictionary of lists, in both Perl and python. Using your second approach for example, you can use the file name as the key, and the value will be an array where the first element will be the protein sequence, and the second element will be the nucleotide sequence.

If this aproach works for you, try it. I can help you with the code if there is any syntax-specific problem.

Edit:

Alternatively, you can also have two separate hashes, the keys of both will be the file name, and one will have the protein sequence and one will have the nucleotide sequence.

Edit 2:

I was planning to add a reply to your comment, but don't have enough reputation to post. So here is what I had to say:

By 2-index arrays, do you mean a matrix? Yes, matrices (aka array of arrays) are also allowed in Perl. One of the key philosophies of Perl is not to have unnecessary restrictions on the size and shape of data structures. However, in this case, you don't need an array of arrays. What you need is a hash with 2-element arrays as values. I have posted a simple example based on bioperl as an answer. Please see if that helps.

Cheers

0
Entering edit mode

Dear Tej,

Exactly, I was not clear here. I want to retrieve the nucleotide sequence using a protein sequence query, providing the corresponding key. The key is in the header of my nucleotide fasta-file, as well as my protein file, just extetion is different . The key is actually the file name without extention, you were right. I don't want to make a database of nucleotide sequences and search inside it - I would like to have both sequences (protein and nucleotide ones) together somehow, I don't know a fast approach to get the nucleotide sequence when I chose a protein with a key in its header just after > sign. I would use "the file name as the key, and the value will be an array where the first element will be the protein sequence, and the second element will be the nucleotide sequence", it's a very nice suggestion. Does Perl tolerate 2-index arrays? I have some doubts. But Python does. The code hints will be greatly appreciated! Many thanks! Natasha

3
Entering edit mode
5.9 years ago
Prakki Rama ★ 2.5k

I am assuming your fasta sequences to be in one line format. Try this in command line:

$ls nucl.ffn prot.faa  ---input--- $ head *
==> nucl.ffn <==
>test1_nucl
ACGTTAGCCCATGGCGATGAATGACCTGT
>test2_nucl
GATGAATGACCTGTACGTTAGCCCATGGC

==> prot.faa <==
>test1_prot
QWETREQE
>test2_prot
YUIPTYRTYR


Run this command:

$cat * | paste - - | sed 's/_/\t#/g' | sort -k1,1 -k2,2r | sed 's/\t#/_/g' | tr '\t' '\n'  ---output--- >test1_prot QWETREQE >test1_nucl ACGTTAGCCCATGGCGATGAATGACCTGT >test2_prot YUIPTYRTYR >test2_nucl GATGAATGACCTGTACGTTAGCCCATGGC  You may have many fasta files (both prot and nucl), and based on the header name, the above one-liner will sort and display the output. EDIT 2: If your input is like this, where the corresponding protein is not found in the nucleotide file (or vice-versa): ---input--- ==> nucl.ffn <== >test1_nucl ACGTTAGCCCATGGCGATGAATGACCTGT >test2_nucl GATGAATGACCTGTACGTTAGCCCATGGC >test3_nucl ATCGTGACTGATCG ==> prot.faa <== >test1_prot QWETREQE >test2_prot YUIPTYRTYR  You can try running the following commands in bash using process substitution <(): fgrep -h -A1 -f <(fgrep -h '>' * | sed 's/_.*//g' | sort | uniq -d) * | sed 's/--//g' | sed '/^$/d' | paste - - | sed 's/_/\t#/g' | sort -k1,1 -k2,2r | sed 's/\t#/_/g' | tr '\t' '\n' #this will find if both nucl and prot exist and print only those fasta sequences.


---output---

>test1_prot
QWETREQE
>test1_nucl
ACGTTAGCCCATGGCGATGAATGACCTGT
>test2_prot
YUIPTYRTYR
>test2_nucl
GATGAATGACCTGTACGTTAGCCCATGGC

0
Entering edit mode

GREAT! To find them would be one more problem... You kindly showed how to solve it.

THOUSAND THANKS!

Sincerely yours,

Natasha

0
Entering edit mode

Dear Prakki, I would like to extract nucleotide sequences based on their header. NO BLAST, it was used in the previous stages.

I will try your one-liners. THANK YOU!

Sincerely yours, Natasha

2
Entering edit mode
6.0 years ago
Tej Sowpati ▴ 250

Hi Natasha,

Please see the following code, which takes 2 files, one .faa and one .ffn file, and prints the protein and DNA sequences. You can build on this for multiple files.

  1 #! /usr/bin/perl
2 use strict;
3 use Bio::SeqIO;
4 use Getopt::Long;
5
6 my ($protein_file,$dna_file);
7 GetOptions( 'protein=s' => \$protein_file, 8 'dna=s' => \$dna_file   );
9
10 my %hash;
11
12 # Get the protein sequence
13 my $in = Bio::SeqIO->new( -file =>$protein_file,
14                             -format => 'fasta'  );
15 my $seq =$in->next_seq();
16 my $key =$seq->id;
17 $hash{$key}[0] = $seq->seq; 18 19 # Get the DNA sequence 20$in = Bio::SeqIO->new(  -file => $dna_file, 21 -format => 'fasta' ); 22$seq = $in->next_seq(); 23$key = $seq->id; 24$hash{$key}[1] =$seq->seq;
25
26 # Accessing the sequences. Let's say the sequence name is 'test'
27 my $name = 'test'; 28 print "The protein sequence is:$hash{$name}[0]\n"; # prints the protein sequence 29 print "The DNA sequence is:$hash{$name}[1]\n"; # prints the DNA sequence  Note: This needs bioperl to run. I would sincerely recommend using bioperl instead of coding your own FASTA parser. I should also mention that the key used to store the sequences is the sequence name specified in the header, not the file name. The input files I have used to test are: protein.faa: >test MAHALAIWKANCLA  DNA.ffn: >test ATGCGATGGACGTGATCGATAGAT  usage Assuming you saved the code in a file called new.pl: $ perl new.pl -protein protein.faa -dna DNA.ffn


Edit: changed the file names for clarity.

Let me know if this works for you.

Cheers,

TEJ

0
Entering edit mode

Dear Tej!

Thank you very much for your help! I will try your code today or tomorrow and see what I have. It looks very nice but I may forget something essential... Thousand thanks!

My best, Natasha

0
Entering edit mode

Dear Tej!

Will it work with fasta-files with multiple > signs? This is the reason why I came to the idea about hashes and database of nucleotide sequences... A bacterial genome has several thousand proteins. I have about a hundred protein sequences selected from each genome. It's a relatively large protein fasta-file for each genome. The problem is to find the corresponding nucleotide sequences in the same order. That is why I need such a complicated name for each sequence. If I split all my multiple fastas into single sequences I will have 50 thousand single sequences... And no way to return to the initial sequence order. - The order comes from protein file. The task looks much worse now, doesn't it? Do you have any idea how to deal with it? I hope you do have. THANK YOU VERY MUCH! Sincerely yours, Natasha

0
Entering edit mode

Dear Tej,

Probably I've asked a wrong question. It consisted of two or three different ones.

Let's again consider one genome. I know, that I will need protens 23, 57, 487. 543, 634, etc.

I mean their order-number from the very beginning of the genome. Their GI-numbers will not

work, I am afraid - I've still not learnt how to convert them to the corresponding CDS.

So I know their order numbers as headers and their sequences in *.faa-file.

It's some work, but OK. I can take these 300 proteins from *.faa-file and make a separate small fasta1_prot.fa file. The order of respective nucleotide sequences is the same in ffn-file from NCBI. So I can make a small nucleotide fasta-file with 300 corresponding sequences. If I run your script, I suppose, I will have a final file with 300 protein entries and then 300 nucleotide entries, fasta1_nucl.ffn. Will you script choose the first protein seq and the first nucl seq, then the second potein seq and the second nucl seq, etc up to 300th protein-nucleotide pair? I will need this, but I am very much afraid of the first result. It's my fault, I didn't tell you about multiple fastas.

INPUT:

prot.faa

>test1_prot
QWETREQE
>test2_prot
YUIPTYRTYR


nucl.ffn

>test1_nucl
ACGTTAGCCCATGGCGATGAATGACCTGT
>test2_nucl
GATGAATGACCTGTACGTTAGCCCATGGC


I expect the following output:

>test1_prot
QWETREQE
>test1_nucl
ACGTTAGCCCATGGCGATGAATGACCTGT
>test2_prot
YLIPTYRTYR
>test2_nucl
GATGAATGACCTGTACGTTAGCCCATGGC


Will I have it? If not, how to ask your script give it to me? You've mentioned multiple sequences. Did you mean multiple fasta-files or just many sequences?

I've run your program. It works very well, but it requires to give it the header of both sequences. But if I have many headers, how to convince your program to work? I have several thousand pairs. I will have to use multiple fastas. Will I have to give 300 headers to your program? Many thousands single pairs - it's forever... Please, help me to use multiple fastas! MANY-MANY THANKS!!!

You wrote, "You can build on this for multiple files.". What did you mean? Did I misunderstand you? I see the only solution - specific selection from multiple fastas for each genome. Is that wrong? THANK YOU!

Sincerely yours, Natasha

1
Entering edit mode

Dear Natasha,

I was away for the weekend and didn't see your comment. If Prakki's solution works for you, that's great. If you still need the code to be changed for multi-entry fasta file, let me know.

Cheers,

TEJ

0
Entering edit mode

Dear Tej,

Prakki's solution looks very nice! The only thing I worry about is that I am supposed to do sometinhg in command line.

I need to write a very long script for more then 100 genomes...

As an example I used E.coli faa- and ffn-files, I modified them as your program requires.

file.faa

>NC_000913|1
MKRISTTITTTITITTGNGAG
>NC_000913|2
FAELLTGLAAAQPGFPLAQLKTFVDQEFAQIKHVLHGISLLGQCPDSINAALICRGEKMSIAIMAGVLEA
LIKNTGNPQAPGTLIGASRDEDELPVKGISNLNNMAMFSVSGPGMKGMVGMAARVFAAMSRARISVVLIT
QSSSEYSISFCVPQSDCVRAERAMQEEFYLELKEGLLEPLAVTERLAIISVVGDGMRTLRGISAKFFAAL
ARANINIVAIAQGSSERSISVVVNNDDATTGVRVTHQMLFNTDQVIEVFVIGVGGVGGALLEQLKRQQSW
LKNKHIDLRVCGVANSKALLTNVHGLNLENWQEELAQAKEPFNLGRLIRLVKEYHLLNPVIVDCTSSQAV
IEPVLPAEFNAEGDVAAFMANLSQLDDLFAARVAKARDEGKVLRYVGNIDEDGVCRVKIAEVDGNDPLFK
>NC_000913|3
RFCQELGKQIPVAMTLEKNMPIGSGLGSSACSVVAALMAMNEHCGKPLNDTRLLALMGELEGRISGSIHY
DNVAPCFLGGMQLMIEENDIISQQVPGFDEWLWVLAYPGIKVSTAEARAILPAQYRRQDCIAHGRHLAGF
IHACYSRQPELAAKLMKDVIAEPYRERLLPGFRQARQAVAEIGAVASGISGSGPTLFALCDKPETAQRVA
DWLGKNYLQNQEGFVHICRLDTAGARVLEN


file.ffn

I hope with Prakki's suggestion I will have all pairs of protein and nucleotide sequences as an output for each genome.

But this is a global solution.

Prakki's next idea is wonderful, since I always have only protein sequences with modified headers. I will have to seach

for the corresponding nucleotide sequence in the database, made by makeblastdb out of all the modified-header nucleotide sequences. By the way, is it a good idea?

I don't want to make a huge hash for each whole bacterial genome, I will need just about 10 % out of each such a hash.

How to include the search procedure into my script? Let's suppose I take a unique modified protein header. I will need to seach for its nucleotide pair, actually its pretranslation sequence. Is it possible to make it automatically, just by script, without command line involment? I expect 30 to 40 thousand proteins. I will need a nucleotide pair for each protein.

Is it possible to reach the goal or that is just my dream? Both of you have already done a great job, I feel one more miracle I would like to have will remain to be a miracle...

I am going to take all my bacterial genomes. Modify their headers as you suggested for both faa and ffn-files.

And then I will have to select only the proteins I need and search for their nucleotide pairs. Tne nucleotide ffn-database with modified file headers - is it OK for this kind of seach? Or I can return somehow to your nice hashes?

And I will have to search in the database by grep, taking the selected protein headers. I will have to enumerate all the proteins I need nucleotide pairs to and search by script only, no command line.

Does it make any sense? THANK YOU!!!

Sincerely yours, Natasha

1
Entering edit mode

If you want to extract sequences based on header, my above one-liners will work. If you want to find out corresponding nucleotide sequence for a protein based on sequence similarity, even if it is (>100,000 sequences), tblastn can do that for you with in little bit time.

0
Entering edit mode

Dear Prakki, I would like to extract nucleotide sequences based on their header. NO BLAST, it was used in the previous stages.

I will try your one-liners as I 've said above.. THANK YOU!

Sincerely yours, Natasha

1
Entering edit mode

Yes, you can also search for the headers in the script. Hash based approach will be fast, but with a larger memory footprint. Alternatively, you can begin with a list of protein names, and then iterate over your DNA files to retrieve the corresponding sequences. To do that, you will need to wrap the fasta reading code in a while loop. Is that what you are looking for?

0
Entering edit mode

Dear Tej,

OK, I will begin with a list of protein names, and then iterate over my DNA files inside a hude database to retrieve the corresponding sequences . I hope it will work. THANK YOU!

Sincerely yours, Natasha