How to concatenate two fasta files, using regular expressions
6.0 years ago
shadowstep • 0

* Thanks for your answers, my dear colleagues, it seems that I can't click that reply button *

I have 2 sets of FASTA sequences: 2 genes from each of 9 species. I put the sequences of one gene for all 9 species into one folder, and those of the other gene into another folder. Now I want to concatenate the two genes for each species, but the first line of each FASTA file looks like:

>HM357896.1 Persicaria lapathifolia voucher CPU:X. H. Meng 0945 ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit (rbcL)

or

>JF953049.1 Acorus calamus voucher WH1 maturase K (matK) gene, partial cds; chloroplast

I think regular expressions must be useful here, but how? Thank you.

UPDATE:: Sorry about my misleading description. To be specific: say I have five species (A, B, C, D, E) and two genes, rbcL and matK. For each species I have two sequences, one rbcL and one matK, so I have 10 sequences in total (5 x 2). I then combine all rbcL sequences (of the five species) into one FASTA file, say all_rbcL.fasta, and do the same for the matK sequences to make all_matK.fasta. However, the header lines of these sequences are messy: they do contain the species name and gene name, but along with a lot of other information.

How can I concatenate two genes together, and the species names must match each other?

UPDATE2:: (How could I enter code blocks?)

all_rbcL.fasta:
>sp1 rbcL
sequence
>sp2 rbcL
sequence
>sp3 rbcL
sequence
>sp4 rbcL
sequence
>sp5 rbcL
sequence

all_matK.fasta:
>sp1 matK
sequence
>sp2 matK
sequence
>sp3 matK
sequence
>sp4 matK
sequence
>sp5 matK
sequence

I mean something like this, and what I expected is:

concatenated.fasta:
>sp1 matK rbcL
sequence sequence
>sp2 matK rbcL
sequence sequence
>sp3 matK rbcL
sequence sequence
>sp4 matK rbcL
sequence sequence
>sp5 matK rbcL
sequence sequence

These two genes are both from the chloroplast. I want to use them to build a phylogenetic tree of those 9 species. Is that impossible or improper? I consulted a professor and he told me it is OK, but I would like to hear your opinions. Thank you.

fasta concatenate • 3.7k views

Please be specific about your requirement. Concatenation (joining to form a single file) can be done with a simple cat file1.fa file2.fa .. file9.fa > final_gene1.fa. If you want to take the actual sequences into account and make a non-redundant representation, that would be more complicated.


You've used an example with two headers here - are they the same gene? Or are they the same species? How are they relevant to your question?


"How can I concatenate two genes together, and the species names must match each other?"

That is still not very clear. Do you want something like this?

>sp1_rbcL
sequence
>sp1_matK
sequence
>sp2_rbcL
sequence
>sp2_matK
sequence

kamoulox


Concatenation usually refers to appending the lines of one file underneath those of another. From your example it seems you want both concatenation and pasting (= placing two files next to each other as columns).

I assume that your files are labelled somewhat systematically, so pasting the sequences of the two genes for the same species next to each other should be trivial, e.g.:

paste sp2.matk.fa sp2.rbcl.fa
>JF953049.1 Acorus calamus voucher WH1 maturase K (matK) gene, partial cds; chloroplast >KD1283900.1 Acorus calamus voucher CPU:X. H. Meng 0945 ribulos
accccgt agctagct

You could do this for every species, e.g. by using a for-loop over the species names, which are hopefully part of the FASTA file name.
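
If a shell loop feels awkward, the pairing step could be sketched in Python instead. This is only an illustration: it assumes the files follow a `<species>.<gene>.fa` naming scheme like `sp2.matk.fa` in the example above, and `group_by_species` is a made-up helper name, not part of any tool.

```python
from collections import defaultdict
from pathlib import Path


def group_by_species(filenames):
    """Map species label -> {gene label: filename}.

    Assumes names of the form <species>.<gene>.fa (e.g. sp2.matk.fa);
    anything that does not split into exactly three dot-separated parts
    is skipped.
    """
    groups = defaultdict(dict)
    for filename in filenames:
        parts = Path(filename).name.split(".")
        if len(parts) != 3:
            continue  # not <species>.<gene>.fa, ignore it
        species, gene, _extension = parts
        groups[species][gene] = filename
    return dict(groups)
```

Each entry of the returned dictionary then holds the two files to hand to `paste` (or to any other joining step) for one species.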

now, I assume that the resulting header is what you wanted the regex help for?

for my butchered example above, this could, for example, be solved this way:

paste sp2.matk.fa sp2.rbcl.fa | sed 's/^>[A-Z0-9\.]* \([A-Z][a-z]* [a-z]*\).*\(matK\).*\(ribulos\)/\1\t\2\t\3/g'
Acorus calamus  matK    ribulos
accccgt agctagct

This particular (largely untested) regex expects:

  • the header line, i.e. the line we want to alter, starts with ">" followed by a combination of capitalized letters and numbers and periods
  • species names are expected after that header line beginning (and an additional white space); they are always made up of two words, the first one starting with a capitalized letter; the two words are separated by a single space and they always precede the gene names
  • I spell out the gene names (matK, ribulos [I know this is not the real name, but it's what I had included in my example]) because I was too lazy to think up a clever general regex for them
  • the output returns the first pattern surrounded by ( ) [= the species name] followed by a tab [\t] followed by the second pattern in ( ) [=first gene name] followed by a tab [\t] followed by the third pattern [the second gene name]
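
For comparison, the same idea can be sketched with Python's `re` module. This is not a translation of the sed command: instead of spelling out the gene names, it assumes the gene symbol appears in parentheses, which is true for the two headers in the question but certainly not for every GenBank record. `HEADER_RE` and `parse_header` are hypothetical names.

```python
import re

# Accession, then a two-word species name, then (somewhere later)
# a gene symbol in parentheses, e.g. "(matK)".
HEADER_RE = re.compile(
    r"^>(?P<accession>[A-Z0-9.]+) "
    r"(?P<species>[A-Z][a-z]+ [a-z]+)"
    r".*?\((?P<gene>[A-Za-z0-9]+)\)"
)


def parse_header(line):
    """Return (species, gene) for a matching FASTA header, else None."""
    match = HEADER_RE.match(line)
    if match is None:
        return None
    return match.group("species"), match.group("gene")
```

Headers that do not fit the assumed layout return `None`, so they can be listed and fixed by hand rather than being silently mangled.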

You've explained your question well now, but I have to ask you - why do you want to do this? What is the ultimate aim? This seems like a convoluted procedure that has the end result of loss of meaningful information.

6.0 years ago
Hugo ▴ 380

Dear colleague, you can easily do this concatenation using the "Concatenate sequences" option of SEDA (http://www.sing-group.org/). See section "3.17 Concatenate sequences" of the user manual, available here: http://www.sing-group.org/seda/downloads/manuals/seda-user-manual-0.1.pdf

You simply have to select the two FASTA files and then use this option to concatenate sequences by 'Name'.

If you have any questions, please do not hesitate to contact me.

Regards,

Hugo.


"You simply have to select the two FASTA files and then use this option to concatenate sequences by 'Name'."

You should add a note if this is not going to work for a multi-fasta file.


Yes, SEDA works for multi-FASTA files.


You are automatically parsing the fasta headers? Nice.


Right, sequence headers are automatically parsed to distinguish the sequence identifier (the first set of characters up to the first blank space) from the additional information. These headers can then be used by different functions for different purposes.
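
For readers who want to see the general idea outside a GUI, here is a rough Python sketch of splitting each header at the first blank and joining sequences by identifier. It is not SEDA's code; the function names are invented for illustration, and it assumes each identifier occurs once per file (a repeated identifier would overwrite the earlier record).

```python
def read_fasta(text):
    """Parse FASTA text into {identifier: sequence}, keeping file order."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Identifier = everything after ">" up to the first whitespace.
            fields = line[1:].split(None, 1)
            name = fields[0] if fields else ""
            records[name] = ""
        elif name is not None:
            records[name] += line  # multi-line sequences are joined
    return records


def concatenate_by_name(first, second):
    """Join sequences sharing an identifier; refuse unmatched names."""
    if set(first) != set(second):
        unmatched = sorted(set(first) ^ set(second))
        raise ValueError(f"identifiers present in only one file: {unmatched}")
    return {name: first[name] + second[name] for name in first}


def to_fasta(records):
    """Write {identifier: sequence} back out as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records.items())
```

With the simplified headers from the question (`>sp1 matK`, `>sp1 rbcL`) the identifiers match directly. With raw GenBank headers the first token is the accession number, which differs between the two genes, so the headers would first have to be renamed to something shared, such as the species name.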


I'd recommend you create a Tool post exhibiting the SEDA tool, much like you've done for DEWE.


Sure, we will create a post as soon as we finish the implementation of some new features. Thank you!


I think your tool needs some more work as it apparently concatenates the home_sapiens sequence to the mus_musculus one ...

(A typo in the manual I assume ;) )


Right, it was a typo in the manual. We have just fixed it. Thank you.
