Question: How to concatenate two fasta files, using regular expressions
3 months ago by
shadowstep0
shadowstep0 wrote:

* Thanks for your answers, my dear colleagues, it seems that I can't click that reply button *

I have two sets of FASTA sequences: two genes from each of 9 species. I put the sequences of one gene (all 9 species) into one folder, and those of the other gene into another folder. Now I want to concatenate the two genes for each species, but the first line of each FASTA file looks like:

>HM357896.1 Persicaria lapathifolia voucher CPU:X. H. Meng 0945 ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit (rbcL)

or

>JF953049.1 Acorus calamus voucher WH1 maturase K (matK) gene, partial cds; chloroplast

I think regular expressions would be useful here, but how? Thank you.

UPDATE:: Sorry about my misleading description. To be specific: say I have five species (A, B, C, D, E) and two genes, rbcL and matK. For each species I have two sequences, one per gene, so I have 10 sequences in total (5 x 2). I then combine all rbcL sequences (of the five species) into one FASTA file, say all_rbcL.fasta, and do the same for the matK sequences to make all_matK.fasta. However, the header lines of these sequences are messy: they do contain the species name and the gene name, but along with a lot of other information.

How can I concatenate two genes together, and the species names must match each other?

UPDATE2:: (How could I enter code blocks?)

all_rbcL.fasta:
>sp1 rbcL
sequence
>sp2 rbcL
sequence
>sp3 rbcL
sequence
>sp4 rbcL
sequence
>sp5 rbcL
sequence

all_matK.fasta:
>sp1 matK
sequence
>sp2 matK
sequence
>sp3 matK
sequence
>sp4 matK
sequence
>sp5 matK
sequence

I mean something like this, and what I expected is:

concatenated.fasta:
>sp1 matK rbcL
sequence sequence
>sp2 matK rbcL
sequence sequence
>sp3 matK rbcL
sequence sequence
>sp4 matK rbcL
sequence sequence
>sp5 matK rbcL
sequence sequence
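
One way to get from the two input files to the expected output above is to key every record on the first word of its header and join the sequences that share a key. What follows is only a minimal Python sketch under that assumption, not a tested pipeline: the function names `read_fasta` and `concatenate_by_key` are made up for illustration, the short sequences are placeholder data, and it only works as-is for simplified headers such as `>sp1 rbcL`, where the first word is the species label.

```python
def read_fasta(lines):
    """Parse FASTA lines into {first header word: [rest of header, sequence]}."""
    records = {}
    key = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # ">sp1 rbcL" -> key "sp1", label "rbcL"
            parts = line[1:].split(None, 1) or [""]
            key = parts[0]
            label = parts[1] if len(parts) > 1 else ""
            records[key] = [label, ""]
        elif key is not None:
            # Sequence lines may be wrapped, so append them to the current record.
            records[key][1] += line
    return records


def concatenate_by_key(first, second):
    """Join records sharing a key; keys missing from either input are skipped."""
    out = []
    for key, (label1, seq1) in first.items():
        if key not in second:
            continue
        label2, seq2 = second[key]
        out.append(">" + " ".join(p for p in (key, label1, label2) if p))
        out.append(seq1 + seq2)
    return out


# Placeholder data standing in for all_matK.fasta and all_rbcL.fasta.
matk = read_fasta(">sp1 matK\nACGT\n>sp2 matK\nGGCC\n".splitlines())
rbcl = read_fasta(">sp1 rbcL\nTTAA\n>sp2 rbcL\nCCGG\n".splitlines())
print("\n".join(concatenate_by_key(matk, rbcl)))
```

With real files, `read_fasta(open("all_matK.fasta"))` would take the place of the inline strings. Species present in only one file are silently dropped here, which may or may not be what you want for a phylogenetic alignment.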

Both genes are from the chloroplast. I want to concatenate them in order to build a phylogenetic tree of those 9 species. Is that impossible or improper? I consulted a professor, who told me it is OK, but I would like to hear your opinions. Thank you.

fasta concatenate • 343 views
ADD COMMENTlink modified 3 months ago • written 3 months ago by shadowstep0

Please be specific about your requirement. Concatenation (joining to form a single file) can be done with a simple cat file1.fa file2.fa .. file9.fa > final_gene1.fa. If you want to take the actual sequences into account and make a non-redundant representation, then that would be more complicated.

ADD REPLYlink modified 3 months ago • written 3 months ago by genomax52k

You've used an example with two headers here - are they the same gene? Or are they the same species? How are they relevant to your question?

ADD REPLYlink written 3 months ago by Ram16k

How can I concatenate two genes together, and the species names must match each other?

That is still not very clear. You want something like this

>sp1_rbcL
sequence
>sp1_matK
sequence
>sp2_rbcL
sequence
>sp2_matK
sequence
ADD REPLYlink written 3 months ago by genomax52k

kamoulox

ADD REPLYlink written 3 months ago by Pierre Lindenbaum110k

Concatenation usually refers to adding lines underneath each other. From your example it seems that you want both concatenation and pasting (= adding two columns next to each other).

I assume that your files are labelled somewhat systematically, so pasting the sequences of the different genes for the same species next to each other should be trivial, e.g.:

paste sp2.matk.fa sp2.rbcl.fa 
>JF953049.1 Acorus calamus voucher WH1 maturase K (matK) gene, partial cds; chloroplast >KD1283900.1 Acorus calamus voucher CPU:X. H. Meng 0945 ribulos
accccgt agctagct

You could do this for every species, e.g. with a for-loop over the species names, which are hopefully part of the FASTA file names.
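
Such a loop mostly comes down to pairing up the files that belong together. Here is a rough Python sketch of that pairing step, assuming the hypothetical naming scheme `<species>.<gene>.fa` suggested by the `sp2.matk.fa` example above; the function name `pair_gene_files` is invented for illustration, and file names with extra dots would need a different split.

```python
import os


def pair_gene_files(filenames, genes=("matk", "rbcl")):
    """Group '<species>.<gene>.fa' names by species, keeping complete sets only."""
    found = {}
    for name in sorted(filenames):
        parts = name.split(".")
        if len(parts) != 3 or parts[2] != "fa":
            continue  # not of the assumed '<species>.<gene>.fa' form
        species, gene = parts[0], parts[1].lower()
        if gene in genes:
            found.setdefault(species, {})[gene] = name
    # Keep only species that have one file for every requested gene.
    return {
        species: tuple(by_gene[g] for g in genes)
        for species, by_gene in found.items()
        if all(g in by_gene for g in genes)
    }


# Loop over the species found in the current directory.
for species, (matk_file, rbcl_file) in pair_gene_files(os.listdir(".")).items():
    print(species, matk_file, rbcl_file)
```

Each pair could then be handed to `paste` or to whatever joins the sequences.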

Now, I assume that the resulting header is what you wanted the regex help for?

For my butchered example above, this could, for example, be solved this way:

paste sp2.matk.fa sp2.rbcl.fa | sed 's/^>[A-Z0-9\.]* \([A-Z][a-z]* [a-z]*\).*\(matK\).*\(ribulos\)/\1\t\2\t\3/g'
Acorus calamus  matK    ribulos
accccgt agctagct

This particular (largely untested) regex expects:

  • the header line, i.e. the line we want to alter, starts with ">" followed by a combination of capital letters, digits and periods
  • the species name comes right after that beginning of the header (separated from it by a single space); it is always made up of two words, the first one starting with a capital letter; the two words are separated by a single space and they always precede the gene names
  • I spell out the gene names (matK, ribulos [I know this is not the real name, but it's what I had included in my example]) because I was too lazy to think up a clever general regex for them
  • the output returns the first pattern surrounded by ( ) [= the species name] followed by a tab [\t, GNU sed syntax] followed by the second pattern in ( ) [= the first gene name] followed by a tab [\t] followed by the third pattern [= the second gene name]; since the leading ">" is part of the match, it is dropped from the output
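
The same header parsing can be sketched in Python. This is a hedged alternative to the sed command above, not a translation of it: instead of spelling out the gene names, it assumes that the gene symbol is the first thing in parentheses, which holds for the two example headers in the question but is not guaranteed for GenBank headers in general. The names `HEADER_RE` and `parse_header` are made up for illustration.

```python
import re

# ">" + accession, then a two-word species name (capitalized genus,
# lower-case epithet), then the first parenthesised word as the gene symbol.
HEADER_RE = re.compile(
    r"^>(?P<accession>\S+)\s+"
    r"(?P<species>[A-Z][a-z]+ [a-z]+)\b"
    r".*?\((?P<gene>[A-Za-z0-9]+)\)"
)


def parse_header(header):
    """Return (species, gene) for a matching header line, else None."""
    match = HEADER_RE.match(header)
    if match is None:
        return None
    return match.group("species"), match.group("gene")


header = (">JF953049.1 Acorus calamus voucher WH1 maturase K (matK) gene, "
          "partial cds; chloroplast")
print(parse_header(header))
```

Headers that do not fit the pattern return `None`, so they can be listed and checked by hand rather than silently mismatched.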
ADD REPLYlink modified 3 months ago • written 3 months ago by Friederike1.8k

You've explained your question well now, but I have to ask you - why do you want to do this? What is the ultimate aim? This seems like a convoluted procedure that has the end result of loss of meaningful information.

ADD REPLYlink written 3 months ago by Ram16k
3 months ago by
Hugo110
Universidade de Vigo, Ourense (Spain)
Hugo110 wrote:

Dear colleague, you can easily do this concatenation by using the "Concatenate sequences" option of SEDA http://www.sing-group.org/ . See section "3.17 Concatenate sequences" of the user manual available here: http://www.sing-group.org/seda/downloads/manuals/seda-user-manual-0.1.pdf

You simply have to select the two FASTA files and then use this option to concatenate sequences by 'Name'.

If you have any questions, please do not hesitate to contact me.

Regards,

Hugo.

ADD COMMENTlink modified 3 months ago by Istvan Albert ♦♦ 77k • written 3 months ago by Hugo110

You simply have to select the two FASTA files and then use this option to concatenate sequences by 'Name'.

You should add a note if this is not going to work for a multi-fasta file.

ADD REPLYlink written 3 months ago by genomax52k

Yes, SEDA works for multi-FASTA files.

ADD REPLYlink written 3 months ago by Hugo110

You are automatically parsing the fasta headers? Nice.

ADD REPLYlink modified 3 months ago • written 3 months ago by genomax52k

Right, sequence headers are automatically parsed to distinguish the sequence identifier (the first set of characters, up to the first blank space) from the additional information. These headers can then be used by different functions for different purposes.
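
The identifier rule described here (everything up to the first blank space) can be illustrated with a few lines of Python. This is only a sketch of the rule as stated, not SEDA's actual code, and `split_header` is a hypothetical name.

```python
def split_header(header):
    """Split a FASTA header into (identifier, description) at the first whitespace."""
    text = header.lstrip(">")
    parts = text.split(None, 1)
    identifier = parts[0] if parts else ""
    description = parts[1].strip() if len(parts) > 1 else ""
    return identifier, description


print(split_header(">JF953049.1 Acorus calamus voucher WH1 maturase K (matK) "
                   "gene, partial cds; chloroplast"))
```

For headers like the ones in the question, the identifier under this rule is the accession number, which differs between the rbcL and matK records of the same species, so matching by identifier only pairs the right sequences once the headers have been renamed to a shared species label.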

ADD REPLYlink written 3 months ago by Hugo110

I'd recommend you create a Tool post exhibiting the SEDA tool, much like you've done for DEWE.

ADD REPLYlink written 3 months ago by Ram16k

Sure, we will create a post as soon as we finish the implementation of some new features. Thank you!

ADD REPLYlink written 3 months ago by Hugo110

I think your tool needs some more work as it apparently concatenates the home_sapiens sequence to the mus_musculus one ...

(A typo in the manual I assume ;) )

ADD REPLYlink written 3 months ago by lieven.sterck1.9k

Right, it was a typo in the manual. We have just fixed it. Thank you.

ADD REPLYlink written 3 months ago by Hugo110