How to concatenate multiple fasta file
4
3
Entering edit mode
6.3 years ago
fec2 ▴ 50

Hi,

I am using a pipeline called PanX for phylogenomic study. One of my aim is to build a core-genome phylogenomic tree. However, the output fasta files (~1100 gene files) for core-genome gene set are all separated by gene as below:

GeneA

>My_bacteriaA_GeneA 
atgatg

>My_bacteriaB_GeneA
atgaag

>My_bacteriaC_GeneA
atgatg

GeneB

>My_bacteriaB_GeneB
atggtc

>My_bacteriaC_GeneB
atggtc

>My_bacteriaA_GeneB
atggta

And on and on until...

GeneZ

>My_bacteriaA_GeneZ
atggta

>My_bacteriaC_GeneZ
atggta

>My_bacteriaB_GeneZ
atggtg

I wish to have a concatenated fasta file that combined every core-genes for each bacteria as below to build a phylogenomic tree:

>My_bacteriaA
atgatgatggta...atggta

>My_bacteriaB
atgaagatggtc...atggtg

>My_bacteriaC
atgatgatggtc...atggta

May I know how can I create the concatenated fasta file? Please note that the order of the gene sequences for different bacteria of each gene in each fasta file are random, not in a particular order.

Thank you!

sequence genome • 16k views
ADD COMMENT
1
Entering edit mode

Hi, SEDA (https://www.sing-group.org/seda) has an operation named "Concatenate sequences" (https://www.sing-group.org/seda/manual/operations.html#concatenate-sequences) aimed to do what you ask for. However, to use it, you need that the sequence names or identifiers to be identical, so you will need to perform a "Rename header" operation (https://www.sing-group.org/seda/manual/operations.html#rename-header) in order to split your sequence headers in "My_bacteriaA GeneA". Regards.

ADD REPLY
0
Entering edit mode

Hello fec2,

should the (GeneA), (GeneB) ... should be inserted in the concatenated sequence? I guess this will be a problem in further downstream analysis.

Are sequences always in one line or is this a multiline fasta?

fin swimmer

ADD REPLY
0
Entering edit mode

Hi, the (GeneA), (GeneB) .. shouldn't be inserted, I put it in my question just to show I want to combine all the gene sequences. I have removed it to avoid confusion. Thank you.

ADD REPLY
3
Entering edit mode
6.3 years ago
# split species and genes
$ for f in *.fa; do seqkit replace -p '^(.+)_(.+?)$' -r '$1 $2' $f > f.$f; done

$ seqkit seq -n f.GeneA.fa 
My_bacteriaA GeneA
My_bacteriaB GeneA
My_bacteriaC GeneA

# concatenate seqs by IDs
$ seqkit concat f.*.fa
>My_bacteriaA
atgatgatggtaatggta
>My_bacteriaB
atgaagatggtcatggtg
>My_bacteriaC
atgatgatggtcatggta
ADD COMMENT
0
Entering edit mode

@ shenwei356 is there an option to merge sequences in seqkit within a single fasta by ID?

ADD REPLY
0
Entering edit mode

Here's an example:

$ cat mix.fa 
>My_bacteriaA GeneA
a
>My_bacteriaA GeneB
c
>My_bacteriaB GeneA
g
>My_bacteriaB GeneB
t

Step 1. spliting by custom IDs

$ seqkit split --id-regexp "^.+ (.+?)$" -i mix.fa  -f -O out            
[INFO] split by ID. idRegexp: ^.+ (.+?)$
[INFO] read sequences ...
[INFO] read 4 sequences
[INFO] write 2 sequences to file: out/mix.id_GeneA.fa
[INFO] write 2 sequences to file: out/mix.id_GeneB.fa

$ cat out/mix.id_GeneA.fa 
>My_bacteriaA GeneA
a
>My_bacteriaB GeneA
g

Step 2. concat

$ seqkit concat out/*.fa
>My_bacteriaA
ac
>My_bacteriaB
gt
ADD REPLY
0
Entering edit mode

Hi @shenwei356 , I am encountering similar problem, where each isolate (T_122, T_123) has two regions of sequence of different lengths, but I want to merge according to its isolates, and wanting to split the sequence according to the region in the format of chr:start-end (ie JA2.1:22400-23038 and JA2.1:1673805-1675711). Could you kindly advise me how shd I amend "--id-regexp".

    $ cat Merged.fa
>T_122_JA2.1:22400-23038
AAA
>T_122_JA2.1:1673805-1675711
C
>T_123_JA2.1:22400-23038
GGG
>T_123_JA2.1:1673805-1675711
T

These are just two for example, but I hope it looks like this for about 100 isolates, so that I can submit to iqtree.

 >T_122 
 AAAC
 >T_123
 GGGT
ADD REPLY
0
Entering edit mode

Another common way:

  1. Split by isolates

     $ seqkit split --id-regexp "^(.+)_[\w\.]+:.+" --by-id-prefix "" -i merged.fa -O tmp --force
     [INFO] split by ID. idRegexp: ^(.+)_[\w\.]+:.+
     [INFO] read sequences ...
     [INFO] read 4 sequences
     [INFO] write 2 sequences to file: tmp/T_122.fa
     [INFO] write 2 sequences to file: tmp/T_123.fa
    
  2. concatenate all bases in each isolate

     mkdir out
    
     for f in tmp/*.fa; do
         iso=$(echo $f | sed -E 's/.+\///' | sed 's/.fa//')
         (echo ">$iso"; grep -v ">" $f) | seqkit seq > out/$iso.fasta
     done
    
    $ ls out/
    T_122.fasta  T_123.fasta

    $ cat out/*.fasta
    >T_122
    AAAC
    >T_123
    GGGT
ADD REPLY
0
Entering edit mode

@shenwei356 Thank you so much! Your prompt reply helped me a lot!

ADD REPLY
0
Entering edit mode

@shenwei356 My question is related to the one above (@fec2) but the headers are a bit more problematic. Is there a way to concatenate with SeqKit by taking into account only a part of the header (in this case, the acc. number)? The gene sequences are in different files. Thank you!

>ENA|MCFL01000433|MCFL01000433.1|F|SSU Extracted SSU sequence (7064 bp)
AACTTGTGCAGCAGCAAAAAAGGTGCAAAACTACAGCACCCAG...

>ENA|MCFL01000433|MCFL01000433.1|F|LSU Extracted LSU sequence (5892 bp)
CATATTAATAAGCGGAGGAAAAGAAACTAACAAGGATACCC...
ADD REPLY
2
Entering edit mode
--id-regexp '^[^\|]+\|[^\|]+\|([^\|]+)\|'
ADD REPLY
0
Entering edit mode

Great, thank you very much!

ADD REPLY
0
Entering edit mode

Is it also possible with SeqKit to concatenate based on the acc. number but keep (print) all the full names of the concatenated genes? E.g., AYX1000_SSU_gene & AYX1000_LSU_gene to: AYX1000_SSU_geneAYX1000_LSU_gene

ADD REPLY
2
Entering edit mode
6.3 years ago

What about this (based on your header description here)?

$ cut -f1 -d "-" *.fa|awk -v RS=">" -v FS="\n" -v OFS="\n" '{for(i=2; i<=NF; i++) {seq[$1] = seq[$1]$i}}; END {for(id in seq){print ">"id, seq[id]}}' > combined.fa
ADD COMMENT
0
Entering edit mode

Thank you very much! This is working.

ADD REPLY
0
Entering edit mode

Hi, I am having the same question as this one. But my headers look different, I tried your solution but it removes all of my headers. Would you be so kind to help me with it?

>Lineage-1-Name-L1
ATGCGGAAAGGGGTTGATCTCGTGACgGC....
>Lineage-2-y2007-018
ATGCGGAAAGGGGTTGATCTCGTGACgG...

Thanks a lot.

ADD REPLY
0
Entering edit mode
6.3 years ago

Hello,

here an awk solution, assuming that each sequence has only a single line:

cat *.fa|awk -v RS=">" -v FS="\n" -v OFS="\n" '$0 {match($1, /^(.+)_[^_]+$/, name); seq[name[1]] = seq[name[1]]$2}; END {for(id in seq){print ">"id, seq[id]}}' > combined.fa

fin swimmer

ADD COMMENT
0
Entering edit mode

Hi,

I got an error as below:

awk: line 1: syntax error at or near ,

May I know is it because my fasta file is not a single line fasta file?

Thank you.

ADD REPLY
0
Entering edit mode

Hello again,

no the syntax error is not due to multiline fasta. Maybe you missed something to copy from my command?

As I said, the command above doesn't work for multiline fasta. For this try this:

$ cat *.fa|awk -v RS=">" -v FS="\n" -v OFS="\n" '$0 {match($1, /^(.+)_[^_]+$/, name); for(i=2; i<=NF; i++) {seq[name[1]] = seq[name[1]]$i}}; END {for(id in seq){print ">"id, seq[id]}}' > combined.fa

fin swimmer

ADD REPLY
0
Entering edit mode

Hi, I got the same error, below is an example of my sequence ID:

>Planococcus_faecalis_AJ003-AJGP001_RS02220-1-thiol_reductant_ABC_exporter_subunit_CydC <unknown description>

May I know how to see whether the fasta file is multiline or single line?

Thank you.

ADD REPLY
0
Entering edit mode

fec2 for making zero length fasta files: try seqkit seq -w 0 input.fa

ADD REPLY
0
Entering edit mode
6.3 years ago

Assuming that sequences in files are in exact OP format: output:

$ awk -v RS=">" -v OFS="\t" '{print $1,$2}' *.fa | sed '/^\s*$/d;s/_\w\{5\}//2g' | datamash -sg 1 collapse 2 | sed 's/,//g' | awk -v OFS="\n" '{print ">"$1,$2}'

>My_bacteriaA
atgatgatggtaatggta
>My_bacteriaB
atgaagatggtcatggtg
>My_bacteriaC
atgatgatggtcatggta

input (from OP);

$ tail -n +1  *.fa
==> genea.fa <==
> My_bacteriaA_GeneA
atgatg
> My_bacteriaB_GeneA
atgaag
> My_bacteriaC_GeneA
atgatg

==> geneb.fa <==
> My_bacteriaB_GeneB
atggtc
> My_bacteriaC_GeneB
atggtc
> My_bacteriaA_GeneB
atggta

==> genez.fa <==
> My_bacteriaA_GeneZ
atggta
> My_bacteriaC_GeneZ
atggta
> My_bacteriaB_GeneZ
atggtg
ADD COMMENT
0
Entering edit mode

Thank you, but is not really working, there is no output file in my folder.

Here is an example of my input sequence ID:

>Planococcus_faecalis_AJ003-AJGP001_RS02220-1-thiol_reductant_ABC_exporter_subunit_CydC <unknown description>
ADD REPLY
0
Entering edit mode

Hello fec2,

Please use the formatting bar (especially the code option) to present your post better.

code_formatting

I cannot clearly see how your sequence ID looks like.

Thank you!

ADD REPLY
0
Entering edit mode

I am sorry I am new in this field and this forum. I have edited the comment above. Thank you.

ADD REPLY
0
Entering edit mode

hi fec2 could you please post few more headers? pattern can be improved if we have more fasta headers.

ADD REPLY
0
Entering edit mode

Hi, here is example of 1 gene file:

>Planococcus_faecalis_AJ003-AJGP001_RS02220-1-thiol_reductant_ABC_exporter_subunit_CydC <unknown description>
>Planococcus_kocurii_ATCC_43650-AUO94_RS06550-1-thiol_reductant_ABC_exporter_subunit_CydC <unknown description>
>Planococcus_antarcticus_DSM14505-BBH88_RS02420-1-thiol_reductant_ABC_exporter_subunit_CydC <unknown description>
>Planococcus_versutus_L10_15-I858_RS02180-1-thiol_reductant_ABC_exporter_subunit_CydC <unknown description>
>Planococcus_halocryophilus_DSM24743-BBI08_RS02205-1-thiol_reductant_ABC_exporter_subunit_CydC <unknown description>
>Planococcus_sp_PAMC_21323-PLANO_RS01740-1-thiol_reductant_ABC_exporter_subunit_CydC <unknown description>
>Planococcus_donghaensis_MPA1U2-GPDM_RS01230-63-cydC_thiol_reductant_ABC_exporter_subunit_CydC <unknown description>
>Planococcus_donghaensis_DSM22276-BCM40_RS02220-1-thiol_reductant_ABC_exporter_subunit_CydC <unknown description>
>Planococcus_sp_SCU63-DP120_RS12840-13-cydC_thiol_reductant_ABC_exporter_subunit_CydC <unknown description>
>Planococcus_sp_CAU13-KW93_RS09120-323-thiol_reductant_ABC_exporter_subunit_CydC <unknown description>
>Planococcus_maritimus_Y42-B0X71_RS16755-1-thiol_reductant_ABC_exporter_subunit_CydC <unknown description>
>Planococcus_maritimus_SAMP-BHE17_RS12170-31-thiol_reductant_ABC_exporter_subunit_CydC <unknown description>
>Planococcus_maritimus_MKU009-AY633_RS12345-32-thiol_reductant_ABC_exporter_subunit_CydC <unknown description>
ADD REPLY
0
Entering edit mode

Could you please explain how the bacterias are delimited from the genes?

ADD REPLY
0
Entering edit mode

The pipeline using NCBI genebank file as input.

The ID actually contain

>Bacteria species and strain name-locus tag of the gene-gene name

For example for

>Planococcus_faecalis_AJ003-AJGP001_RS02220-1-thiol_reductant_ABC_exporter_subunit_CydC <unknown description>

>Planococcus_faecalis_AJ003(Bacteria species and strain name)-AJGP001_RS02220(locus tag of the gene)-1-thiol_reductant_ABC_exporter_subunit_CydC(gene name) <unknown description>

Thank you.

ADD REPLY
0
Entering edit mode

Please show REAL example ASAP.

$ for f in *.fa; do seqkit replace -p '^(.+?)-(.+)$' -r '$1 $2' $f > f.$f; done

$ seqkit concat f.*.fa
ADD REPLY
0
Entering edit mode

Sorry for not showing the real example in the beginning. The command is working but the ID of each bacteria all become 'S1'. I can't differentiate them. May I know have I done anything wrong?

I also noticed that for each f.*.fa file, the multiple fasta are in different order, I guess it just removed the bacteria species and strain name. As I have raised in the beginning, all the multiple fasta sequence are in random order in each file. Will it be affected when I concatenate them?

ADD REPLY

Login before adding your answer.

Traffic: 1065 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6