How to concatenate multiple fasta file
4
2
Entering edit mode
3.2 years ago
fec2 ▴ 40

Hi,

I am using a pipeline called PanX for phylogenomic study. One of my aim is to build a core-genome phylogenomic tree. However, the output fasta files (~1100 gene files) for core-genome gene set are all separated by gene as below:

GeneA

>My_bacteriaA_GeneA
atgatg

>My_bacteriaB_GeneA
atgaag

>My_bacteriaC_GeneA
atgatg


GeneB

>My_bacteriaB_GeneB
atggtc

>My_bacteriaC_GeneB
atggtc

>My_bacteriaA_GeneB
atggta


And on and on until...

GeneZ

>My_bacteriaA_GeneZ
atggta

>My_bacteriaC_GeneZ
atggta

>My_bacteriaB_GeneZ
atggtg


I wish to have a concatenated fasta file that combined every core-genes for each bacteria as below to build a phylogenomic tree:

>My_bacteriaA
atgatgatggta...atggta

>My_bacteriaB
atgaagatggtc...atggtg

>My_bacteriaC
atgatgatggtc...atggta


May I know how can I create the concatenated fasta file? Please note that the order of the gene sequences for different bacteria of each gene in each fasta file are random, not in a particular order.

Thank you!

sequence genome • 6.3k views
1
Entering edit mode

Hi, SEDA (https://www.sing-group.org/seda) has an operation named "Concatenate sequences" (https://www.sing-group.org/seda/manual/operations.html#concatenate-sequences) aimed to do what you ask for. However, to use it, you need that the sequence names or identifiers to be identical, so you will need to perform a "Rename header" operation (https://www.sing-group.org/seda/manual/operations.html#rename-header) in order to split your sequence headers in "My_bacteriaA GeneA". Regards.

0
Entering edit mode

Hello fec2,

should the (GeneA), (GeneB) ... should be inserted in the concatenated sequence? I guess this will be a problem in further downstream analysis.

Are sequences always in one line or is this a multiline fasta?

fin swimmer

0
Entering edit mode

Hi, the (GeneA), (GeneB) .. shouldn't be inserted, I put it in my question just to show I want to combine all the gene sequences. I have removed it to avoid confusion. Thank you.

2
Entering edit mode
3.2 years ago
# split species and genes
$for f in *.fa; do seqkit replace -p '^(.+)_(.+?)$' -r '$1$2' $f > f.$f; done

$seqkit seq -n f.GeneA.fa My_bacteriaA GeneA My_bacteriaB GeneA My_bacteriaC GeneA # concatenate seqs by IDs$ seqkit concat f.*.fa
>My_bacteriaA
atgatgatggtaatggta
>My_bacteriaB
atgaagatggtcatggtg
>My_bacteriaC
atgatgatggtcatggta

0
Entering edit mode

@ shenwei356 is there an option to merge sequences in seqkit within a single fasta by ID?

0
Entering edit mode

Here's an example:

$cat mix.fa >My_bacteriaA GeneA a >My_bacteriaA GeneB c >My_bacteriaB GeneA g >My_bacteriaB GeneB t  Step 1. spliting by custom IDs $ seqkit split --id-regexp "^.+ (.+?)$" -i mix.fa -f -O out [INFO] split by ID. idRegexp: ^.+ (.+?)$
[INFO] write 2 sequences to file: out/mix.id_GeneA.fa
[INFO] write 2 sequences to file: out/mix.id_GeneB.fa

$cat out/mix.id_GeneA.fa >My_bacteriaA GeneA a >My_bacteriaB GeneA g  Step 2. concat $ seqkit concat out/*.fa
>My_bacteriaA
ac
>My_bacteriaB
gt

0
Entering edit mode

@shenwei356 My question is related to the one above (@fec2) but the headers are a bit more problematic. Is there a way to concatenate with SeqKit by taking into account only a part of the header (in this case, the acc. number)? The gene sequences are in different files. Thank you!

>ENA|MCFL01000433|MCFL01000433.1|F|SSU Extracted SSU sequence (7064 bp)
AACTTGTGCAGCAGCAAAAAAGGTGCAAAACTACAGCACCCAG...

>ENA|MCFL01000433|MCFL01000433.1|F|LSU Extracted LSU sequence (5892 bp)
CATATTAATAAGCGGAGGAAAAGAAACTAACAAGGATACCC...

2
Entering edit mode
--id-regexp '^[^\|]+\|[^\|]+\|([^\|]+)\|'

0
Entering edit mode

Great, thank you very much!

0
Entering edit mode

Is it also possible with SeqKit to concatenate based on the acc. number but keep (print) all the full names of the concatenated genes? E.g., AYX1000_SSU_gene & AYX1000_LSU_gene to: AYX1000_SSU_geneAYX1000_LSU_gene

2
Entering edit mode
3.2 years ago

$cut -f1 -d "-" *.fa|awk -v RS=">" -v FS="\n" -v OFS="\n" '{for(i=2; i<=NF; i++) {seq[$1] = seq[$1]$i}}; END {for(id in seq){print ">"id, seq[id]}}' > combined.fa

0
Entering edit mode

Thank you very much! This is working.

0
Entering edit mode

Hi, I am having the same question as this one. But my headers look different, I tried your solution but it removes all of my headers. Would you be so kind to help me with it?

>Lineage-1-Name-L1
ATGCGGAAAGGGGTTGATCTCGTGACgGC....
>Lineage-2-y2007-018
ATGCGGAAAGGGGTTGATCTCGTGACgG...


Thanks a lot.

0
Entering edit mode
3.2 years ago

Hello,

here an awk solution, assuming that each sequence has only a single line:

cat *.fa|awk -v RS=">" -v FS="\n" -v OFS="\n" '$0 {match($1, /^(.+)_[^_]+$/, name); seq[name[1]] = seq[name[1]]$2}; END {for(id in seq){print ">"id, seq[id]}}' > combined.fa


fin swimmer

0
Entering edit mode

Hi,

I got an error as below:

awk: line 1: syntax error at or near ,


May I know is it because my fasta file is not a single line fasta file?

Thank you.

0
Entering edit mode

Hello again,

no the syntax error is not due to multiline fasta. Maybe you missed something to copy from my command?

As I said, the command above doesn't work for multiline fasta. For this try this:

$cat *.fa|awk -v RS=">" -v FS="\n" -v OFS="\n" '$0 {match($1, /^(.+)_[^_]+$/, name); for(i=2; i<=NF; i++) {seq[name[1]] = seq[name[1]]$i}}; END {for(id in seq){print ">"id, seq[id]}}' > combined.fa  fin swimmer ADD REPLY 0 Entering edit mode Hi, I got the same error, below is an example of my sequence ID: >Planococcus_faecalis_AJ003-AJGP001_RS02220-1-thiol_reductant_ABC_exporter_subunit_CydC <unknown description> May I know how to see whether the fasta file is multiline or single line? Thank you. ADD REPLY 0 Entering edit mode fec2 for making zero length fasta files: try seqkit seq -w 0 input.fa ADD REPLY 0 Entering edit mode 3.2 years ago Assuming that sequences in files are in exact OP format: output: $ awk -v RS=">" -v OFS="\t" '{print $1,$2}' *.fa | sed '/^\s*$/d;s/_\w\{5\}//2g' | datamash -sg 1 collapse 2 | sed 's/,//g' | awk -v OFS="\n" '{print ">"$1,$2}' >My_bacteriaA atgatgatggtaatggta >My_bacteriaB atgaagatggtcatggtg >My_bacteriaC atgatgatggtcatggta  input (from OP); $ tail -n +1  *.fa
==> genea.fa <==
> My_bacteriaA_GeneA
atgatg
> My_bacteriaB_GeneA
atgaag
> My_bacteriaC_GeneA
atgatg

==> geneb.fa <==
> My_bacteriaB_GeneB
atggtc
> My_bacteriaC_GeneB
atggtc
> My_bacteriaA_GeneB
atggta

==> genez.fa <==
> My_bacteriaA_GeneZ
atggta
> My_bacteriaC_GeneZ
atggta
> My_bacteriaB_GeneZ
atggtg

0
Entering edit mode

Thank you, but is not really working, there is no output file in my folder.

Here is an example of my input sequence ID:

>Planococcus_faecalis_AJ003-AJGP001_RS02220-1-thiol_reductant_ABC_exporter_subunit_CydC <unknown description>

0
Entering edit mode

Hello fec2,

Please use the formatting bar (especially the code option) to present your post better.

I cannot clearly see how your sequence ID looks like.

Thank you!

0
Entering edit mode

I am sorry I am new in this field and this forum. I have edited the comment above. Thank you.

0
Entering edit mode

hi fec2 could you please post few more headers? pattern can be improved if we have more fasta headers.

0
Entering edit mode

Hi, here is example of 1 gene file:

>Planococcus_faecalis_AJ003-AJGP001_RS02220-1-thiol_reductant_ABC_exporter_subunit_CydC <unknown description>
>Planococcus_kocurii_ATCC_43650-AUO94_RS06550-1-thiol_reductant_ABC_exporter_subunit_CydC <unknown description>
>Planococcus_antarcticus_DSM14505-BBH88_RS02420-1-thiol_reductant_ABC_exporter_subunit_CydC <unknown description>
>Planococcus_versutus_L10_15-I858_RS02180-1-thiol_reductant_ABC_exporter_subunit_CydC <unknown description>
>Planococcus_halocryophilus_DSM24743-BBI08_RS02205-1-thiol_reductant_ABC_exporter_subunit_CydC <unknown description>
>Planococcus_sp_PAMC_21323-PLANO_RS01740-1-thiol_reductant_ABC_exporter_subunit_CydC <unknown description>
>Planococcus_donghaensis_MPA1U2-GPDM_RS01230-63-cydC_thiol_reductant_ABC_exporter_subunit_CydC <unknown description>
>Planococcus_donghaensis_DSM22276-BCM40_RS02220-1-thiol_reductant_ABC_exporter_subunit_CydC <unknown description>
>Planococcus_sp_SCU63-DP120_RS12840-13-cydC_thiol_reductant_ABC_exporter_subunit_CydC <unknown description>
>Planococcus_sp_CAU13-KW93_RS09120-323-thiol_reductant_ABC_exporter_subunit_CydC <unknown description>
>Planococcus_maritimus_Y42-B0X71_RS16755-1-thiol_reductant_ABC_exporter_subunit_CydC <unknown description>
>Planococcus_maritimus_SAMP-BHE17_RS12170-31-thiol_reductant_ABC_exporter_subunit_CydC <unknown description>
>Planococcus_maritimus_MKU009-AY633_RS12345-32-thiol_reductant_ABC_exporter_subunit_CydC <unknown description>

0
Entering edit mode

Could you please explain how the bacterias are delimited from the genes?

0
Entering edit mode

The pipeline using NCBI genebank file as input.

The ID actually contain

>Bacteria species and strain name-locus tag of the gene-gene name


For example for

>Planococcus_faecalis_AJ003-AJGP001_RS02220-1-thiol_reductant_ABC_exporter_subunit_CydC <unknown description>

>Planococcus_faecalis_AJ003(Bacteria species and strain name)-AJGP001_RS02220(locus tag of the gene)-1-thiol_reductant_ABC_exporter_subunit_CydC(gene name) <unknown description>


Thank you.

0
Entering edit mode

$for f in *.fa; do seqkit replace -p '^(.+?)-(.+)$' -r '$1$2' $f > f.$f; done

\$ seqkit concat f.*.fa

0
Entering edit mode

Sorry for not showing the real example in the beginning. The command is working but the ID of each bacteria all become 'S1'. I can't differentiate them. May I know have I done anything wrong?

I also noticed that for each f.*.fa file, the multiple fasta are in different order, I guess it just removed the bacteria species and strain name. As I have raised in the beginning, all the multiple fasta sequence are in random order in each file. Will it be affected when I concatenate them?

Traffic: 1611 users visited in the last hour
FAQ
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.