Obtaining a concatenation of sequences
5.7 years ago
annalisa79 ▴ 20

Dear all, I was wondering if anyone could help me obtain a concatenation of sequences in the way shown below. I have several multi-FASTA files, each containing the sequences of one gene (ABC, GHJ…) from different organisms (>182680572, >749299147…).

Gene ABC

>182680572
ATGGAATCTTGGTCCCGTTGCCTGGAACGTCTTGAAACTGAATTTCCACCAGAAGATGTTCATACTTGGTTGAGACCTTTACAAGCTGACCAACGCGGTGACAGTGTCATCCTTTACGCACCCAATACCTTTATCATTGAACTAGTAGAAGAGCGATA
>749299147
ATGACAACATTGATGGAATCTTGGTCCCGTTGCCTGGAACGTCTTGAAACTGAATTTCCACCAGAAGATGTTCATACTTGGTTGAGACCTTTACAAGCTGACCAACGCGGTGACAGTGTCATCCTTTACGCACCCAATACCTTTATCATTGAACTAGTAGAAGAGCGATACTTAGGGCGTCTTCGAGAATTGTTATCCTATTTTTCAGGAATACGTGAAGTAGTCCTTGCAATTGGCA
>584117620
ATGGAATCTTGGTCCCGTTGCCTGGAACGTCTTGAAACTGAATTTCCACCAGAAGATGTTCATACTTGGTTGAGACCTTTACAAGCTGACCAACGCGGTGACAGTGTCATCCTTTACGCACCCAATACCTTTATCATTGAACTAGTAGAAGAGCGATACTTAGGGCGTCTTCGAGAATTGTTATCCTATTTTTCAGGAATACGTGAAGTAGTCCTTGCAATTGGCTCACGACCTAA
>985743106
ATGACAACATTGATGGAATCTTGGTCCCGTTGCCTGGAACGTCTTGAAACTGAATTCCCGCCAGAAGATGTTCATACTTGGTTAAGACCTTTACAAGCCGACCAACGTGGTGACAGTGTCGTCCTTTACGCACCGAATCCCTTTATCATTGAACTAGTAGAAGAGCGATACTTAGGACGTCTTCGGGAATTGTTATCCTATTTTTCAGGAATACGTGAAGTAGTCCTTGCAATTGG

Gene GHJ

>182680572
ATGGAATCTTGGTCCCGTTGCCTGGAACGTCTTGAAACTGAATTTCCACCAGAAGATGTTCATACTTGGTTGAGACCTTTACAAGCTGACCAACGCGGTGACAGTGTCATCCTTTACGCACCCAATACCTTTATCATTGAACTAGTAGAAGAGCGATA
>749299147
ATGACAACATTGATGGAATCTTGGTCCCGTTGCCTGGAACGTCTTGAAACTGAATTTCCACCAGAAGATGTTCATACTTGGTTGAGACCTTTACAAGCTGACCAACGCGGTGACAGTGTCATCCTTTACGCACCCAATACCTTTATCATTGAACTAGTAGAAGAGCGATACTTAGGGCGTCTTCGAGAATTGTTATCCTATTTTTCAGGAATACGTGAAGTAGTCCTTGCAATTGGCA
>584117620
ATGGAATCTTGGTCCCGTTGCCTGGAACGTCTTGAAACTGAATTTCCACCAGAAGATGTTCATACTTGGTTGAGACCTTTACAAGCTGACCAACGCGGTGACAGTGTCATCCTTTACGCACCCAATACCTTTATCATTGAACTAGTAGAAGAGCGATACTTAGGGCGTCTTCGAGAATTGTTATCCTATTTTTCAGGAATACGTGAAGTAGTCCTTGCAATTGGCTCACGACCTAA
>985743106
ATGACAACATTGATGGAATCTTGGTCCCGTTGCCTGGAACGTCTTGAAACTGAATTCCCGCCAGAAGATGTTCATACTTGGTTAAGACCTTTACAAGCCGACCAACGTGGTGACAGTGTCGTCCTTTACGCACCGAATCCCTTTATCATTGAACTAGTAGAAGAGCGATACTTAGGACGTCTTCGGGAATTGTTATCCTATTTTTCAGGAATACGTGAAGTAGTCCTTGCAATTGG

I would then like to obtain, for each organism, a single concatenated sequence of the genes, with the genes in the same order for every organism:

>182680572
ATGGAATCTTGGTCCCGTTGCCTGGAACGTCTTGAAACTGAATTTCCACCAGAAGATGTTCATACTTGGTTGAGACCTTTACAAGCTGACCAACGCGGTGACAGTGTCATCCTTTACGCACCCAATACCTTTATCATTGAACTAGTAGAAGAGCGATAATGGAATCTTGGTCCCGTTGCCTGGAACGTCTTGAAACTGAATTTCCACCAGAAGATGTTCATACTTGGTTGAGACCTTTACAAGCTGACCAACGCGGTGACAGTGTCATCCTTTACGCACCCAATACCTTTATCATTGAACTAGTAGAAGAGCGATA
>749299147
ATGACAACATTGATGGAATCTTGGTCCCGTTGCCTGGAACGTCTTGAAACTGAATTTCCACCAGAAGATGTTCATACTTGGTTGAGACCTTTACAAGCTGACCAACGCGGTGACAGTGTCATCCTTTACGCACCCAATACCTTTATCATTGAACTAGTAGAAGAGCGATACTTAGGGCGTCTTCGAGAATTGTTATCCTATTTTTCAGGAATACGTGAAGTAGTCCTTGCAATTGGCAATGACAACATTGATGGAATCTTGGTCCCGTTGCCTGGAACGTCTTGAAACTGAATTTCCACCAGAAGATGTTCATACTTGGTTGAGACCTTTACAAGCTGACCAACGCGGTGACAGTGTCATCCTTTACGCACCCAATACCTTTATCATTGAACTAGTAGAAGAGCGATACTTAGGGCGTCTTCGAGAATTGTTATCCTATTTTTCAGGAATACGTGAAGTAGTCCTTGCAATTGGCA
>584117620
ATGGAATCTTGGTCCCGTTGCCTGGAACGTCTTGAAACTGAATTTCCACCAGAAGATGTTCATACTTGGTTGAGACCTTTACAAGCTGACCAACGCGGTGACAGTGTCATCCTTTACGCACCCAATACCTTTATCATTGAACTAGTAGAAGAGCGATACTTAGGGCGTCTTCGAGAATTGTTATCCTATTTTTCAGGAATACGTGAAGTAGTCCTTGCAATTGGCTCACGACCTAAATGGAATCTTGGTCCCGTTGCCTGGAACGTCTTGAAACTGAATTTCCACCAGAAGATGTTCATACTTGGTTGAGACCTTTACAAGCTGACCAACGCGGTGACAGTGTCATCCTTTACGCACCCAATACCTTTATCATTGAACTAGTAGAAGAGCGATACTTAGGGCGTCTTCGAGAATTGTTATCCTATTTTTCAGGAATACGTGAAGTAGTCCTTGCAATTGGCTCACGACCTAA
>985743106
ATGACAACATTGATGGAATCTTGGTCCCGTTGCCTGGAACGTCTTGAAACTGAATTCCCGCCAGAAGATGTTCATACTTGGTTAAGACCTTTACAAGCCGACCAACGTGGTGACAGTGTCGTCCTTTACGCACCGAATCCCTTTATCATTGAACTAGTAGAAGAGCGATACTTAGGACGTCTTCGGGAATTGTTATCCTATTTTTCAGGAATACGTGAAGTAGTCCTTGCAATTGGATGACAACATTGATGGAATCTTGGTCCCGTTGCCTGGAACGTCTTGAAACTGAATTCCCGCCAGAAGATGTTCATACTTGGTTAAGACCTTTACAAGCCGACCAACGTGGTGACAGTGTCGTCCTTTACGCACCGAATCCCTTTATCATTGAACTAGTAGAAGAGCGATACTTAGGACGTCTTCGGGAATTGTTATCCTATTTTTCAGGAATACGTGAAGTAGTCCTTGCAATTGG

Does anyone know how to do this with a Perl/Python script or bioinformatics software?


Hello,

How is the order in which the sequence files should be concatenated determined? Sorted by filename? Set manually?

Why do you want to do this?

fin swimmer


It depends on the user.


Yes, I know. But I'm not sure whether the OP knows that. That's why I'm asking.

5.7 years ago

https://bioinf.shenwei.me/seqkit/usage/#concat

seqkit concat *.fasta > result.fa
5.7 years ago
harish ▴ 450

Assuming you have all the sequences in the same order in multiple files, you can probably do something like:

paste file1 file2 file3 | sed 's/\t>.*//g' | tr -d '\t' > concat.fa

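The "sed" part removes the duplicate FASTA headers that paste places after the first tab on each header line, and "tr" then deletes the remaining tabs so that the sequence lines are joined. This also assumes that each sequence occupies a single line, so that the files line up row by row.

However, if the sequences aren't in the same order, you'll have to do some file manipulation first.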
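Since the question also asks about Python, here is a minimal sketch of the same positional idea, with a guard that refuses to merge when the headers at a given position disagree (the case paste would silently get wrong). The function names `read_fasta` and `concat_by_position` and the shortened toy sequences are made up for illustration; they are not part of any library.

```python
from itertools import zip_longest


def read_fasta(lines):
    """Return a list of (header, sequence) pairs; wrapped sequence lines are joined."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line[1:], ""])
        elif records:
            records[-1][1] += line
    return [(header, seq) for header, seq in records]


def concat_by_position(*files):
    """Concatenate the i-th record of every file, refusing mismatched headers."""
    merged = []
    for group in zip_longest(*files):
        if None in group:
            raise ValueError("files contain different numbers of records")
        headers = {header for header, _ in group}
        if len(headers) != 1:
            raise ValueError(f"headers differ at the same position: {sorted(headers)}")
        merged.append((group[0][0], "".join(seq for _, seq in group)))
    return merged


# Toy data standing in for file1 and file2 (hypothetical, shortened sequences)
gene_abc = read_fasta([">182680572", "ATGGAA", ">749299147", "ATGACA"])
gene_ghj = read_fasta([">182680572", "TCTTGG", ">749299147", "ACATTG"])

result = concat_by_position(gene_abc, gene_ghj)
for header, seq in result:
    print(f">{header}\n{seq}")
```

The order of the arguments decides the gene order in the output, so passing the files explicitly rather than via a wildcard keeps that choice deliberate.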


Hi,

I have the same problem, but my sequences aren't in the same order. May I know what I should do? Thank you.


Just try seqkit. The order of the sequences does not matter.

5.7 years ago

I like seqkit. But here is an awk solution as well:

$ cat *.fa|awk -v RS=">" -v FS="\n" -v OFS="\n" '$0 {seq[$1] = seq[$1]$2}; END {for(id in seq){print ">"id, seq[id]}}'

fin swimmer
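For readers who prefer Python, the same ID-keyed idea can be sketched roughly as follows. Note that the awk one-liner as written appends only the first sequence line of each record (`$2`) and prints IDs in whatever order awk iterates the array. This sketch instead joins wrapped sequence lines, keeps IDs in first-seen order, and reports IDs that are missing from some input. The names `parse_fasta` and `concat_by_id`, the use of the first word of the header as the ID, and the shortened toy sequences are all assumptions made for illustration.

```python
def parse_fasta(text):
    """Yield (record_id, sequence) pairs; wrapped sequence lines are joined."""
    record_id, parts = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if record_id is not None:
                yield record_id, "".join(parts)
            words = line[1:].split()
            # Assumption: the ID is the first word of the header line
            record_id, parts = (words[0] if words else ""), []
        elif line and record_id is not None:
            parts.append(line)
    if record_id is not None:
        yield record_id, "".join(parts)


def concat_by_id(fasta_texts):
    """Append sequences that share an ID, following the order of the inputs."""
    merged = {}   # dicts keep insertion order, so IDs come out in first-seen order
    seen_in = {}  # record_id -> indices of the inputs that contain it
    for index, text in enumerate(fasta_texts):
        for record_id, seq in parse_fasta(text):
            merged[record_id] = merged.get(record_id, "") + seq
            seen_in.setdefault(record_id, set()).add(index)
    incomplete = [rid for rid, files in seen_in.items()
                  if len(files) < len(fasta_texts)]
    return merged, incomplete


# Toy inputs (hypothetical): records in different order, one wrapped sequence,
# and one ID that is present in only one of the two genes
gene_abc = ">182680572\nATGG\nAATC\n>749299147\nATGACA\n"
gene_ghj = ">749299147\nTTGATG\n>182680572\nCCCGTT\n>584117620\nGGCTCA\n"

merged, incomplete = concat_by_id([gene_abc, gene_ghj])
for record_id, seq in merged.items():
    print(f">{record_id}\n{seq}")
print("IDs missing from at least one input:", incomplete)
```

With real data, each file's contents could be read with `pathlib.Path.read_text()` and passed in the desired gene order. Whether an incomplete ID should be dropped, padded with gaps, or kept as is depends on the downstream analysis, so the sketch only reports it.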


Hi, may I know whether this awk script can be applied in my case, where the sequence IDs are slightly different?


Ah, now I see how your question differs from this one. I will reopen it, as this difference is important. Let's discuss it there.

I also deleted your posts here to keep the thread focused on the OP's problem description.

fin swimmer
