Convert large text file back into a FASTA file
4.8 years ago
Angie11 • 0

Hello, does anyone know of a tool to convert a large .txt file (~850 MB) back into a FASTA file?

I converted a FASTA sequence file into a txt file using fasta_formatter from the FASTX-Toolkit on Linux, so that I could combine the accession numbers with descriptions from another txt file using the join command. The result is now ~850 MB, and I need to convert it back to a FASTA file.

Thank you, Angie

sequence • 1.4k views

Can you give an example of what the file looks like?

head your.file

Formatted by @RamRS:

Bacteroides genus root|cellular organisms|Bacteria|Bacteroidetes/Chlorobi group|Bacteroidetes|Bacteroidia|Bacteroidales|Bacteroidaceae|Bacteroides no rank|no rank|superkingdom|superphylum|phylum|class|order|family|genus10_GL0085768 [gene] locus=scaffold30815_5:6223:7791:- [Complete] codon-table.11
MNSEIERRRTFAIIAHPDAGKTSLTEKLLLFGGQIQVAGAVKSNKIKKTATSDWMDIEKQRGISVTTSVMEFDYNDYKINILDTPGHQDFAEDTYRTLTAVDSVIIVVDGAKGVETQTRKLMEVCRMRNTPVIIFVNKMDREAKDPFDLLDELEEELIINVRPLTWPIESGPRFKGVYNLYEHKLNLFQPSKQVVTEKVEVDINTEELDNQIGAPLAEKLRGELELVDGVYPEFNVEEYLKGEMAPVFFGSALNNFGVQELLDTFVEIAPSPRPTKTEEREVEPDEPKFTGFVFKITANIDPNHRSCIAFCKICSGKFSRNTPYYHVRHDKTMRFSSPTQFMAQRKTTVDEAWAGDIIGLPDNGTFKIGDTLTEGEKLHFRGIPSFSPEMFKYIENADPMKQKQLAKGIDQLMDEGVAQLFINQFNGRKIIGTVGQLQFEVIQYRLENEYNAKCRWEPISLYKACWVESDDPEELEAFKKRKYQYMAKDREGRDVFLADSNYVLQMAQMDFKHIKFHFTSEF 1/1
Bacteroides vulgatus species root|cellular organisms|Bacteria|Bacteroidetes/Chlorobi group|Bacteroidetes|Bacteroidia|Bacteroidales|Bacteroidaceae|Bacteroides|Bacteroides vulgatus no rank|no rank|superkingdom|superphylum|phylum|class|order|family|genus|species10_GL0085769 [gene] locus=scaffold30815_5:7798:8655:- [Complete] codon-table.11
MKNILVTGANGQLGNEMRVLSAEYKEYTCFFTDVAELDICDEQAVMTFVKENNIHVIVNCAAYTAVDKAEDDIELCTKLNKNAVSYLAKAAEANWGEFIQISTDYVFDGTKHLPYNEGDVPCPNSVYGKTKLAGETNALEYCKKTMIIRTAWLYSTFGNNFVKTMLRLGKEKETLGVVFDQIGTPTYARDLARAIFTAIYKGVVPGVYHFSDEGVCSWYDFTKAIHRIAGITTCKVSPLHTNEYPAKAPRPHYSVLDKTKIKTTYNIEIPHWEESLEACIKELNA

Original content:

Bacteroides genus root|cellular organisms|Bacteria|Bacteroidetes/Chlorobi group|Bacteroidetes|Bacteroidia|Bacteroidales|Bacteroidaceae|Bacteroides no rank|no rank|superkingdom|superphylum|phylum|class|order|family|genus10_GL0085768 [gene] locus=scaffold30815_5:6223:7791:- [Complete] codon-table.11MNSEIERRRTFAIIAHPDAGKTSLTEKLLLFGGQIQVAGAVKSNKIKKTATSDWMDIEKQRGISVTTSVMEFDYNDYKINILDTPGHQDFAEDTYRTLTAVDSVIIVVDGAKGVETQTRKLMEVCRMRNTPVIIFVNKMDREAKDPFDLLDELEEELIINVRPLTWPIESGPRFKGVYNLYEHKLNLFQPSKQVVTEKVEVDINTEELDNQIGAPLAEKLRGELELVDGVYPEFNVEEYLKGEMAPVFFGSALNNFGVQELLDTFVEIAPSPRPTKTEEREVEPDEPKFTGFVFKITANIDPNHRSCIAFCKICSGKFSRNTPYYHVRHDKTMRFSSPTQFMAQRKTTVDEAWAGDIIGLPDNGTFKIGDTLTEGEKLHFRGIPSFSPEMFKYIENADPMKQKQLAKGIDQLMDEGVAQLFINQFNGRKIIGTVGQLQFEVIQYRLENEYNAKCRWEPISLYKACWVESDDPEELEAFKKRKYQYMAKDREGRDVFLADSNYVLQMAQMDFKHIKFHFTSEF 1/1Bacteroides vulgatus species root|cellular organisms|Bacteria|Bacteroidetes/Chlorobi group|Bacteroidetes|Bacteroidia|Bacteroidales|Bacteroidaceae|Bacteroides|Bacteroides vulgatus no rank|no rank|superkingdom|superphylum|phylum|class|order|family|genus|species10_GL0085769 [gene] locus=scaffold30815_5:7798:8655:- [Complete] codon-table.11MKNILVTGANGQLGNEMRVLSAEYKEYTCFFTDVAELDICDEQAVMTFVKENNIHVIVNCAAYTAVDKAEDDIELCTKLNKNAVSYLAKAAEANWGEFIQISTDYVFDGTKHLPYNEGDVPCPNSVYGKTKLAGETNALEYCKKTMIIRTAWLYSTFGNNFVKTMLRLGKEKETLGVVFDQIGTPTYARDLARAIFTAIYKGVVPGVYHFSDEGVCSWYDFTKAIHRIAGITTCKVSPLHTNEYPAKAPRPHYSVLDKTKIKTTYNIEIPHWEESLEACIKELNA


@RamRS, What command did you use to reformat this? Thank you!


The code option in the formatting bar (10101 button).


I see, thank you. Is there a Linux command that reformats a file in the messy format above into two lines per record, as RamRS did? Essentially, I want to do what the 10101 button does, but from the command line and on a large txt file.


input:

$ cat test.txt 
Bacteroides genus root|cellular organisms|Bacteria|Bacteroidetes/Chlorobi group|Bacteroidetes|Bacteroidia|Bacteroidales|Bacteroidaceae|Bacteroides no rank|no rank|superkingdom|superphylum|phylum|class|order|family|genus10_GL0085768 [gene] locus=scaffold30815_5:6223:7791:- [Complete] codon-table.11
MNSEIERRRTFAIIAHPDAGKTSLTEKLLLFGGQIQVAGAVKSNKIKKTATSDWMDIEKQRGISVTTSVMEFDYNDYKINILDTPGHQDFAEDTYRTLTAVDSVIIVVDGAKGVETQTRKLMEVCRMRNTPVIIFVNKMDREAKDPFDLLDELEEELIINVRPLTWPIESGPRFKGVYNLYEHKLNLFQPSKQVVTEKVEVDINTEELDNQIGAPLAEKLRGELELVDGVYPEFNVEEYLKGEMAPVFFGSALNNFGVQELLDTFVEIAPSPRPTKTEEREVEPDEPKFTGFVFKITANIDPNHRSCIAFCKICSGKFSRNTPYYHVRHDKTMRFSSPTQFMAQRKTTVDEAWAGDIIGLPDNGTFKIGDTLTEGEKLHFRGIPSFSPEMFKYIENADPMKQKQLAKGIDQLMDEGVAQLFINQFNGRKIIGTVGQLQFEVIQYRLENEYNAKCRWEPISLYKACWVESDDPEELEAFKKRKYQYMAKDREGRDVFLADSNYVLQMAQMDFKHIKFHFTSEF 1/1
Bacteroides vulgatus species root|cellular organisms|Bacteria|Bacteroidetes/Chlorobi group|Bacteroidetes|Bacteroidia|Bacteroidales|Bacteroidaceae|Bacteroides|Bacteroides vulgatus no rank|no rank|superkingdom|superphylum|phylum|class|order|family|genus|species10_GL0085769 [gene] locus=scaffold30815_5:7798:8655:- [Complete] codon-table.11
MKNILVTGANGQLGNEMRVLSAEYKEYTCFFTDVAELDICDEQAVMTFVKENNIHVIVNCAAYTAVDKAEDDIELCTKLNKNAVSYLAKAAEANWGEFIQISTDYVFDGTKHLPYNEGDVPCPNSVYGKTKLAGETNALEYCKKTMIIRTAWLYSTFGNNFVKTMLRLGKEKETLGVVFDQIGTPTYARDLARAIFTAIYKGVVPGVYHFSDEGVCSWYDFTKAIHRIAGITTCKVSPLHTNEYPAKAPRPHYSVLDKTKIKTTYNIEIPHWEESLEACIKELNA

output:

$ sed '1~2 s/^/>/; 2~2 s/\s.*//' test.txt 

>Bacteroides genus root|cellular organisms|Bacteria|Bacteroidetes/Chlorobi group|Bacteroidetes|Bacteroidia|Bacteroidales|Bacteroidaceae|Bacteroides no rank|no rank|superkingdom|superphylum|phylum|class|order|family|genus10_GL0085768 [gene] locus=scaffold30815_5:6223:7791:- [Complete] codon-table.11
MNSEIERRRTFAIIAHPDAGKTSLTEKLLLFGGQIQVAGAVKSNKIKKTATSDWMDIEKQRGISVTTSVMEFDYNDYKINILDTPGHQDFAEDTYRTLTAVDSVIIVVDGAKGVETQTRKLMEVCRMRNTPVIIFVNKMDREAKDPFDLLDELEEELIINVRPLTWPIESGPRFKGVYNLYEHKLNLFQPSKQVVTEKVEVDINTEELDNQIGAPLAEKLRGELELVDGVYPEFNVEEYLKGEMAPVFFGSALNNFGVQELLDTFVEIAPSPRPTKTEEREVEPDEPKFTGFVFKITANIDPNHRSCIAFCKICSGKFSRNTPYYHVRHDKTMRFSSPTQFMAQRKTTVDEAWAGDIIGLPDNGTFKIGDTLTEGEKLHFRGIPSFSPEMFKYIENADPMKQKQLAKGIDQLMDEGVAQLFINQFNGRKIIGTVGQLQFEVIQYRLENEYNAKCRWEPISLYKACWVESDDPEELEAFKKRKYQYMAKDREGRDVFLADSNYVLQMAQMDFKHIKFHFTSEF
>Bacteroides vulgatus species root|cellular organisms|Bacteria|Bacteroidetes/Chlorobi group|Bacteroidetes|Bacteroidia|Bacteroidales|Bacteroidaceae|Bacteroides|Bacteroides vulgatus no rank|no rank|superkingdom|superphylum|phylum|class|order|family|genus|species10_GL0085769 [gene] locus=scaffold30815_5:7798:8655:- [Complete] codon-table.11
MKNILVTGANGQLGNEMRVLSAEYKEYTCFFTDVAELDICDEQAVMTFVKENNIHVIVNCAAYTAVDKAEDDIELCTKLNKNAVSYLAKAAEANWGEFIQISTDYVFDGTKHLPYNEGDVPCPNSVYGKTKLAGETNALEYCKKTMIIRTAWLYSTFGNNFVKTMLRLGKEKETLGVVFDQIGTPTYARDLARAIFTAIYKGVVPGVYHFSDEGVCSWYDFTKAIHRIAGITTCKVSPLHTNEYPAKAPRPHYSVLDKTKIKTTYNIEIPHWEESLEACIKELNA
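
Note that the `first~step` addresses (`1~2`, `2~2`) and `\s` are GNU sed extensions, so the command above may not run with other sed implementations (for example on macOS or BSD). Below is a minimal awk sketch that should give the same result. The function name `two_line_to_fasta` is made up for illustration, and it assumes the file strictly alternates header and sequence lines.

```shell
# Hypothetical helper: convert alternating header/sequence lines to FASTA.
# Assumes lines 1, 3, 5, ... are headers and lines 2, 4, 6, ... are sequences.
two_line_to_fasta() {
    awk '
        # odd line: prefix the whole header with ">"
        NR % 2 == 1 { print ">" $0; next }
        # even line: drop everything from the first space or tab onward
        { sub(/[ \t].*$/, ""); print }
    ' "$@"
}
```

Usage would be `two_line_to_fasta test.txt > out.fa`. awk processes the input line by line, so the size of the file should not be a problem for memory.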
4.8 years ago

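Your example is not formatted. Assuming there are two lines per record (a header line followed by a sequence line), keep the whole header and only the first field of the sequence line:

awk '{ if (NR % 2 == 1) print ">" $0; else print $1 }' input.txt > out.fa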

If there is only one line per record and the separator is a tab:

awk -F '\t'  '{printf(">%s\n%s\n",$1,$2);}' input.txt > out.fa
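
To show the one-line case end to end, here is a small sketch on made-up data. The accessions `acc1` and `acc2`, the file names, and the helper `tab_to_fasta` are placeholders, not taken from the thread. It wraps the same tab-separated awk command and then compares the number of `>` lines with the number of input lines, which is a cheap sanity check before trusting the full-size output.

```shell
# Hypothetical helper wrapping the tab-separated command above:
# column 1 becomes the header, column 2 the sequence.
tab_to_fasta() {
    awk -F '\t' '{ printf(">%s\n%s\n", $1, $2) }' "$@"
}

# Made-up two-record sample in the form: header<TAB>sequence
workdir=$(mktemp -d)
printf 'acc1 some description\tMNSEIERRR\nacc2 another description\tMKNILVTGA\n' > "$workdir/sample.txt"

tab_to_fasta "$workdir/sample.txt" > "$workdir/sample.fa"

# One record per input line, so these two counts should agree.
n_in=$(wc -l < "$workdir/sample.txt")
n_out=$(grep -c '^>' "$workdir/sample.fa")
echo "input lines: $((n_in)), FASTA records: $n_out"
```

If the joined file has more than two tab-separated columns, the field numbers `$1` and `$2` would need to be adjusted to match.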

It seems to be working! Thank you so much Pierre :)
