Question: How to replace all spaces except the first with semicolons
Bioinfonext (220) wrote:

Hi,

I need to generate a taxonomy text file in which the fields are separated by semicolons instead of spaces, but the first space after the gene ID should be kept.

AJQY01000137.1  Bacteria     Firmicutes  Bacilli     Lactobacillales     Streptococcaceae    Streptococcus   
AJRA01000005.1  Bacteria     Firmicutes  Bacilli     Lactobacillales     Streptococcaceae    Streptococcus   
AJRA01000158.1  Bacteria     Firmicutes  Bacilli     Lactobacillales     Streptococcaceae    Streptococcus

I need to have output like this:

AJQY01000137.1  Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus   
AJRA01000005.1  Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus   
AJRA01000158.1  Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus
linux bash • 400 views
modified 10 months ago • written 10 months ago by Bioinfonext (220)

This Perl one-liner could work:

perl -lane '{printf ((shift @F)." ");print join(";",@F) }' your_input_file

or with sed (GNU sed), skipping the first match:

sed -s 's/\s\+/;/2g' your_input_file
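
One caveat with the sed approach, sketched here under the assumption of GNU sed (`\s`, `\+` and the combined `2g` flag are GNU extensions): some lines of the sample input end in trailing whitespace, and `2g` would turn that into a trailing `;` as well. Stripping trailing whitespace first avoids this. The function name `semicolon_after_first` is made up for illustration.

```shell
# Hypothetical helper; reads stdin or the files given as arguments (GNU sed assumed).
semicolon_after_first() {
    # 1st expression: drop trailing whitespace.
    # 2nd expression: replace the 2nd and every later whitespace run with ";".
    sed -e 's/\s\+$//' -e 's/\s\+/;/2g' "$@"
}

printf 'AJQY01000137.1  Bacteria     Firmicutes  Bacilli   \n' | semicolon_after_first
# prints: AJQY01000137.1  Bacteria;Firmicutes;Bacilli
```

Blank lines pass through unchanged, and the original whitespace after the ID is preserved.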

modified 10 months ago • written 10 months ago by microfuge (1.7k)

Thanks,

but this also removed the spaces between lines:

perl -lane '{printf ((shift @F)." ");print join(";",@F) }'

I do not want to remove the spaces between lines.

modified 10 months ago • written 10 months ago by Bioinfonext (220)

Use awk and based on the value of NF, treat $1 and the rest of the fields differently. This way, you can retain blank lines while custom formatting other lines. Figuring out the awk code for yourself will be a good learning exercise.
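
For readers who want to check their attempt, here is a minimal sketch of the NF-based idea described above. It is one possible reading of the suggestion, not necessarily what was intended, and the function name `join_taxonomy` is hypothetical.

```shell
# Hypothetical helper; reads stdin or the files given as arguments.
join_taxonomy() {
    awk '
    NF == 0 { print; next }              # no fields: blank line, pass through
    {
        line = $1                        # gene ID
        for (i = 2; i <= NF; i++)        # space before field 2, ";" before the rest
            line = line (i == 2 ? " " : ";") $i
        print line
    }' "$@"
}

printf 'AJQY01000137.1  Bacteria     Firmicutes  Bacilli\n\nAJRA01000005.1  Bacteria\n' | join_taxonomy
```

Because awk rebuilds each line, the gap after the ID becomes a single space, and lines with any number of fields are handled without listing `$2` to `$7` by hand.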

written 10 months ago by RamRS (27k)

Same question as in my earlier post: why is this question considered within scope when it deals with simple manipulation of text columns? Is it that the biological content of the text makes it relevant to bioinformatics?

By the way, consider this command:

awk '{print $1, $2";"$3";"$4";"$5";"$6";"$7}' input_file > output_file

written 10 months ago by Mensur Dlakic (5.8k)

Thanks. I tried the same command after Ram's suggestion, but it gives only the first line as output. Is there an issue with my input file, or does the command need to be modified to run over all the lines?

ABYV02000002.1 Archaea;Euryarchaeota;Methanobacteria;Methanobacteriales;Methanobacteriaceae;Methanobrevibacter

Thanks Bioinfonext

modified 10 months ago • written 10 months ago by Bioinfonext (220)

When I save the Excel sheet in tab-delimited format, the result looks odd and a ^M character has somehow been inserted: the rows of the Excel sheet are not saved on separate lines.

ABYV02000002.1  Archaea  Euryarchaeota   Methanobacteria         Methanobacteriales      Methanobacteriaceae     Methanobrevibacter      Methanobrevibacter smithii DSM 2374**^M**ABYV02000006.1    Archaea  Euryarchaeota   Methanobacteria         Methanobacteriales      Methanobacteriaceae     Methanobrevibacter      Methanobrevibacter smithii DSM 2374^MABYW01000005.1    Archaea  Euryarchaeota   Methanobacteria         Methanobacteriales      Methanobacteriaceae     Methanobrevibacter      Methanobrevibacter smithii DSM 2375**^M**ABYW01000007.1
written 10 months ago by Bioinfonext (220)

The Excel sheet looks like this:

ABYV02000002.1  Archaea  Euryarchaeota   Methanobacteria     Methanobacteriales  Methanobacteriaceae     Methanobrevibacter  Methanobrevibacter smithii DSM 2374
ABYV02000006.1  Archaea  Euryarchaeota   Methanobacteria     Methanobacteriales  Methanobacteriaceae     Methanobrevibacter  Methanobrevibacter smithii DSM 2374
ABYW01000005.1  Archaea  Euryarchaeota   Methanobacteria     Methanobacteriales  Methanobacteriaceae     Methanobrevibacter  Methanobrevibacter smithii DSM 2375
ABYW01000007.1  Archaea  Euryarchaeota   Methanobacteria     Methanobacteriales  Methanobacteriaceae     Methanobrevibacter  Methanobrevibacter smithii DSM 2375
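
If the exported file really is tab-delimited (an assumption, since tabs are not visible in the paste above), splitting on tabs rather than on any whitespace keeps a multi-word last column such as the species name in one piece. A rough sketch, with the hypothetical function name `join_tab_fields`:

```shell
# Hypothetical helper; assumes tab-separated input with Unix line endings.
join_tab_fields() {
    awk -F '\t' '
    NF == 0 { print; next }              # blank line: pass through
    {
        line = $1                        # gene ID
        for (i = 2; i <= NF; i++)        # space before field 2, ";" before the rest
            line = line (i == 2 ? " " : ";") $i
        print line
    }' "$@"
}

printf 'ABYV02000002.1\tArchaea\tEuryarchaeota\tMethanobrevibacter smithii DSM 2374\n' | join_tab_fields
# prints: ABYV02000002.1 Archaea;Euryarchaeota;Methanobrevibacter smithii DSM 2374
```

With a tab as the separator, two consecutive tabs count as an empty field and would show up as `;;` in the output.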
written 10 months ago by Bioinfonext (220)

Hi @bioinfonext, as Mensur Dlakic suggested, this is not a problem specific to bioinformatics. It's a classic sorting problem and now, in addition, a classic Windows/Unix newline problem. Whatever editor you use to view the tab-delimited file doesn't seem to handle the Windows carriage return very well (see this post on Stack Overflow). It's also likely that you stripped the newline character at some stage in the process, which is why you see everything on one line.

These are classic beginner's errors that all of us have made. You can solve them easily with your own web search, and I guarantee this will prepare you for the future.

written 10 months ago by Carambakaracho (2.2k)

Thanks for all your help.

This command works for me:

cat final.taxonomy.txt | tr "\r" "\n" > final.taxonomy2.txt

Thanks again
Bioinfonext

modified 10 months ago • written 10 months ago by Bioinfonext (220)

We are not looking at your computer, @Bioinfonext. Please use a package to read this data into R if you're having difficulties working on the content - these comments are just asking us for a lot of handholding.

written 10 months ago by RamRS (27k)

You'd be better off copy-pasting from Excel to a plain text application (TextWrangler/Sublime Text/Atom/Notepad++) than using Excel to save the document.

You can use one of the above tools to open the document and try and change line endings, invalid characters, etc.

written 10 months ago by RamRS (27k)

We try to be as lenient as possible, but we are aware that drawing a well-defined line is a problem. You are welcome to discuss and offer solutions on our Slack channel.

written 10 months ago by RamRS (27k)