File conversion in bash, discrepancy in number of lines after conversion.
2
0
Entering edit mode
2.4 years ago
ATCG ▴ 370

csv file in excel and saved it as .txt and when I count the lines using wc -l there is one line less in the .txt format 17768 vs 17769.

1. Is this normal? and if so which line is lost in the conversion?
2. What is the easiest way to make this conversion in bash?

Thanks!

file conversion csv txt • 1.2k views
1
Entering edit mode

Hello again. You could check the output of head and tail on both files, and that should reveal the discrepancy.

Kevin

0
Entering edit mode

I did all lines match in both files

1
Entering edit mode

You could try the diff command. The discrepancy is likely related to encoding and end-line characters, e.g., Windows carriage returns (\r) versus end-line characters on Linux (\n)

0
Entering edit mode

head/tail would have freaked out on one of the files if the line endings had been different between them.

0
Entering edit mode

The initial assumption with head and tail was that there may have been a blank line at the beginning or end of one of the files. We cannot see the user's actions from where we are sitting.

1
Entering edit mode
2.4 years ago
GenoMax 121k

cat results1.csv | tr "," "\t"  > results1.txt

0
Entering edit mode

You are the winner!!! Thank you!!

1
Entering edit mode
2.4 years ago
Ram 37k

wc -l counts the number of new lines, not the number of lines. If your file ends with the last character of the last line, it will amount to one fewer line than the other file but diff would ignore differences. Try running a diff.

echo "Hello World" | wc
1 2 12


As you can see, the above has  line and 12 characters, whereas the one below where the new line is suppressed has 0 lines and 11 characters.

echo -n "Hello World" | wc
0 2 11

0
Entering edit mode

For future reference, how can I convert the .csv file to a .txt file in bash?

1
Entering edit mode

.csv is a file name extension for comma separated values while .txt is a file name extension for a plain text file. The former strongly alludes to its content format (comma separated values) whereas the latter only speaks of the nature of the content (plain text), not its format.

What is the difference between the content in the two files? If they're the same, you can simply copy/soft-link one to the other (an unnecessary step). If there is an actual difference, you could use sed to go from one to the other.

0
Entering edit mode

Thanks for the clarification. Suppose I have a .csv file and want to convert/save as this file to a .txt file how would I do that in bash? Usually what I do is from excel I save the .csv file as a .txt

1
Entering edit mode

You don't need to do that. Most programs that accept a .txt input will accept a .csv input. If they don't just rename the file.

1
Entering edit mode

And to rename the file in the console you can use mv file.csv file.txt, just in case that's what you were looking for.

1
Entering edit mode

If you check with file *.csv you will see that operating system will identify (a real csv) file as ASCII text.

0
Entering edit mode

Thanks but my questions is very simple I only have curiosity to know how I would save a .csv as a .txt. I looked this up and it seemed overwhelmingly complicated and I don’t understand why, it should be simple a simple command that can take one and produce the other.

3
Entering edit mode

No need to convert. csv is a text file where all the columns in the text file are separated by commas. If it is a text file where all the columns are separated by tab, then it is a tsv file. Based on the separator (in this case comma), text files are called csv, tsv etc. An apt question could be how to convert the one separator (comma) to another separator (tab) for any given text file.

2
Entering edit mode

All CSV files are TXT files, but not all TXT files are CSV files

0
Entering edit mode

this is all I need to know: FBgn0286941,0,0,0,NA,NA,NA,NA,NA,NA,NA,NA,0,0,0,0,0,0

not what I want, I want to print tab-separated fields like this:

FBgn0286941 0 0 0 NA NA NA NA NA NA NA NA 0 0 0 0 0 0

How to do this???

1
Entering edit mode
cat file |    tr "," "\t"  > new_file


Example

\$ echo FBgn0286941,0,0,0,NA,NA,NA,NA,NA,NA,NA,NA,0,0,0,0,0,0 | tr "," "\t"
FBgn0286941     0       0       0       NA      NA      NA      NA      NA      NA      NA      NA      0       0       0       0       0 0

0
Entering edit mode

tr "," "\t" results1.csv > results1.txt

usage: tr [-Ccsu] string1 string2 tr [-Ccu] -d string1 tr [-Ccu] -s string1 tr [-Ccu] -ds string1 string2

0 Apr 29 10:51 results1.txt

1
Entering edit mode

As mentioned csv is already a plain text format with "," or ";" as a separator. Two options:

1. replace the comma with another separator of your choice, for example by using tr
2. just rename the extension: mv myfile.csv myfile.txt.
0
Entering edit mode

doing that will print: cat test.txt

FBgn0286941,0,0,0,NA,NA,NA,NA,NA,NA,NA,NA,0,0,0,0,0,0

not what I want, I want to print tab-separated fields like this:

FBgn0286941 0 0 0 NA NA NA NA NA NA NA NA 0 0 0 0 0 0

Also, I don't want to replace the file so instead of mv one would use cp to keep the csv and the txt versions

1
Entering edit mode

You can open the CSV file in Excel, again, but ensure that Excel has correctly interpreted that the commas are the field delimiters. If it has correctly interpreted this, then the values will appear in multiple columns. Then save the file as TXT (tab-delimited).

Alternatively, if you have Mac or Linux, open a terminal and convert all commas to tabs via:

sed 's/,/\t/g' test.csv > test.tsv

0
Entering edit mode

This sed command will print:

FBgn0033358t502.2t899.9t104.6t0.1t-3.1t2.94230548037037e-60t4.57293117759163e-56t0t0t876.1t995.1t828.4t93.4t119.6t100.7

1
Entering edit mode

It should not:

echo "FBgn0286941,0,0,0,NA,NA,NA,NA,NA,NA,NA,NA,0,0,0,0,0,0" |\
sed 's/,/\t/g'
FBgn0286941 0   0   0   NA  NA  NA  NA  NA  NA  NA  NA  0   0   0   0   0   0


The functioning of sed` will be system-dependent. I use linux (Ubuntu), and GNU sed v4.2.2.

Thank you.

0
Entering edit mode