How to multiply all columns of a .csv table with a constant extracted from another table?
3.6 years ago

I have 20 files with names like ERR260136.genefamilies.csv, ERR276187.genefamilies.csv, etc. Each file has to be multiplied by one constant, and the corresponding constant has to be taken from a .csv file named read_count.csv. The read_count.csv file looks like this:

SampleID Read_counts
ERR260136 25636740
ERR260140 19166076
ERR260145 28011856
ERR260147 27916650
ERR260148 21871928
ERR260150 30130062
ERR260152 17949808

So, ERR260136.genefamilies.csv has to be multiplied by 25636740, ERR260140.genefamilies.csv has to be multiplied by 19166076 and so on...

The 20 files to be multiplied are in this format:

 #Gene Family                                                   ERR260136_Abundance-RPKs
    UNMAPPED                                                            0.445035
    UniRef90_A0A015P9C8                                                 0.00080211
    UniRef90_A0A015P9C8|g__Bacteroides.s__Bacteroides_fragilis          0.00080211
    UniRef90_A5ZYU5                                                     0.000787149

How can I do this with a bash command? Can anyone please help?

Hello, your thread looks very close to something that could be asked in an assignment. Could you show us what you have tried so far, so we can use it as a starting point? Thanks!

You can read your first file with awk, then for every line (SampleID/Read_counts) pipe it into another awk command that opens the corresponding file (SampleID) and multiplies your column by the constant (Read_counts).
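A minimal, self-contained sketch of that idea, assuming the file layouts shown in the post (the demo data below is made up from the post's values, and the .scaled.csv output name is just an illustration):

```shell
# Sketch only: file layouts assumed to match the post; demo data is made up.
cd "$(mktemp -d)"
printf 'SampleID Read_counts\nERR260136 25636740\n' > read_count.csv
printf '# Gene Family\tERR260136_Abundance-RPKs\nUNMAPPED\t0.445035\n' \
    > ERR260136.genefamilies.csv

# outer loop: one line of read_count.csv per sample (skip the header)
tail -n +2 read_count.csv | while read -r sample count; do
  f="${sample}.genefamilies.csv"
  [ -f "$f" ] || continue
  # inner awk: keep the header line, multiply column 2 by the constant
  awk -v m="$count" 'BEGIN { FS = OFS = "\t"; CONVFMT = "%.2f" }
                     NR == 1 { print; next } { $2 *= m; print }' \
      "$f" > "${sample}.scaled.csv"
done
cat ERR260136.scaled.csv
```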

It is not an assignment. I need this operation while analyzing genomic data. Being new to programming, I can only find the read count for a sample in the first file (read_count.csv) like this:

awk '/ERR260136/{print}' read_count.csv | awk '{print $2}'
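For what it's worth, the pattern match and the column print can be combined into a single awk call (the demo read_count.csv here is rebuilt inline from the post's data):

```shell
# Same lookup as the two-command pipeline above, in one awk call.
cd "$(mktemp -d)"
printf 'SampleID Read_counts\nERR260136 25636740\n' > read_count.csv
awk '/ERR260136/ { print $2 }' read_count.csv    # prints 25636740
```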

Can you please show me some way? I am stuck.

Is the indentation in your example of one of the 20 files correct, or is it supposed to be a fairly simple tabular format? Your extension implies CSV, but it doesn't appear to be one.

I am also going to strongly advise you not to do this in bash as you requested, since bash does not support floating-point arithmetic without a lot of complicated fudging. See here for an example of why: A: assigning the values in matrix in bash
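A quick illustration of the difference (nothing project-specific here):

```shell
# bash arithmetic is integer-only; awk handles floating point natively
echo $(( 3 / 2 ))               # prints 1 (integer division truncates)
awk 'BEGIN { print 3 / 2 }'     # prints 1.5
# echo $(( 0.445035 * 2 )) would be a syntax error in bash
```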

Yes, it is supposed to be a .csv file only. Actually, I modified it so that it would look like a table here in the post. It should be like:

# Gene Family   ERR260136_Abundance-RPKs
UNMAPPED    0.445035
UniRef90_A0A015P9C8 0.00080211
UniRef90_A0A015P9C8|g__Bacteroides.s__Bacteroides_fragilis  0.00080211
UniRef90_A5ZYU5 0.000787149
UniRef90_A5ZYU5|g__Blautia.s__Blautia_obeum 0.000787149
UniRef90_A0A0E1X896 0.000573095
UniRef90_A0A0E1X896|g__Blautia.s__Blautia_obeum 0.000573095

That is still not a CSV; you need to be clearer about what exactly these files are.

Assuming both read_count.csv and all the genefamilies.csv files are tab-separated files containing 2 columns (the exact format of read_count.csv doesn't really matter), and that awk's default number formatting is enough for you:

for file in *.genefamilies.csv; do
  awk -v m="$(grep -F -w "${file/.genefamilies.csv/}" read_count.csv | awk '{print $2}')" \
    'BEGIN { FS = OFS = "\t" } NR > 1 { $2 *= m } { print }' "$file" > "${file/.csv/.multiplied.csv}"
done

grep extracts the multiplying factor, and the main awk loops through each line, multiplying the second column, for each existing *.genefamilies.csv file.
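An alternative sketch, if you'd rather avoid one grep per file: load read_count.csv into an awk array first (while NR==FNR), then scale the data file in the same awk invocation. The demo data below is made up to match the post's layout:

```shell
# Sketch only: joins the constants file and one data file in a single awk pass.
cd "$(mktemp -d)"
printf 'SampleID Read_counts\nERR260136 25636740\n' > read_count.csv
printf '# Gene Family\tERR260136_Abundance-RPKs\nUNMAPPED\t0.445035\n' \
    > ERR260136.genefamilies.csv

for file in *.genefamilies.csv; do
  awk -v id="${file%%.*}" 'BEGIN { OFS = "\t"; CONVFMT = "%.2f" }
    NR == FNR { m[$1] = $2; next }   # first file: SampleID -> Read_counts
    FNR == 1  { print; next }        # second file: keep the header line
    { $2 *= m[id]; print }
  ' read_count.csv "$file" > "${file/.csv/.multiplied.csv}"
done
cat ERR260136.genefamilies.multiplied.csv
```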

Joe 21k

The following Python approach will work, with a few caveats:

  1. The use of regex to replace the number is purely because I don't know how fixed your output format is.
  2. You may want to run this code on one file at a time instead of all of them, so that you can output each file to a new file (or add a file write to the code)
  3. I'm assuming all your values are 0.00123. If you need to account for integers etc too, then the regex needs to change.

import sys
import os
import re

# Read the readcounts file and store a dictionary of Filename: multiplier

readcounts = {}
with open(sys.argv[1], 'r') as rc:
    next(rc)  # skip the header
    for line in rc:
        (key, val) = line.split()
        readcounts[key] = int(val)

# Read the file(s)

for file in sys.argv[2:]:
    multiplier = readcounts[os.path.basename(file).split(os.extsep)[0]]  # look up the relevant multiplier from the dict (basename guards against directory paths)
    with open(file, 'r') as fh:
        for line in fh:
            print(re.sub(r'\d+\.\d*',
                         lambda m: str(float(m.group(0)) * multiplier),
                         line).strip("\n"))

Usage:

$ python scriptname.py readcounts.csv /path/to/files_to_multiply/*.ext

Inputs:

cat ERR260136.genefamilies.csv

 #Gene Family                                                   ERR260136_Abundance-RPKs
    UNMAPPED                                                            0.445035
    UniRef90_A0A015P9C8                                                 0.00080211
    UniRef90_A0A015P9C8|g__Bacteroides.s__Bacteroides_fragilis          0.00080211
    UniRef90_A5ZYU5                                                     0.000787149

cat ERR260140.genefamilies.csv (some sample data I made up)

#Gene Family                                                   ERR260136_Abundance-RPKs
    UNMAPPED                                                            0.1239710902
    UniRef90_A0A015P9C8                                                 0.000812739
    UniRef90_A0A015P9C8|g__Bacteroides.s__Bacteroides_fragilis          0.0031
    UniRef90_AYCHAYB                                                    0.0007871123123131

Output:

 python multiplying.py readcounts.csv *.genefamilies.csv
 #Gene Family                                                   ERR260136_Abundance-RPKs
    UNMAPPED                                                            11409246.585900001
    UniRef90_A0A015P9C8                                                 20563.485521399998
    UniRef90_A0A015P9C8|g__Bacteroides.s__Bacteroides_fragilis          20563.485521399998
    UniRef90_A5ZYU5                                                     20179.93425426
 #Gene Family                                                   ERR260136_Abundance-RPKs
    UNMAPPED                                                            2376039.3365760553
    UniRef90_A0A015P9C8                                                 15577.017442164
    UniRef90_A0A015P9C8|g__Bacteroides.s__Bacteroides_fragilis          59414.8356
    UniRef90_AYCHAYB                                                    15085.85439832861

Sir, I am not familiar with programming, so I am not sure what you mean by "I don't know how fixed your output format is". And regarding point 3,

I'm assuming all your values are 0.00123. If you need to account for integers etc too, then the regex needs to change.

the second column is relative abundance data, sum-normalized to 1.00. Therefore, all of the values are supposed to be less than 1 and contain decimal points.

I mean that I have attempted to make the file output look the same as the template you provided, but you are still providing conflicting information by calling these CSV files when nothing you have shown us so far is comma-separated.
