Question

How to merge two csv tables

0

Entering edit mode

7.3 years ago

cristina_sabiers ▴ 110

I try to learn by code how to merge tables, I have 1.csv, 2 csv

PART OF 1.CSV:

 GENE1  MAP DESCRIPTION
 5-HT3C2    3q27.1  5-hydroxytryptamine receptor 3E pseudogene
 A1BG   19q13.43    alpha-1-B glycoprotein
 A1BG-AS1   19q13.43    A1BG antisense RNA 1
 A1CF   10q11.23    APOBEC1 complementation factor
 A2M    12p13.31    alpha-2-macroglobulin
 A2M-AS1    12p13.31    A2M antisense RNA 1 (head to head)
 A2ML1  12p13.31    alpha-2-macroglobulin like 1

Part of 2.csv

 GENE   CHROM   POS
 AADACL2-AS1    chr3    151545320
 AADAT  chr4    170999757
 ABCA10 chr17   67178785
 ABCA12 chr2    215884092
 ABCA2  chr9    139904734
 ABCA4  chr1    94528499
 ABCA5  chr17   67304275
 ABCA6  chr17   67096772

I try to get a csv file like

GENE CHROM POS MAP DESCRIPTION where common genes are merged

I use this code:

 import csv

 diseases = {}

 # Load the disease file in memory
 with open('1.csv', 'rb') as dfile:
  reader = csv.reader(dfile)
  next(reader, None)
  # Skip the header
  dfile.next()
   for GENE1, MAP, DESCRIPTION in dfile:
       diseases[GENE1] = (MAP, DESCRIPTION)


      with open('2.csv', 'rb') as idfile:
         reader = csv.reader(idfile)
         next(reader, None)
         # Skip the header
          idfile.next()
          for GENE, CHROM, POS in idfile:
          if GENE in diseases:
              output.writerow((GENE, CHROM, POS) + diseases[GENE1])

But I get this error message

     python TEST.py
     Traceback (most recent call last):
      File "TEST.py", line 11, in <module>
      for GENE1, MAP, DESCRIPTION in dfile:
      ValueError: too many values to unpack

What I do wrong?...Im not a programmer so is pretty hard for me try to learn all of this. If is another easy way to do...

Thanks!!!

csv • 3.7k views

ADD COMMENT • link updated 7.3 years ago by shenwei356 8.4k • written 7.3 years ago by cristina_sabiers ▴ 110

1

Entering edit mode

The lines/rows in dfile are not what you think they are. Try to just print them out to get a better idea:

for row in dfile:
    print(row)

ADD REPLY • link 7.3 years ago by WouterDeCoster 47k

0

Entering edit mode

ok, let me see if i do

     import csv

     # Load the disease file in memory
      with open('1.csv', 'rb') as dfile:
       reader = csv.reader(dfile)
       next(reader, None)
      # Skip the header
       dfile.next()
       for row in dfile:
          print(row)

I get my three rows like in csv file

 .............
 ZYG11B 1p32.3  zyg-11 family member B, cell cycle regulator

 ZYX    7q34    zyxin

 ZYXP1  8q24.23 zyxin pseudogene 1

 ZZEF1  17p13.2 zinc finger ZZ-type and EF-hand domain containing 1

 ZZZ3   1p31.1  zinc finger ZZ-type containing 3

ADD REPLY • link 7.3 years ago by cristina_sabiers ▴ 110

1

Entering edit mode

What's the len() of row? I'm not sure if you should explicitly set the field delimiter for csv.reader()

e.g.

for row in dfile:
    if not len(row) == 3:
        print("Unexpected length {} for row!\n{}".format(len(row), row))

ADD REPLY • link 7.3 years ago by WouterDeCoster 47k

0

Entering edit mode

ok, now I think understand what you mean, is because I have like this:

ZSCAN30 18q12.2 zinc finger and SCAN domain containing 30

So the csv takes as 9 rows instead 3 right? The strange thing is on excel I see it just as three rows...

I converted my files from xls renamed to csv...probably I did wrong right?

The lenght of each row is about 59.000 fields

Thanks for the help!

ADD REPLY • link 7.3 years ago by cristina_sabiers ▴ 110

1

Entering edit mode

You are mixing up rows and columns.
It can be that csv.reader() takes both spaces and tabs as delimiter, while Excel only took the tabs. If you set the delimiter explicitly, this might be solved. But having no access to your data I can only guess and suggest you what to look at.

ADD REPLY • link 7.3 years ago by WouterDeCoster 47k

0

Entering edit mode

you are right I mess up words... ^^"

Can I send you the two files? if you can help me I would really appreciate, I just want to learn how to merge files, so I can work easily adding information on my csv files.

ADD REPLY • link 7.3 years ago by cristina_sabiers ▴ 110

1

Entering edit mode

Have you tried setting the delimiter in csv reader explicitly to tab?
What was the result of my code in C: How to merge two csv tables ?

ADD REPLY • link 7.3 years ago by WouterDeCoster 47k

0

Entering edit mode

   File "TEST3.py", line 11
    if not len(row) == 3:
    ^
    IndentationError: expected an indented block

Got this...no idea what means I m googeling

ADD REPLY • link 7.3 years ago by cristina_sabiers ▴ 110

0

Entering edit mode

That means your code block indentation is not right, i.e. that your probably have this:

for row in dfile:
if not len(row) == 3:
        print("Unexpected length {} for row!\n{}".format(len(row), row))

instead of this:

for row in dfile:
    if not len(row) == 3:
        print("Unexpected length {} for row!\n{}".format(len(row), row))

ADD REPLY • link 7.3 years ago by WouterDeCoster 47k

score 3 · Answer 1 · 2017-01-14

3

Entering edit mode

7.3 years ago

shenwei356 8.4k

Although you want to solve this by Python script, there are some CSV/TSV tools can easily achieve this, including csvtk, csvkit, xcv and miller.

Here's the solution using subcommand join (usage) of csvtk:

Install using conda or directly download the binaries:

conda install csvtk

Join 1.csv to 2.csv (they are tab-separated actually, so flag -t/--tabs is on):

csvtk --tabs join --fields 'GENE;GENE1' --keep-unmatched  2.csv  1.csv

ADD COMMENT • link 7.3 years ago by shenwei356 8.4k

0

Entering edit mode

Thanks a lot!

I used csvtk, works like charm :) . But I have one question, this time a tried with two small csv tables.

      GENE  MAP DESCRIPTION
      A1BG  19q13.43    alpha
      A1BG  19q13.43    antisense
      A1CF  10q11.23    complementation
      A2M   12p13.31    alpha
      A2M   12p13.31    A2Mantisense
      A2ML1 12p13.31    alpha

      GENE  CHROM   POS
      A1BG  chr3    151545320
     AADAT  chr4    170999757
     ABCA10 chr17   67178785
     ABCA12 chr2    215884092
     ABCA2  chr9    139904734
     ABCA4  chr1    94528499

The result was:

    GENE    CHROM   POS MAP DESCRIPTION
    A1BG    chr3    151545320   19q13.43    antisense
    AADAT   chr4    170999757       
    ABCA10  chr17   67178785        
    ABCA12  chr2    215884092       
    ABCA2   chr9    139904734       
    ABCA4   chr1    94528499

As you can see A1BG has two variants, I would like to make appear both, how would be the code for that case?

Thank you so much!

ADD REPLY • link 7.3 years ago by cristina_sabiers ▴ 110

1

Entering edit mode

Sorry, it does not support this feature. Since more than one user need this, I'll add this soon.

ADD REPLY • link 7.3 years ago by shenwei356 8.4k

0

Entering edit mode

oh! Thanks! Keep us updated!!! :))))

ADD REPLY • link 7.3 years ago by cristina_sabiers ▴ 110

1

Entering edit mode

omg, I figure out before falling asleep! just change the order of files:

csvtk --tabs join --fields 1 --keep-unmatched  1.csv  2.csv

and use csvtk -t cut -f 1,4,5,2,3 to reorder columns.

good night

ADD REPLY • link 7.3 years ago by shenwei356 8.4k

0

Entering edit mode

thank you so much!

Have a good sleep!!! ^^

ADD REPLY • link 7.3 years ago by cristina_sabiers ▴ 110

1

Entering edit mode

updated: v0.4.5

ADD REPLY • link 7.3 years ago by shenwei356 8.4k