Question: (Closed) data file format
ahmedakhokhar (110), Belgium, wrote 3.8 years ago:

I am working with a tab-separated file:

A    B    C    D
a    d    ii   domain
a    d    g    domain
a    h    g    domain
a    i    k    motif
c    i    k    motif
c    g    ii   motif
v    g    p    domain

Question: I want to count each entry in the first column, together with all the entries related to it in the second, third and fourth columns, like this:

a 4 d 2 h 1 i 1 ii 1 k 1 domain 3 motif 1

c 2 i 1 g 1 k 1 ii 1 motif 2 

v 1 g 1 p 1 motif 1

I am trying to process this data with Python pandas using these commands:

df = pd.read_csv('file.txt', delimiter= '\t', names = ['A', 'B', 'C', 'D']) 

df1.groupby(['a', 'c', 'd', 'e']).count()

but it does not return the desired results.

Any help would be appreciated, thanks.

pandas python • 931 views
Modified 3.8 years ago by Steven Lakin (1.5k) • written 3.8 years ago by ahmedakhokhar (110)

Hello ahmedakhokhar!

We believe that this post does not fit the main topic of this site.

This is not a bioinformatics question. Please note: This is your second time posting such a question.

For this reason we have closed your question. This allows us to keep the site focused on the topics that the community can help with.

If you disagree, please tell us why in a reply below; we'll be happy to talk about it.

Cheers!

Written 3.8 years ago by RamRS (26k)

Steven Lakin (1.5k), Fort Collins, CO, USA, wrote 3.8 years ago:

This code is functional, but neither pretty nor optimized for any purpose.

Counter stores counts of items. You can keep one Counter per first-column value in a dictionary, using the value of the first column as the key:

from collections import Counter
import sys

answer = {}  # first-column value -> Counter of every value seen on its rows

with open("path/to/file.txt", "r") as f:
    data = f.read().split("\n")
    for line in data[1:]:  # skip the header row
        columns = line.split()
        if columns:  # ignore blank lines
            # Key on the whole first field; line[0] would only be its first character.
            answer.setdefault(columns[0], Counter()).update(columns)

for entry in answer.values():
    for name, count in entry.items():
        sys.stdout.write("{} {} ".format(name, count))
    sys.stdout.write('\n')  # one output line per first-column value

Output is like so (the order of the items on each line may vary with the Python version):

a 4 domain 3 motif 1 d 2 g 2 i 1 h 1 k 1 ii 1
c 2 motif 2 g 1 i 1 k 1 ii 1
p 1 domain 1 g 1 v 1

I'd be interested to see a 100% pandas answer to this though.

Written 3.8 years ago by Steven Lakin (1.5k)

@Steven Lakin: Thanks for your answer. It's very close to what I'm looking for, but the results are not in the desired format; they appear as:

a 4

domain 3

motif 1

d 2

g 2

i 1

h 1

k 1

ii 1

c 2

motif 2

g 1

i 1

k 1

ii 1

Any suggestions for getting the results in the format shown in the initial question?

Modified 3.8 years ago • written 3.8 years ago by ahmedakhokhar (110)

That's odd. Did you use sys.stdout.write? Print will add a newline by default, but stdout write shouldn't. Make sure the indents are correct as in the above code, since it should only add a newline after each outer loop.
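To illustrate the difference, here is a minimal sketch. The helper names and the sample counts are made up for illustration and are not part of the code above. print ends every call with a newline unless its end argument is overridden, whereas a stream's write method emits only the string it is given.

```python
import io
import sys

ITEMS = [("a", 4), ("domain", 3), ("motif", 1)]  # hypothetical sample counts


def emit_with_print(items, out):
    # print() appends "\n" after every call, so each pair lands on its own line.
    for name, count in items:
        print("{} {}".format(name, count), file=out)


def emit_with_write(items, out):
    # write() adds nothing, so all pairs stay on one line until the explicit "\n".
    for name, count in items:
        out.write("{} {} ".format(name, count))
    out.write("\n")


if __name__ == "__main__":
    emit_with_print(ITEMS, sys.stdout)  # three lines
    emit_with_write(ITEMS, sys.stdout)  # one line
```

Passing end=" " to print would also keep the pairs on one line, which is an alternative to switching to write.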

Written 3.8 years ago by Steven Lakin (1.5k)

Thanks Steven, it works as you suggested, but there is still a slight difference in the output format. Instead of counting the entries separately for each entry in the first column, like:

a 4 domain 3 motif 1 d 2 g 2 i 1 h 1 k 1 ii 1

c 2 motif 2 g 1 i 1 k 1 ii 1

p 1 domain 1 g 1 v 1

it gives something like this:

a 4 domain 3 motif 1 d 2 g 2 h 1 k 1k 1 ii 1 c 2 motif 2 g 1 g 1 i 1 i 1 ii 1

p 1 domain 1 v 1

What I want is to count each entry in the first column, and to count the entries in the subsequent columns that correspond to each first-column entry, as described in the initial question. Any suggestions?

Modified 3.8 years ago • written 3.8 years ago by ahmedakhokhar (110)
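One possible way to get the column-by-column layout requested in the question is to keep a separate Counter for each column under every first-column value, instead of a single Counter per value. The following is only a sketch under stated assumptions: the function names and the inline SAMPLE string are hypothetical, the input is split on whitespace, and the output order relies on dictionaries preserving insertion order (Python 3.7 or later). Note that, for the sample data, this prints g 2 for a and domain 1 for v, which differs slightly from the expected output written in the question.

```python
from collections import Counter

# Hypothetical inline copy of the sample table from the question.
SAMPLE = """A\tB\tC\tD
a\td\tii\tdomain
a\td\tg\tdomain
a\th\tg\tdomain
a\ti\tk\tmotif
c\ti\tk\tmotif
c\tg\tii\tmotif
v\tg\tp\tdomain
"""


def count_by_first_column(text):
    """Map each first-column value to a list holding one Counter per column."""
    groups = {}
    for line in text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if not fields:  # ignore blank lines
            continue
        key = fields[0]
        if key not in groups:
            groups[key] = [Counter() for _ in fields]
        # fields[0] is the key itself, so the first Counter holds the row count.
        for counter, value in zip(groups[key], fields):
            counter[value] += 1
    return groups


def format_groups(groups):
    """Build one 'value count value count ...' line per first-column value."""
    lines = []
    for counters in groups.values():
        parts = []
        for counter in counters:  # columns in their original order
            for value, n in counter.items():
                parts.append("{} {}".format(value, n))
        lines.append(" ".join(parts))
    return lines


for line in format_groups(count_by_first_column(SAMPLE)):
    print(line)
```

Because the counters are kept per column, a value such as g that occurs in both the second and third columns is counted separately in each, rather than being merged into one total.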
The thread is closed. No new answers may be added.
