How can I write a loop to count genes per query id?
8.7 years ago
tcf.hcdg ▴ 70

I have a data frame in R with 14 columns and 4.4 million rows.

Column 1 holds the query id and column 4 holds the gene name.

I want to build a data frame that shows which genes, and how many of each, correspond to each query id.

I have 44K different query ids, and each query has at most ~100 gene hits.

CSAI_contig04661_6     sp     O65396     GCST      ARATH     86.03     408     56      1      72          478     1       408     0.0e+00     738.0
CSAI_contig04661_6     sp     Q681Y3     Y1099     ARATH     22.55     337     244     10     140         474     103     424     8.0e-09     56.6
CSAI_contig04661_6     sp     Q9FLR5     SMC6A     ARATH     24.27     103     66      3      04. Jun     249     342     441     4.6e+00     28. Sep
CSAI_contig04661_6     sp     Q9LQI7     GCST      ARATH     24.28     74      47      2      17. Aug     300     31      100     8.1e+00     27. Jul
CSAI_contig04661_6     sp     P56795     RK22      ARATH     28.95     76      49      4      11. Mrz     509     15      87      8.4e+00     27. Mrz
CSAI_isotig00001_4     sp     Q8VZE4     PP299     ARATH     29.63     108     55      5      31. Jul     307     10      109     1.6e+00     30. Apr

I am interested in this type of output.

CSAI_contig04661_6                GCST       2
                                  Y1099      1
                                  SMC6A      1
                                  RK22       1

How can I write a loop that walks down column 1 for as long as the query id stays the same (6 in this example), then looks at column 4 to find which genes are present and counts each one that occurs more than once? In this example, GCST occurs 2 times for the first query.


Did you try grep in Linux?


I tried it in R with the following:

library(dplyr)
group_by(t38kbat, query_id, gene) %>% summarise(n())

I received the output in this form:

query_id  gene n()
1  CSAI_contig04661_6  GCST   3
2  CSAI_contig04661_6 SMC6A   1
3  CSAI_contig04661_6 Y1099   1
4  CSAI_isotig00001_4 AMSH3   1
5  CSAI_isotig00001_4 C98A9   1
6  CSAI_isotig00001_4 MOB2A   1
7  CSAI_isotig00001_4 PP299   1
8  CSAI_isotig00001_4  QORL   1
9  CSAI_isotig00001_4 WAKLP   1
10 CSAI_isotig00004_3  GCST   1
..                ...   ... ...

I want to print each query id only once. For example:

CSAI_contig04661_6
                                  GCST     3
                                  SMC6A    1
                                  Y1099    1

CSAI_isotig00001_4
                                  AMSH3    1
                                  C98A9    1
                                  MOB2A    1
                                  PP299    1
                                  QORL     1
                                  WAKLP    1
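
One possible way to get this layout without a loop is to blank out the repeated query ids before printing. The following is only a sketch in base R: the `counts` data frame and its column names (`query_id`, `gene`, `n`) are assumptions made for illustration and stand in for the real summarised table. It also assumes the rows are already grouped by query id, as they are in the `summarise()` output.

```r
# Hypothetical counts table, shaped like the summarise() output shown above.
counts <- data.frame(
  query_id = c("CSAI_contig04661_6", "CSAI_contig04661_6", "CSAI_contig04661_6",
               "CSAI_isotig00001_4", "CSAI_isotig00001_4"),
  gene     = c("GCST", "SMC6A", "Y1099", "AMSH3", "C98A9"),
  n        = c(3L, 1L, 1L, 1L, 1L),
  stringsAsFactors = FALSE
)

# Work on a copy so the original table stays usable for further analysis.
shown <- counts

# Make sure the id column is character, not factor, before overwriting values.
shown$query_id <- as.character(shown$query_id)

# Keep the first occurrence of each query id and blank out the repeats.
shown$query_id[duplicated(shown$query_id)] <- ""

# Print without row numbers so each id appears once above its genes.
print(shown, row.names = FALSE)
```

This changes only how the table is displayed. For any further computation it is probably better to keep the original table with the query id on every row.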
8.7 years ago
glihm ▴ 660

Hi,

This can be a solution:

# A vector of ones, one per row, used to count occurrences.
count <- rep(1, nrow(data))

# A data frame with only the columns of interest.
df <- data.frame(geneID = data$geneID, geneName = data$geneName, count)

# aggregate() applies sum() to count within each geneID/geneName combination.
ag <- aggregate(count ~ ., data = df, FUN = sum)

Whenever you can (as far as possible!), avoid loops in R. ;)
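
To see what the aggregate() call above produces, here is a small sketch on the six sample rows from the question, reduced to the two columns that matter. The data frame name `hits` and the column names `query_id` and `gene` are assumptions for illustration, since the real column names are not given in the question.

```r
# The six sample rows from the question, keeping only query id and gene name.
hits <- data.frame(
  query_id = c(rep("CSAI_contig04661_6", 5), "CSAI_isotig00001_4"),
  gene     = c("GCST", "Y1099", "SMC6A", "GCST", "RK22", "PP299"),
  stringsAsFactors = FALSE
)

# Same idea as above: a column of ones, summed per (query_id, gene) pair.
hits$count <- 1
ag <- aggregate(count ~ query_id + gene, data = hits, FUN = sum)

# Sort so that all genes of one query id sit together, most frequent first.
ag <- ag[order(ag$query_id, -ag$count, ag$gene), ]
print(ag, row.names = FALSE)
```

On these rows, GCST should come out with a count of 2 for CSAI_contig04661_6 and every other gene with a count of 1, which matches the output asked for in the question.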
