Question

Manipulating gene file

8.6 years ago

M K ▴ 660

Dear Biostars,

I have a text file with many columns. The first column holds the repetitive DNA (repeat) names with their strand, and the remaining columns hold the names of the genes that share the same position as that repeat, as shown below. How can I rearrange this file so that the gene names are in the first column and the remaining columns list the repeats that share a position with each gene?

(A)n__- Dpp10   Xkr4    Mgat4a  Ikzf2   Zfp142  Kif1a   Tmcc2   Pou2f1  Pbx1 Fbxo28 Hhat    Gm26901 Snhg6   Snord87 Tram1   Trpa1   Tram2   Lman2l  Snord89 Tex30 Myo1b Pms1    Hsfy2   Clk1    Orc2
(A)n__+ Itpkb   Tfap2d  Nyap2   Sag Ccdc93 Rc3h1    Aim2    Esrrg   Rb1cc1 St18 AC121538.1  Tfap2b  Khdc1a  Khdc1c  Imp4    Cnnm4 Mstn mmu-mir-7681 Fzd7    Nop58   Gm11602 Apol7d  Bcs1l   Ttll4   Gm21972
(ACTG)n__-  Bsnd 2900026A02Rik  6330408A02Rik   Atp2a1 Rhbdf2
(ACTG)n__+  Bpifb3  Gm16215 Trmt112-ps2 Calu    Ghrhr   Lig1    Gm22535 Podnl1 Gm16217  Sdr9c7  Slit3 Fndc9
(AGCTG)n__+ Gm25033 Gm22121 Gm22617 Gm2274 Gal3st3
(AGGGGG)n__- Gm5532 Pbx3    Dgkz    Zbp1    Lrrc34  4930503B20Rik   Padi3 Crygn Cnot6l  Gm15498 Rarres2 Gm4604  n-R5s165 Gm3912 2810047C21Rik1  Gm3654  Gm20482 Zfp27   Slco2b1 Adam32  Gm16793 Slc22a14 Gm4779 Myocd   Kdm6b
(AGGGGG)n__+    Gm24901 Pik3c2b Frmd4a  Sox2ot  Gm24830 Tpt1-ps1    Abca4 Gm13032   Gm1673  Rhoh    Mafk    Grm7    mmu-mir-7668    Ppp2cb  8030474K03Rik Lama4 Ankrd36 Igtp Irgm2  Gm12949 Tmem256 Gm24877 Mllt6   Rian    Gm17309
(ATG)n__-   Gm14264 4930533B01Rik   Arhgef10l Gm22983 Svop  Gm7887  Cecr5   9630033F20Rik   Gm27013 Gm10396 Hpn Polg P2ry6 BC051019 Gm24581 Efnb2   Ubash3b Gm8907  Tmem30a Gm14570 Gm24622 Gm23122 Myf5 BC006965   Olfr331
(ATG)n__+ Aox2  Stradb  Eif2d   Tpr Igsf8   Sh2d3c  Ypel4   Chrm4 Gm26421   Slc24a3 Nsfl1c  Gm14270 Fgg Pias3   Zcchc11 Gm11876 Fam114a1 Lias   Gm25374 Sds Rasal1 Grid2ip  Ccdc132 Gpnmb   Gm2115

For example, the first gene in this file is Dpp10, and I want to find all the repetitive DNA names associated with it.

R • 1.1k views

Updated 19 months ago by Ram 43k • written 8.6 years ago by M K ▴ 660

Ram · Answer 1 · 2015-09-20

8.6 years ago

Pierre Lindenbaum 161k

Not R, but using awk and sqlite3:

tr "\t" " " < input.txt |\
tr -s " " | \
awk -F ' ' 'BEGIN{printf("create table if not exists T(x text,y text); begin transaction;\n");} { for(i=2;i<=NF;++i) printf("insert into T(x,y) values (\"%s\",\"%s\");\n",$1,$i);} END {printf("commit;\nselect T.y,group_concat(T.x) from T group by y;\n");}' |\
sqlite3 tmp.sqlite3 && rm tmp.sqlite3

Updated 19 months ago by Ram 43k • written 8.6 years ago by Pierre Lindenbaum 161k

Dear Pierre, thanks for your response, but I am not familiar with sqlite3. Is there a way to do this using only R?

Written 8.6 years ago by M K ▴ 660