Question: Combining two columns in a tab-delimited file where there is a zero in one
0
gravatar for jvire1
7 days ago by
jvire110
jvire110 wrote:

Hi all,

I am currently trying to fill in annotations from a BLASTP of the same database into rows that are 0 in the BLASTX. The columns look like this:

BLASTX     BLASTP
annot1     annot1
annot 2    0
0          annot3

In this example I want "annot3" to be moved under the BLASTX column, so that the column contains origional BLASTX annotations and where empty gets filled by the BLASTP annotation like so:

BLASTX     BLASTP
annot1     annot1
annot 2    0
annot3     annot3

The ultimate reason for this is so that I can make a list of identifiers for downstream purposes.

I have been trying to do this manually LibreOffice, but with 160,000 sequences I realized the only practical way would be to use awk or python scripting.

rna-seq • 121 views
ADD COMMENTlink modified 7 days ago by kashifalikhan00740 • written 7 days ago by jvire110

The ultimate reason for this is so that I can make a list of identifiers for downstream purposes.

Can you show an example of the annotations, to see delimiters and possible ids?

ADD REPLYlink written 7 days ago by st.ph.n1.6k

Sure, LibreOffice is frozen at the moment the two columns are TAB delimited and the annotations within are delimited with '^'. The identifier I need is the one before the first ^, however my plan was to take the new list of annotations and open in LibreOffice, specifying it as '^' delimited and then copying the first column (the protein ID) into a new list, and then use this script:

   with open('acc_list.txt', 'r') as f:
        acc = [line.strip() for line in f]

    with open('PlantTFDB_ALL_TF_pep.fas', 'r') as f:
        for line in f:
            if line.startswith(">"):
                if line.strip().split('>')[1].split(' ')[0] in acc:
                    print line.strip().split('|')[1]

To parse out the gene name.

Then I was going to use this script to get gene family counts:

    #!/usr/bin/env python
sourcefile = "gene_name.txt"

with open(sourcefile, "r") as germ:
    germ = [item.lower().replace("\n", "") for item in germ.readlines()]
for item in sorted(set(germ)):
    print item.title(), germ.count(item)

As soon LibreOffice stop freezing I'll post an example of the two columns. I was going to at first but it didn't look pretty because the annotations were wider than the page.

best, -j

ADD REPLYlink modified 7 days ago • written 7 days ago by jvire110
1

This is messy, and you're naming your list as the same as what you're opening your file as.

with open(sourcefile, "r") as germ:
    germ = [item.lower().replace("\n", "") for item in germ.readlines()]

Try this:

with open('gene_name.txt', 'r') as source:
       germ = [line.strip().lower() for line in source]
for i in set(germ):
        print i.title(), germ.count(i)
ADD REPLYlink written 7 days ago by st.ph.n1.6k

Thank you, that is much cleaner.

ADD REPLYlink written 7 days ago by jvire110

Finally LibreOffice unfroze (wish I could force it to use all 4 cpu)!

Here is an example of the columns:

PlantTFDB_ALL_TF_pep_BLASTX PlantTFDB_ALL_TF_pep_BLASTP
Zpz_sc04830.1.g00020.1.sm.mk^Zpz_sc04830.1.g00020.1.sm.mk^Q:488-228,H:5-93^44.94%ID^E:6e-15^.^. Zpz_sc04830.1.g00020.1.sm.mk^Zpz_sc04830.1.g00020.1.sm.mk^Q:9-95,H:5-93^44.94%ID^E:2e-17^.^.
Zpz_sc04830.1.g00020.1.sm.mk^Zpz_sc04830.1.g00020.1.sm.mk^Q:157-414,H:5-92^44.32%ID^E:9e-17^.^. Zpz_sc04830.1.g00020.1.sm.mk^Zpz_sc04830.1.g00020.1.sm.mk^Q:53-138,H:5-92^44.32%ID^E:8e-17^.^.
Zpz_sc04830.1.g00020.1.sm.mk^Zpz_sc04830.1.g00020.1.sm.mk^Q:46-306,H:5-93^44.94%ID^E:1e-16^.^.  Zpz_sc04830.1.g00020.1.sm.mk^Zpz_sc04830.1.g00020.1.sm.mk^Q:16-102,H:5-93^44.94%ID^E:2e-17^.^.
Zpz_sc04830.1.g00020.1.sm.mk^Zpz_sc04830.1.g00020.1.sm.mk^Q:567-307,H:5-93^44.94%ID^E:2e-15^.^. Zpz_sc04830.1.g00020.1.sm.mk^Zpz_sc04830.1.g00020.1.sm.mk^Q:125-211,H:5-93^44.94%ID^E:2e-16^.^.
Zpz_sc04830.1.g00020.1.sm.mk^Zpz_sc04830.1.g00020.1.sm.mk^Q:488-228,H:5-93^44.94%ID^E:1e-15^.^. Zpz_sc04830.1.g00020.1.sm.mk^Zpz_sc04830.1.g00020.1.sm.mk^Q:125-211,H:5-93^44.94%ID^E:2e-16^.^.
Zpz_sc04830.1.g00020.1.sm.mk^Zpz_sc04830.1.g00020.1.sm.mk^Q:488-228,H:5-93^44.94%ID^E:1e-15^.^. Zpz_sc04830.1.g00020.1.sm.mk^Zpz_sc04830.1.g00020.1.sm.mk^Q:125-211,H:5-93^44.94%ID^E:2e-16^.^.
Zpz_sc04830.1.g00020.1.sm.mk^Zpz_sc04830.1.g00020.1.sm.mk^Q:82-360,H:1-93^84.95%ID^E:5e-51^.^.  Zpz_sc04830.1.g00020.1.sm.mk^Zpz_sc04830.1.g00020.1.sm.mk^Q:1-93,H:1-93^84.95%ID^E:2e-53^.^.
Zpz_sc04830.1.g00020.1.sm.mk^Zpz_sc04830.1.g00020.1.sm.mk^Q:681-403,H:1-93^84.95%ID^E:1e-50^.^. Zpz_sc04830.1.g00020.1.sm.mk^Zpz_sc04830.1.g00020.1.sm.mk^Q:1-93,H:1-93^84.95%ID^E:2e-53^.^.
Zpz_sc04830.1.g00020.1.sm.mk^Zpz_sc04830.1.g00020.1.sm.mk^Q:1603-1881,H:1-93^83.87%ID^E:2e-46^.^.   Zpz_sc04830.1.g00020.1.sm.mk^Zpz_sc04830.1.g00020.1.sm.mk^Q:1-93,H:1-93^83.87%ID^E:2e-52^.^.
KZV22762.1^KZV22762.1^Q:824-1537,H:1151-1396^40.96%ID^E:3e-37^.^.   Zpz_sc04699.1.g00010.1.am.mk^Zpz_sc04699.1.g00010.1.am.mk^Q:69-199,H:1-128^54.14%ID^E:3e-43^.^.
KZV22762.1^KZV22762.1^Q:811-1524,H:1151-1396^40.96%ID^E:3e-37^.^.   Zpz_sc04699.1.g00010.1.am.mk^Zpz_sc04699.1.g00010.1.am.mk^Q:69-199,H:1-128^54.14%ID^E:3e-43^.^.
KZV22762.1^KZV22762.1^Q:721-1434,H:1151-1396^40.96%ID^E:3e-37^.^.   Zpz_sc04699.1.g00010.1.am.mk^Zpz_sc04699.1.g00010.1.am.mk^Q:69-199,H:1-128^54.14%ID^E:3e-43^.^.
Zpz_sc04699.1.g00010.1.am.mk^Zpz_sc04699.1.g00010.1.am.mk^Q:7-387,H:8-116^37.8%ID^E:3e-12^.^.   Zpz_sc04699.1.g00010.1.am.mk^Zpz_sc04699.1.g00010.1.am.mk^Q:3-129,H:8-116^37.01%ID^E:1e-14^.^.
Zpz_sc04699.1.g00010.1.am.mk^Zpz_sc04699.1.g00010.1.am.mk^Q:853-551,H:27-118^37.62%ID^E:5e-14^.^.   Zpz_sc04699.1.g00010.1.am.mk^Zpz_sc04699.1.g00010.1.am.mk^Q:3-103,H:27-118^37.62%ID^E:3e-15^.^.
0   Zpz_sc04699.1.g00010.1.am.mk^Zpz_sc04699.1.g00010.1.am.mk^Q:27-128,H:32-118^32.35%ID^E:7e-06^.^.
Zpz_sc04699.1.g00010.1.am.mk^Zpz_sc04699.1.g00010.1.am.mk^Q:703-311,H:1-126^83.97%ID^E:7e-64^.^.    Zpz_sc04699.1.g00010.1.am.mk^Zpz_sc04699.1.g00010.1.am.mk^Q:174-304,H:1-126^83.97%ID^E:6e-68^.^.
ADD REPLYlink modified 7 days ago • written 7 days ago by jvire110
2
gravatar for kashifalikhan007
7 days ago by
Cologne
kashifalikhan00740 wrote:

If both column are in different files, first convert both your file in txt. Then:

paste blastx.txt blastp.txt | awk '{if($1=="0")print$2"\t"$2; else print}' > ur_output.txt

If in same file:

awk '{if($1=="0")print$2"\t"$2; else print}' input.txt > ur_output.txt
ADD COMMENTlink written 7 days ago by kashifalikhan00740

Thank you so much for this one liner and to all those who helped. Combined these scripts have allowed me to achieve my goal:

Bhlh    533
Myb_Related 397
C2H2    355
Nac 352
C3H 337
Erf 297
Myb 265
Wrky    263
B3  249
Bzip    217
Far1    196
G2-Like 167
Gras    157
M-Type_Mads 134
Trihelix    134
Arf 128
Mikc_Mads   123
Hd-Zip  120
Lbd 115
Gata    91
ADD REPLYlink modified 7 days ago • written 7 days ago by jvire110
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1393 users visited in the last hour