Question

Combining two columns in a tab-delimited file where there is a zero in one

0

Entering edit mode

7.8 years ago

jvire1 ▴ 10

Hi all,

I am currently trying to fill in annotations from a BLASTP of the same database into rows that are 0 in the BLASTX. The columns look like this:

BLASTX     BLASTP
annot1     annot1
annot 2    0
0          annot3

In this example I want "annot3" to be moved under the BLASTX column, so that the column contains origional BLASTX annotations and where empty gets filled by the BLASTP annotation like so:

BLASTX     BLASTP
annot1     annot1
annot 2    0
annot3     annot3

The ultimate reason for this is so that I can make a list of identifiers for downstream purposes.

I have been trying to do this manually LibreOffice, but with 160,000 sequences I realized the only practical way would be to use awk or python scripting.

RNA-Seq • 2.3k views

ADD COMMENT • link updated 7.8 years ago by kashiff007 ★ 1.9k • written 7.8 years ago by jvire1 ▴ 10

0

Entering edit mode

The ultimate reason for this is so that I can make a list of identifiers for downstream purposes.

Can you show an example of the annotations, to see delimiters and possible ids?

ADD REPLY • link 7.8 years ago by st.ph.n ★ 2.7k

0

Entering edit mode

Sure, LibreOffice is frozen at the moment the two columns are TAB delimited and the annotations within are delimited with '^'. The identifier I need is the one before the first ^, however my plan was to take the new list of annotations and open in LibreOffice, specifying it as '^' delimited and then copying the first column (the protein ID) into a new list, and then use this script:

   with open('acc_list.txt', 'r') as f:
        acc = [line.strip() for line in f]

    with open('PlantTFDB_ALL_TF_pep.fas', 'r') as f:
        for line in f:
            if line.startswith(">"):
                if line.strip().split('>')[1].split(' ')[0] in acc:
                    print line.strip().split('|')[1]

To parse out the gene name.

Then I was going to use this script to get gene family counts:

    #!/usr/bin/env python
sourcefile = "gene_name.txt"

with open(sourcefile, "r") as germ:
    germ = [item.lower().replace("\n", "") for item in germ.readlines()]
for item in sorted(set(germ)):
    print item.title(), germ.count(item)

As soon LibreOffice stop freezing I'll post an example of the two columns. I was going to at first but it didn't look pretty because the annotations were wider than the page.

best, -j

ADD REPLY • link 7.8 years ago by jvire1 ▴ 10

1

Entering edit mode

This is messy, and you're naming your list as the same as what you're opening your file as.

with open(sourcefile, "r") as germ:
    germ = [item.lower().replace("\n", "") for item in germ.readlines()]

Try this:

with open('gene_name.txt', 'r') as source:
       germ = [line.strip().lower() for line in source]
for i in set(germ):
        print i.title(), germ.count(i)

ADD REPLY • link 7.8 years ago by st.ph.n ★ 2.7k

0

Entering edit mode

Thank you, that is much cleaner.

ADD REPLY • link 7.8 years ago by jvire1 ▴ 10

0

Entering edit mode

Finally LibreOffice unfroze (wish I could force it to use all 4 cpu)!

Here is an example of the columns:

PlantTFDB_ALL_TF_pep_BLASTX PlantTFDB_ALL_TF_pep_BLASTP
Zpz_sc04830.1.g00020.1.sm.mk^Zpz_sc04830.1.g00020.1.sm.mk^Q:488-228,H:5-93^44.94%ID^E:6e-15^.^. Zpz_sc04830.1.g00020.1.sm.mk^Zpz_sc04830.1.g00020.1.sm.mk^Q:9-95,H:5-93^44.94%ID^E:2e-17^.^.
Zpz_sc04830.1.g00020.1.sm.mk^Zpz_sc04830.1.g00020.1.sm.mk^Q:157-414,H:5-92^44.32%ID^E:9e-17^.^. Zpz_sc04830.1.g00020.1.sm.mk^Zpz_sc04830.1.g00020.1.sm.mk^Q:53-138,H:5-92^44.32%ID^E:8e-17^.^.
Zpz_sc04830.1.g00020.1.sm.mk^Zpz_sc04830.1.g00020.1.sm.mk^Q:46-306,H:5-93^44.94%ID^E:1e-16^.^.  Zpz_sc04830.1.g00020.1.sm.mk^Zpz_sc04830.1.g00020.1.sm.mk^Q:16-102,H:5-93^44.94%ID^E:2e-17^.^.
Zpz_sc04830.1.g00020.1.sm.mk^Zpz_sc04830.1.g00020.1.sm.mk^Q:567-307,H:5-93^44.94%ID^E:2e-15^.^. Zpz_sc04830.1.g00020.1.sm.mk^Zpz_sc04830.1.g00020.1.sm.mk^Q:125-211,H:5-93^44.94%ID^E:2e-16^.^.
Zpz_sc04830.1.g00020.1.sm.mk^Zpz_sc04830.1.g00020.1.sm.mk^Q:488-228,H:5-93^44.94%ID^E:1e-15^.^. Zpz_sc04830.1.g00020.1.sm.mk^Zpz_sc04830.1.g00020.1.sm.mk^Q:125-211,H:5-93^44.94%ID^E:2e-16^.^.
Zpz_sc04830.1.g00020.1.sm.mk^Zpz_sc04830.1.g00020.1.sm.mk^Q:488-228,H:5-93^44.94%ID^E:1e-15^.^. Zpz_sc04830.1.g00020.1.sm.mk^Zpz_sc04830.1.g00020.1.sm.mk^Q:125-211,H:5-93^44.94%ID^E:2e-16^.^.
Zpz_sc04830.1.g00020.1.sm.mk^Zpz_sc04830.1.g00020.1.sm.mk^Q:82-360,H:1-93^84.95%ID^E:5e-51^.^.  Zpz_sc04830.1.g00020.1.sm.mk^Zpz_sc04830.1.g00020.1.sm.mk^Q:1-93,H:1-93^84.95%ID^E:2e-53^.^.
Zpz_sc04830.1.g00020.1.sm.mk^Zpz_sc04830.1.g00020.1.sm.mk^Q:681-403,H:1-93^84.95%ID^E:1e-50^.^. Zpz_sc04830.1.g00020.1.sm.mk^Zpz_sc04830.1.g00020.1.sm.mk^Q:1-93,H:1-93^84.95%ID^E:2e-53^.^.
Zpz_sc04830.1.g00020.1.sm.mk^Zpz_sc04830.1.g00020.1.sm.mk^Q:1603-1881,H:1-93^83.87%ID^E:2e-46^.^.   Zpz_sc04830.1.g00020.1.sm.mk^Zpz_sc04830.1.g00020.1.sm.mk^Q:1-93,H:1-93^83.87%ID^E:2e-52^.^.
KZV22762.1^KZV22762.1^Q:824-1537,H:1151-1396^40.96%ID^E:3e-37^.^.   Zpz_sc04699.1.g00010.1.am.mk^Zpz_sc04699.1.g00010.1.am.mk^Q:69-199,H:1-128^54.14%ID^E:3e-43^.^.
KZV22762.1^KZV22762.1^Q:811-1524,H:1151-1396^40.96%ID^E:3e-37^.^.   Zpz_sc04699.1.g00010.1.am.mk^Zpz_sc04699.1.g00010.1.am.mk^Q:69-199,H:1-128^54.14%ID^E:3e-43^.^.
KZV22762.1^KZV22762.1^Q:721-1434,H:1151-1396^40.96%ID^E:3e-37^.^.   Zpz_sc04699.1.g00010.1.am.mk^Zpz_sc04699.1.g00010.1.am.mk^Q:69-199,H:1-128^54.14%ID^E:3e-43^.^.
Zpz_sc04699.1.g00010.1.am.mk^Zpz_sc04699.1.g00010.1.am.mk^Q:7-387,H:8-116^37.8%ID^E:3e-12^.^.   Zpz_sc04699.1.g00010.1.am.mk^Zpz_sc04699.1.g00010.1.am.mk^Q:3-129,H:8-116^37.01%ID^E:1e-14^.^.
Zpz_sc04699.1.g00010.1.am.mk^Zpz_sc04699.1.g00010.1.am.mk^Q:853-551,H:27-118^37.62%ID^E:5e-14^.^.   Zpz_sc04699.1.g00010.1.am.mk^Zpz_sc04699.1.g00010.1.am.mk^Q:3-103,H:27-118^37.62%ID^E:3e-15^.^.
0   Zpz_sc04699.1.g00010.1.am.mk^Zpz_sc04699.1.g00010.1.am.mk^Q:27-128,H:32-118^32.35%ID^E:7e-06^.^.
Zpz_sc04699.1.g00010.1.am.mk^Zpz_sc04699.1.g00010.1.am.mk^Q:703-311,H:1-126^83.97%ID^E:7e-64^.^.    Zpz_sc04699.1.g00010.1.am.mk^Zpz_sc04699.1.g00010.1.am.mk^Q:174-304,H:1-126^83.97%ID^E:6e-68^.^.

ADD REPLY • link 7.8 years ago by jvire1 ▴ 10

score 3 · Accepted Answer · 2017-09-13

3

Entering edit mode

7.8 years ago

kashiff007 ★ 1.9k

If both column are in different files, first convert both your file in txt. Then:

paste blastx.txt blastp.txt | awk '{if($1=="0")print$2"\t"$2; else print}' > ur_output.txt

If in same file:

awk '{if($1=="0")print$2"\t"$2; else print}' input.txt > ur_output.txt

ADD COMMENT • link 7.8 years ago by kashiff007 ★ 1.9k

0

Entering edit mode

Thank you so much for this one liner and to all those who helped. Combined these scripts have allowed me to achieve my goal:

Bhlh    533
Myb_Related 397
C2H2    355
Nac 352
C3H 337
Erf 297
Myb 265
Wrky    263
B3  249
Bzip    217
Far1    196
G2-Like 167
Gras    157
M-Type_Mads 134
Trihelix    134
Arf 128
Mikc_Mads   123
Hd-Zip  120
Lbd 115
Gata    91

ADD REPLY • link 7.8 years ago by jvire1 ▴ 10