Question: How to match genes of one file to another but in multiple columns
0
gravatar for bweil2
4 months ago by
bweil20
bweil20 wrote:

I want to see which genes in archloro.txt match both lines of ArabidopsisParalogue.txt. I tried grep -f but that would just find the genes in both files have in common and not 2 genes that are in the same line. Any ideas on how to do this?

Basically i want to see how many and which rows in ArarbidopsisParalogue.txt match with 2 genes from archloro.txt.

head archloro.txt

AT1G01080
AT1G01090
AT1G01100
AT1G01110
AT1G01250
AT1G01550
AT1G01690
AT1G01730
AT1G01860
AT1G01950
AT1G02060
AT1G21400
AT1G24180
AT1G59900
AT1G17480
AT5G09300
AT5G34780
AT5G50250

head ArabidopsisParalogue.txt

AT1G01080       AT5G50250
AT1G01090       AT1G21400
AT1G01090       AT1G24180
AT1G01090       AT1G59900
AT1G01090       AT5G09300
AT1G01090       AT5G34780
AT1G01100       AT4G00810
AT1G01100       AT5G24510
AT1G01100       AT5G47700
AT1G01110       AT1G17480

Expected output

AT1G01080      AT5G50250
AT1G01090      AT1G21400
AT1G01090      AT1G24180
AT1G01090      AT1G59900
AT1G01090      AT5G09300
AT1G01090      AT5G34780
AT1G01100      AT1G17480
alignment sequence gene genome • 692 views
ADD COMMENTlink modified 3 months ago • written 4 months ago by bweil20
2

It's not clear to me from your question and example data how a gene ID from archloro.txt can match both IDs in one row of ArabidopsisParalogue.txt, when the IDs in the latter file are not identical. Could you clarify the question with some expected output?

ADD REPLYlink written 4 months ago by Neilfws48k

due to formatting issues, I could include head ArabidopsisParalogue.txt. There are two columns of genes.

ADD REPLYlink written 4 months ago by bweil20

These commands are not working for me!

ADD REPLYlink written 4 months ago by bweil20
1

Unless you show data from both files, one can't give any suggestions.

BTW, please move this answer to comment as it is not an answer.

ADD REPLYlink written 4 months ago by venu5.4k

The last line of your expected output is incorrect because AT1G17480 is not present in archloro.txt (in the data that you've shown)

ADD REPLYlink written 3 months ago by Kevin Blighe24k

and AT5G34780 is present twice in first file.

ADD REPLYlink written 3 months ago by cpad01127.7k

Yep (and good to see you again).

Your solution also works, it seems. bweil2, if you could devote a few moments more to looking at the other 2 new answers to see if they work (and Accept them also, if so), then that would be great.

ADD REPLYlink written 3 months ago by Kevin Blighe24k
1

thanks and you seem to be in good spirits always @kevin

ADD REPLYlink written 3 months ago by cpad01127.7k
4
gravatar for Damian Kao
3 months ago by
Damian Kao14k
USA
Damian Kao14k wrote:

You can use this python script:

match = set(open('archloro.txt').read().strip().split('\n'))
for line in open('ArabidopsisParalogue.txt'):
    a,b = line.strip().split('\t')
    if a in match and b in match:
        print line.strip()

Save as match.py run by python match.py

ADD COMMENTlink written 3 months ago by Damian Kao14k
1

For the record following python code works fine as well. I think @Damian forgot to add brackets in the print line.

match = set(open('archloro.txt').read().strip().split('\n'))
for line in open('ArabidopsisParalogue.txt'):
    a,b = line.strip().split('\t')
    if a in match and b in match:
        print(line.strip())
ADD REPLYlink modified 3 months ago • written 3 months ago by genomax52k

Damian wrote Python2.

ADD REPLYlink written 3 months ago by WouterDeCoster30k

I guess there are people in the field that learned python from v3 onwards. I feel like a dinosaur now.

ADD REPLYlink written 3 months ago by Damian Kao14k

Yeah, I started out on 2.7 but realised quickly enough that I should write Python3 so now I'm used to that.

ADD REPLYlink modified 3 months ago • written 3 months ago by WouterDeCoster30k
3
gravatar for Kevin Blighe
3 months ago by
Kevin Blighe24k
Republic of Ireland
Kevin Blighe24k wrote:

I think that this may be it:

awk 'FNR==NR {idx[$1]=$1; next} {print idx[$(1)]"\t"$2}' FS=' ' archloro.txt FS=' ' ArabidopsisParalogue.txt
AT1G01080   AT5G50250
AT1G01090   AT1G21400
AT1G01090   AT1G24180
AT1G01090   AT1G59900
AT1G01090   AT5G09300
AT1G01090   AT5G34780
AT1G01100   AT4G00810
AT1G01100   AT5G24510
AT1G01100   AT5G47700
AT1G01110   AT1G17480

ADD COMMENTlink modified 3 months ago by Ram16k • written 3 months ago by Kevin Blighe24k
1

These are lines from ArabidopsisParalogue.txt in which both genes are also present in ararchloro.txt:

awk 'FNR==NR {idx[$1]=$1; next} {if (idx[$1] && idx[$2]) print}' FS=' ' archloro.txt FS=' ' ArabidopsisParalogue.txt

AT1G01080       AT5G50250
AT1G01090       AT1G21400
AT1G01090       AT1G24180
AT1G01090       AT1G59900
AT1G01090       AT5G09300
AT1G01090       AT5G34780
ADD REPLYlink modified 3 months ago • written 3 months ago by Kevin Blighe24k

Upon further review, I am getting the same number of a particular gene from one column in the other. For example, AT1G01090 shows up 5 times in the left column. But further down the right column it shows up 5 times matching with the same genes just inverted.

AT1G01080       AT5G50250
AT1G01090       AT1G21400
AT1G01090       AT1G24180
AT1G01090       AT1G59900
AT1G01090       AT5G09300
AT1G01090       AT5G34780

Further down, (not sorted)

AT5G50250       AT1G01080
AT1G21400       AT1G01090
AT1G24180       AT1G01090
AT1G59900       AT1G01090
AT5G09300       AT1G01090
AT5G34780       AT1G01090

We know all these genes are targeted to the chloroplast from your script but we're getting some duplicates. I know I can just cut these into two files and than cat them together to get the unique number of genes to find the line count. But I need a way to align them back up the original way. In other words, I need a code that will take out the reciprocal genes matches.

ADD REPLYlink written 3 months ago by bweil20

I'd switch to R if I were you.

ADD REPLYlink written 3 months ago by Ram16k

You may want to try this, but please rigorously check this:

awk 'FNR==NR {idx[$1]=$1; next} {if (idx[$1] && idx[$2]) {print}}' FS=' ' archloro.txt FS=' ' ArabidopsisParalogue.txt | awk '{print $0"\t"$1$2"\t"$2$1}' | awk '{if (!duplicate[$3]++ && !duplicate[$4]++) {print $1"\t"$2}}'

This works by creating a unique key (for each record) that is based on Col1+Col2 and also Col2+Col1

ADD REPLYlink modified 3 months ago • written 3 months ago by Kevin Blighe24k
2
gravatar for cpad0112
3 months ago by
cpad01127.7k
India
cpad01127.7k wrote:
$ sort -k2 ArabidopsisParalogue.txt | join -t $'\t' -1 2 -2 1 - archloro.txt | sort -k2| join -t $'\t'  -1 1 -2 2  archloro.txt -| uniq
AT1G01080   AT5G50250
AT1G01090   AT1G21400
AT1G01090   AT1G24180
AT1G01090   AT1G59900
AT1G01090   AT5G09300
AT1G01090   AT5G34780
ADD COMMENTlink modified 3 months ago • written 3 months ago by cpad01127.7k
1
gravatar for venu
4 months ago by
venu5.4k
Germany
venu5.4k wrote:

I tried grep -f but that would just find the genes in both files have in common and not 2 genes

 after grep command | awk -F'\t' '$1==$2'

above works if ArabidopsisParalogue.txt is tab delimited file. If not change -F'\t'.

ADD COMMENTlink written 4 months ago by venu5.4k

Yes it is tab delimited but its still not working. This is the input

grep -f Achloroplast.txt ArabidopsisParalogues.txt |  awk -F'\t' '$1==$2' >> MatchingParalogues.txt
ADD REPLYlink written 4 months ago by bweil20

When you tried/implemented some answer from any forum, make sure you are updating with the result, i.e whether the solution worked or not?

P.S: It didn't work if I'm correct. Try this

cat Achloroplast.txt | cut -d "|" -f1 | sort -u | grep -Fwf - ArabidopsisParalogues.txt | awk -F'\t' '$1==$2'
ADD REPLYlink modified 4 months ago • written 4 months ago by venu5.4k

Did not work

Now it says grep: -: No such file or directory

So, I took out the "-" before ArabidopsisParalogues.txt

Getting no values with that

ADD REPLYlink written 4 months ago by bweil20
1
gravatar for robertobfisher
4 months ago by
robertobfisher30 wrote:

You could write a short script in a language like Python. Pseudocode:

f1 = archoloro.txt
f2 = ArabidopsisParalogue.txt

s = construct set from f1
t = empty set

for each line in f2
    g1, g2 = split line at "\t"

    if g1 and g2 in s
        add g1, g2 to s

If your file is large, there's two optimizations you can do:

  • Specifically use a HashSet or similar structure that permits low-cost test of membership
  • Before checking if g1 and g2 are in s, check whether they are both already in t (which should be much smaller) first
ADD COMMENTlink written 4 months ago by robertobfisher30

I've only been using terminal on my mac. I do have R but I never used python.

ADD REPLYlink written 4 months ago by bweil20

Unfortunately I don't know many terminal commands so I wrote pseudocode instead. Hopefully you can convert it to a language you're familiar with.

ADD REPLYlink written 3 months ago by robertobfisher30

I've figured out the exact code for all of them except for "add g1, g2 to s." Is this just adding two files to another? Any ideas on the exact command?

ADD REPLYlink written 4 months ago by bweil20

s here should be a data structure that resembles a mathematical set (all elements unique). In Python, you can do s = set() to create an empty set and add elements with s.add(g1) and s.add(g2).

ADD REPLYlink written 3 months ago by robertobfisher30
1
gravatar for igor
3 months ago by
igor6.4k
United States
igor6.4k wrote:

I think you are looking for the join command. You can do something like: join archloro.txt ArabidopsisParalogue.txt.

More info and examples here: http://www.theunixschool.com/2012/01/join-command.html

ADD COMMENTlink written 3 months ago by igor6.4k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 521 users visited in the last hour