Question: How to match genes of one file to another but in multiple columns
0
gravatar for bweil2
6 weeks ago by
bweil20
bweil20 wrote:

I want to see which genes in archloro.txt match both lines of ArabidopsisParalogue.txt. I tried grep -f but that would just find the genes in both files have in common and not 2 genes that are in the same line. Any ideas on how to do this?

Basically i want to see how many and which rows in ArarbidopsisParalogue.txt match with 2 genes from archloro.txt.

head archloro.txt

AT1G01080
AT1G01090
AT1G01100
AT1G01110
AT1G01250
AT1G01550
AT1G01690
AT1G01730
AT1G01860
AT1G01950
AT1G02060
AT1G21400
AT1G24180
AT1G59900
AT1G17480
AT5G09300
AT5G34780
AT5G50250

head ArabidopsisParalogue.txt

AT1G01080       AT5G50250
AT1G01090       AT1G21400
AT1G01090       AT1G24180
AT1G01090       AT1G59900
AT1G01090       AT5G09300
AT1G01090       AT5G34780
AT1G01100       AT4G00810
AT1G01100       AT5G24510
AT1G01100       AT5G47700
AT1G01110       AT1G17480

Expected output

AT1G01080      AT5G50250
AT1G01090      AT1G21400
AT1G01090      AT1G24180
AT1G01090      AT1G59900
AT1G01090      AT5G09300
AT1G01090      AT5G34780
AT1G01100      AT1G17480
alignment sequence gene genome • 532 views
ADD COMMENTlink modified 29 days ago • written 6 weeks ago by bweil20
2

It's not clear to me from your question and example data how a gene ID from archloro.txt can match both IDs in one row of ArabidopsisParalogue.txt, when the IDs in the latter file are not identical. Could you clarify the question with some expected output?

ADD REPLYlink written 5 weeks ago by Neilfws47k

due to formatting issues, I could include head ArabidopsisParalogue.txt. There are two columns of genes.

ADD REPLYlink written 6 weeks ago by bweil20

These commands are not working for me!

ADD REPLYlink written 5 weeks ago by bweil20
1

Unless you show data from both files, one can't give any suggestions.

BTW, please move this answer to comment as it is not an answer.

ADD REPLYlink written 5 weeks ago by venu5.0k

The last line of your expected output is incorrect because AT1G17480 is not present in archloro.txt (in the data that you've shown)

ADD REPLYlink written 29 days ago by Kevin Blighe17k

and AT5G34780 is present twice in first file.

ADD REPLYlink written 29 days ago by cpad01124.5k

Yep (and good to see you again).

Your solution also works, it seems. bweil2, if you could devote a few moments more to looking at the other 2 new answers to see if they work (and Accept them also, if so), then that would be great.

ADD REPLYlink written 29 days ago by Kevin Blighe17k
1

thanks and you seem to be in good spirits always @kevin

ADD REPLYlink written 29 days ago by cpad01124.5k
4
gravatar for Damian Kao
29 days ago by
Damian Kao14k
USA
Damian Kao14k wrote:

You can use this python script:

match = set(open('archloro.txt').read().strip().split('\n'))
for line in open('ArabidopsisParalogue.txt'):
    a,b = line.strip().split('\t')
    if a in match and b in match:
        print line.strip()

Save as match.py run by python match.py

ADD COMMENTlink written 29 days ago by Damian Kao14k
1

For the record following python code works fine as well. I think @Damian forgot to add brackets in the print line.

match = set(open('archloro.txt').read().strip().split('\n'))
for line in open('ArabidopsisParalogue.txt'):
    a,b = line.strip().split('\t')
    if a in match and b in match:
        print(line.strip())
ADD REPLYlink modified 29 days ago • written 29 days ago by genomax46k

Damian wrote Python2.

ADD REPLYlink written 28 days ago by WouterDeCoster27k

I guess there are people in the field that learned python from v3 onwards. I feel like a dinosaur now.

ADD REPLYlink written 28 days ago by Damian Kao14k

Yeah, I started out on 2.7 but realised quickly enough that I should write Python3 so now I'm used to that.

ADD REPLYlink modified 28 days ago • written 28 days ago by WouterDeCoster27k
3
gravatar for Kevin Blighe
29 days ago by
Kevin Blighe17k
University College London Cancer Institute
Kevin Blighe17k wrote:

I think that this may be it:

awk 'FNR==NR {idx[$1]=$1; next} {print idx[$(1)]"\t"$2}' FS=' ' archloro.txt FS=' ' ArabidopsisParalogue.txt
AT1G01080   AT5G50250
AT1G01090   AT1G21400
AT1G01090   AT1G24180
AT1G01090   AT1G59900
AT1G01090   AT5G09300
AT1G01090   AT5G34780
AT1G01100   AT4G00810
AT1G01100   AT5G24510
AT1G01100   AT5G47700
AT1G01110   AT1G17480

ADD COMMENTlink modified 29 days ago by Ram14k • written 29 days ago by Kevin Blighe17k
1

These are lines from ArabidopsisParalogue.txt in which both genes are also present in ararchloro.txt:

awk 'FNR==NR {idx[$1]=$1; next} {if (idx[$1] && idx[$2]) print}' FS=' ' archloro.txt FS=' ' ArabidopsisParalogue.txt

AT1G01080       AT5G50250
AT1G01090       AT1G21400
AT1G01090       AT1G24180
AT1G01090       AT1G59900
AT1G01090       AT5G09300
AT1G01090       AT5G34780
ADD REPLYlink modified 29 days ago • written 29 days ago by Kevin Blighe17k

Upon further review, I am getting the same number of a particular gene from one column in the other. For example, AT1G01090 shows up 5 times in the left column. But further down the right column it shows up 5 times matching with the same genes just inverted.

AT1G01080       AT5G50250
AT1G01090       AT1G21400
AT1G01090       AT1G24180
AT1G01090       AT1G59900
AT1G01090       AT5G09300
AT1G01090       AT5G34780

Further down, (not sorted)

AT5G50250       AT1G01080
AT1G21400       AT1G01090
AT1G24180       AT1G01090
AT1G59900       AT1G01090
AT5G09300       AT1G01090
AT5G34780       AT1G01090

We know all these genes are targeted to the chloroplast from your script but we're getting some duplicates. I know I can just cut these into two files and than cat them together to get the unique number of genes to find the line count. But I need a way to align them back up the original way. In other words, I need a code that will take out the reciprocal genes matches.

ADD REPLYlink written 23 days ago by bweil20

I'd switch to R if I were you.

ADD REPLYlink written 23 days ago by Ram14k

You may want to try this, but please rigorously check this:

awk 'FNR==NR {idx[$1]=$1; next} {if (idx[$1] && idx[$2]) {print}}' FS=' ' archloro.txt FS=' ' ArabidopsisParalogue.txt | awk '{print $0"\t"$1$2"\t"$2$1}' | awk '{if (!duplicate[$3]++ && !duplicate[$4]++) {print $1"\t"$2}}'

This works by creating a unique key (for each record) that is based on Col1+Col2 and also Col2+Col1

ADD REPLYlink modified 22 days ago • written 22 days ago by Kevin Blighe17k
2
gravatar for cpad0112
29 days ago by
cpad01124.5k
cpad01124.5k wrote:
$ sort -k2 ArabidopsisParalogue.txt | join -t $'\t' -1 2 -2 1 - archloro.txt | sort -k2| join -t $'\t'  -1 1 -2 2  archloro.txt -| uniq
AT1G01080   AT5G50250
AT1G01090   AT1G21400
AT1G01090   AT1G24180
AT1G01090   AT1G59900
AT1G01090   AT5G09300
AT1G01090   AT5G34780
ADD COMMENTlink modified 29 days ago • written 29 days ago by cpad01124.5k
1
gravatar for venu
6 weeks ago by
venu5.0k
Germany
venu5.0k wrote:

I tried grep -f but that would just find the genes in both files have in common and not 2 genes

 after grep command | awk -F'\t' '$1==$2'

above works if ArabidopsisParalogue.txt is tab delimited file. If not change -F'\t'.

ADD COMMENTlink written 6 weeks ago by venu5.0k

Yes it is tab delimited but its still not working. This is the input

grep -f Achloroplast.txt ArabidopsisParalogues.txt |  awk -F'\t' '$1==$2' >> MatchingParalogues.txt
ADD REPLYlink written 5 weeks ago by bweil20

When you tried/implemented some answer from any forum, make sure you are updating with the result, i.e whether the solution worked or not?

P.S: It didn't work if I'm correct. Try this

cat Achloroplast.txt | cut -d "|" -f1 | sort -u | grep -Fwf - ArabidopsisParalogues.txt | awk -F'\t' '$1==$2'
ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by venu5.0k

Did not work

Now it says grep: -: No such file or directory

So, I took out the "-" before ArabidopsisParalogues.txt

Getting no values with that

ADD REPLYlink written 5 weeks ago by bweil20
1
gravatar for robertobfisher
5 weeks ago by
robertobfisher10 wrote:

You could write a short script in a language like Python. Pseudocode:

f1 = archoloro.txt
f2 = ArabidopsisParalogue.txt

s = construct set from f1
t = empty set

for each line in f2
    g1, g2 = split line at "\t"

    if g1 and g2 in s
        add g1, g2 to s

If your file is large, there's two optimizations you can do:

  • Specifically use a HashSet or similar structure that permits low-cost test of membership
  • Before checking if g1 and g2 are in s, check whether they are both already in t (which should be much smaller) first
ADD COMMENTlink written 5 weeks ago by robertobfisher10

I've only been using terminal on my mac. I do have R but I never used python.

ADD REPLYlink written 5 weeks ago by bweil20

Unfortunately I don't know many terminal commands so I wrote pseudocode instead. Hopefully you can convert it to a language you're familiar with.

ADD REPLYlink written 24 days ago by robertobfisher10

I've figured out the exact code for all of them except for "add g1, g2 to s." Is this just adding two files to another? Any ideas on the exact command?

ADD REPLYlink written 5 weeks ago by bweil20

s here should be a data structure that resembles a mathematical set (all elements unique). In Python, you can do s = set() to create an empty set and add elements with s.add(g1) and s.add(g2).

ADD REPLYlink written 24 days ago by robertobfisher10
1
gravatar for igor
29 days ago by
igor5.7k
United States
igor5.7k wrote:

I think you are looking for the join command. You can do something like: join archloro.txt ArabidopsisParalogue.txt.

More info and examples here: http://www.theunixschool.com/2012/01/join-command.html

ADD COMMENTlink written 29 days ago by igor5.7k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 945 users visited in the last hour