Question

PYTHON: Differences between gene files: Result unexpected when genes have duplicates

18 months ago

ciaki • 0

Hi everyone, I am a biologist {NOT A PROGRAMMER} and I am trying to write my own code to find the differences between my data files.

File1.txt: Orange, orange, apple, pear

File2.txt: pear, Pear, Kiwi

Output.txt: -Orange -Orange -apple -pear +pear +Pear +Kiwi

In this case lowercase "pear" is the only fruit common to both files, so the output shows both +pear and -pear. That is not very helpful, because I want to use this code on really long gene lists. Is there some way to filter out the common fruit and display them, for example, without a "+" or "-" in output.txt? Checking by eye which entries have both a + and a - is not practical in a very big list full of duplicates.

This is my code:


>     import difflib
>     
>     # Read both files as lists of lines (paths elided as in the original post)
>     with open('/Users/.../file1.txt') as file_1:
>         file_1_text = file_1.readlines()
>     with open('/Users/.../file2.txt') as file_2:
>         file_2_text = file_2.readlines()
>     
>     # Write the unified diff to output.txt and echo it; the with block closes the file
>     with open('output.txt', 'w') as mfile:
>         for line in difflib.unified_diff(file_1_text, file_2_text,
>                                          fromfile='file1.txt',
>                                          tofile='/Users/.../file2.txt',
>                                          lineterm=''):
>             mfile.write("%s\n" % line)
>             print(line)

python • 966 views

ADD COMMENT • link updated 18 months ago by Wayne ★ 2.0k • written 18 months ago by ciaki • 0

What you are trying to do is commonly known as "set operations" in programming. That keyword should help you google what you need; at first glance, this tutorial seems quite appropriate.
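
As a minimal sketch of what set operations look like in Python, here are the toy fruit lists from the question compared with sets. The function names `compare_lists` and `format_report` and the report layout (a leading space for shared items) are illustrative assumptions, not part of any library. Note that converting to sets drops duplicates, so each item is reported once.

```python
def compare_lists(first, second):
    """Classify items as shared, only in first, or only in second.

    Converting to sets removes duplicates, so each item is reported once.
    """
    first_set, second_set = set(first), set(second)
    return {
        "common": sorted(first_set & second_set),       # intersection
        "only_first": sorted(first_set - second_set),   # difference
        "only_second": sorted(second_set - first_set),  # difference
    }


def format_report(result):
    """Build report lines: ' ' for shared, '-' only in first, '+' only in second."""
    lines = [" " + item for item in result["common"]]
    lines += ["-" + item for item in result["only_first"]]
    lines += ["+" + item for item in result["only_second"]]
    return lines


# Toy data from the question; real gene lists would be read from the files instead
file1_items = ["Orange", "orange", "apple", "pear"]
file2_items = ["pear", "Pear", "Kiwi"]
result = compare_lists(file1_items, file2_items)
report = format_report(result)
print("\n".join(report))
```

With this layout the shared items appear once, without a "+" or "-", which is the filtering the question asks for.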

ADD REPLY • link 18 months ago by Matthias Zepper 4.5k

If using Python is not a requirement, an easy way to find the genes common to two files is to first convert the files, replacing the commas with newlines:

sed 's/, /\n/g' file1 > file1_out.txt 
sed 's/, /\n/g' file2 > file2_out.txt

And then find the common elements between these two new files using grep:

grep -wFf file1_out.txt file2_out.txt > common.txt
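
For anyone who prefers to stay in Python, the same two steps (split on commas, then keep the shared items) could be sketched roughly as follows. This is only a sketch: it assumes the items are separated by commas and/or newlines, the helper names `parse_items` and `common_items` are made up for illustration, and unlike `grep -w` it only counts exact, whole-item matches.

```python
import re


def parse_items(text):
    """Split text on commas and/or newlines and drop empty entries."""
    return [item.strip() for item in re.split(r"[,\n]", text) if item.strip()]


def common_items(text1, text2):
    """Return items present in both texts, in the order they appear in text2."""
    wanted = set(parse_items(text1))
    seen = set()
    shared = []
    for item in parse_items(text2):
        if item in wanted and item not in seen:
            seen.add(item)
            shared.append(item)
    return shared


# In practice the strings would come from open(path).read();
# literals keep the sketch self-contained.
file1_text = "Orange, orange, apple, pear\n"
file2_text = "pear, Pear, Kiwi\n"
shared = common_items(file1_text, file2_text)
print(shared)
```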

ADD REPLY • link 18 months ago by iraun 6.2k

It is not that I have to use Python, but it is preferable because once I get it working, people in my lab will use it too!

ADD REPLY • link 18 months ago by ciaki • 0

I guess people in your lab could use bash just as they would use Python? I personally find bash and awk faster and simpler for straightforward file parsing problems, such as the current one of finding common elements between two files.

ADD REPLY • link 18 months ago by iraun 6.2k

Well, if your goal is not to learn Python, but to provide your lab with an easy way to intersect gene lists, then I would recommend a browser-based GUI approach.

Galaxy has a rudimentary intersection feature, but much nicer is Intervene (documentation), which can also create beautiful figures.

ADD REPLY • link 18 months ago by Matthias Zepper 4.5k

score 0 · Answer 1 · 2022-10-14

18 months ago

Joe 21k

You are basically looking for this:

https://stackoverflow.com/questions/9585218/python-find-common-text-in-two-files

ADD COMMENT • link 18 months ago by Joe 21k

The OP may want to look into the string methods .lower() and .upper() in the Python documentation. Combining those with the set building highlighted in that StackOverflow post can make the comparison case-insensitive. It may not matter with actual genes; however, it does matter with the toy example given. And sometimes it is best to build it in, to be sure you have eliminated the possibility of that issue arising.
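
A rough sketch of that idea, using the toy lists from the question: each item is lower-cased with `.lower()` before the sets are compared, while the original spellings are kept so they can still be reported. The helper name `normalized_index` is hypothetical, and whether gene names should be compared case-insensitively at all is an assumption that depends on the data.

```python
def normalized_index(items):
    """Map each lower-cased item to the set of original spellings seen."""
    index = {}
    for item in items:
        index.setdefault(item.lower(), set()).add(item)
    return index


file1_index = normalized_index(["Orange", "orange", "apple", "pear"])
file2_index = normalized_index(["pear", "Pear", "Kiwi"])

# Dictionary key views support set operations such as & directly
common_keys = sorted(file1_index.keys() & file2_index.keys())
for key in common_keys:
    # Show every spelling of the shared item found in either file
    spellings = sorted(file1_index[key] | file2_index[key])
    print(key, spellings)
```

Here "pear" and "Pear" are treated as the same item, and "Orange" and "orange" collapse into a single entry for the first file.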

ADD REPLY • link 18 months ago by Wayne ★ 2.0k