Question: Technical question about python "to find the strings"
horsedog30 wrote:

Hi there, I have two files. File 1 looks like this:

NP_208181.1
NP_220259.1
NP_224629.1
WP_232131
WP_3432434
WP_2441241221

File 2 looks like this:

NP_208181.1,GCF_000008525.1
NP_212206.1,GCF_000008685.2
NP_213866.1,GCF_000008625.1
NP_219784.1,GCF_000008725.1
NP_220151.1,GCF_000008725.1
NP_220259.1,GCF_000008725.1
NP_224628.1,GCF_000008745.1
NP_224629.1,GCF_000008745.1
NP_224939.1,GCF_000008745.1

My goal is to find which IDs in file 1 also appear in file 2. Here we can see that NP_208181.1, NP_220259.1 and NP_224629.1 can be found in file 2, each followed by a GCF identifier. I wrote a small script like this:

import re
with open("file1") as ID, open("file2") as data:
  for line1, line2 in zip(ID,data):
    if line1 in line2:
      print(line1)

However, the output was blank, which does not make sense to me. Does anyone know why? How should I modify this script?

modified 28 days ago by shoujun.gu310 • written 4 weeks ago by horsedog30

Without testing it, I think the problem is that you're zipping the lines from the two files together, so it only compares line 1 of file 1 with line 1 of file 2, then line 2 with line 2, and so on. You'll need two loops for this to work as you've got it, e.g.:

for line1 in ID:
    for line2 in data:
        if line1 in line2:
            print(line1)

and so on.

I'd look into using Python's built-in any() and all() functions though; they may help here.
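
As a rough illustration of the any() suggestion, the sketch below keeps every data line that contains at least one of the IDs. The sample values are copied from the question; the names ids, data_lines and matches are made up for this example, and the lines are assumed to be stripped of newlines already.

```python
# IDs and data lines copied from the question, already stripped of newlines.
ids = ["NP_208181.1", "NP_220259.1", "WP_232131"]
data_lines = [
    "NP_208181.1,GCF_000008525.1",
    "NP_212206.1,GCF_000008685.2",
    "NP_220259.1,GCF_000008725.1",
]

# Keep a data line if at least one ID occurs in it as a substring.
matches = [line for line in data_lines if any(i in line for i in ids)]

for line in matches:
    print(line)
```

Note that this is still a substring test, so a short ID could also match inside a longer one.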

If you're not bothered about using Python specifically, you could do this in a single line (sort of) with grep:

while read -r ID; do grep -F "$ID" Data_file.txt; done < ID_file.txt
modified 4 weeks ago • written 4 weeks ago by jrj.healey3.4k

Hi, thanks for the correction, but I tried it and the output is still blank. Here is my new code:

with open("file") as ID, open("file2") as data:
    for line1 in ID:
        for line2 in data:
            if line1 in line2:
                print(line1)
written 4 weeks ago by horsedog30
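
Two things probably explain why the nested loops still print nothing: every line read from a file keeps its trailing newline, and a file object can only be iterated once. The sketch below shows both effects with io.StringIO objects standing in for the two files; the variable names are invented for this illustration.

```python
import io

# Stand-ins for the two open files; io.StringIO iterates line by line like a file.
id_file = io.StringIO("NP_208181.1\nNP_220259.1\n")
data_file = io.StringIO("NP_208181.1,GCF_000008525.1\nNP_220259.1,GCF_000008725.1\n")

# Problem 1: each line keeps its trailing newline, so the ID plus "\n"
# is not a substring of a data line where the ID is followed by a comma.
first_id = next(id_file)
newline_blocks_match = first_id not in "NP_208181.1,GCF_000008525.1\n"

# Problem 2: a file object can only be iterated once. After one full pass
# the inner loop has nothing left to read.
first_pass = list(data_file)
second_pass = list(data_file)

print(repr(first_id), newline_blocks_match)
print(len(first_pass), len(second_pass))
```

So the first ID never matches because of the newline, and for every later ID the inner loop over the data file is already exhausted.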

I believe this answer is important: A: Technical question about python "to find the strings"

written 4 weeks ago by WouterDeCoster26k

What about comm?

Usage: comm [OPTION]... FILE1 FILE2
Compare sorted files FILE1 and FILE2 line by line.

With no options, produce three-column output.  Column one contains
lines unique to FILE1, column two contains lines unique to FILE2,
and column three contains lines common to both files.

  -1              suppress column 1 (lines unique to FILE1)
  -2              suppress column 2 (lines unique to FILE2)
  -3              suppress column 3 (lines that appear in both files)

  --check-order     check that the input is correctly sorted, even
                      if all input lines are pairable
  --nocheck-order   do not check that the input is correctly sorted
  --output-delimiter=STR  separate columns with STR
      --help     display this help and exit
      --version  output version information and exit

Note, comparisons honor the rules specified by `LC_COLLATE'.

Examples:
  comm -12 file1 file2  Print only lines present in both file1 and file2.
  comm -3  file1 file2  Print lines in file1 not in file2, and vice versa.
written 4 weeks ago by Buffo650

Fixed some duff logic in my answer, it should work for your case now.

written 4 weeks ago by jrj.healey3.4k
jomo018250 wrote:

First, you need to strip the end-of-line character from the lines, for example with line1.strip(). Second, with zip you are testing each line only against the corresponding line of the other file. That should catch the first line but none of the others.

written 4 weeks ago by jomo018250

Hello, do you mean like this?

with open("file") as ID, open("file2") as data:
    for line1.strip() in ID:
        for line2.strip() in data:
            if line1.strip() in line2.strip():
                print(line1.strip())

?

written 4 weeks ago by horsedog30
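
A loop header such as "for line1.strip() in ID:" is a syntax error, because the loop target has to be something assignable and a function call is not. The call belongs inside the loop body instead. A minimal sketch of that, assuming the data file is small enough to read into a list first (io.StringIO stands in for the real files, and the variable names are only illustrative):

```python
import io

# Small in-memory stand-ins for file1 and file2 (values from the question).
id_file = io.StringIO("NP_208181.1\nNP_220259.1\nWP_232131\n")
data_file = io.StringIO(
    "NP_208181.1,GCF_000008525.1\n"
    "NP_212206.1,GCF_000008685.2\n"
    "NP_220259.1,GCF_000008725.1\n"
)

# Read file 2 once into a list so it can be scanned for every ID.
data_lines = [line.strip() for line in data_file]

found = []
for line1 in id_file:
    line1 = line1.strip()  # strip inside the loop body, not in the for header
    if not line1:
        continue  # skip blank lines
    for line2 in data_lines:
        if line1 in line2:
            found.append(line1)
            break  # one hit per ID is enough

print(found)
```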
jrj.healey3.4k wrote:

Combining my comment and jomo018's point about the line-ending character (stripping is only strictly necessary for file 1, since its strings have to be found inside the lines of file 2, but I've done both here anyway):

#!/usr/bin/env python

# assume the script is named comparelines.py
# invoke with the ID file as the first commandline arg, 
# and data file as commandline arg 2

import sys

with open(sys.argv[1], 'r') as ID_file, open(sys.argv[2], 'r') as data_file:
    IDs = [ID.strip() for ID in ID_file]
    data = [line.strip() for line in data_file]

    result = [j for i in IDs for j in data if i in j]

    for each in result:
        print(each)

So

$ python comparelines.py IDs.txt data.txt

gives:

NP_208181.1,GCF_000008525.1
NP_220259.1,GCF_000008725.1
NP_224629.1,GCF_000008745.1
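
The script above compares every ID against every data line with a substring test. As a possible alternative, and assuming the ID is always the first comma-separated field of each data line (as it is in the sample), the IDs can go into a set and each data line can be checked with an exact lookup. The function name filter_by_id is invented for this sketch.

```python
def filter_by_id(ids, data_lines):
    """Return the data lines whose first comma-separated field is in ids."""
    # Strip newlines and drop empty entries before building the lookup set.
    wanted = {i.strip() for i in ids if i.strip()}
    return [
        line.strip()
        for line in data_lines
        if line.strip().split(",", 1)[0] in wanted
    ]


# Sample lines from the question, with newlines as they would come from a file.
ids = ["NP_208181.1\n", "NP_220259.1\n", "NP_224629.1\n", "WP_232131\n"]
data_lines = [
    "NP_208181.1,GCF_000008525.1\n",
    "NP_212206.1,GCF_000008685.2\n",
    "NP_220259.1,GCF_000008725.1\n",
    "NP_224628.1,GCF_000008745.1\n",
    "NP_224629.1,GCF_000008745.1\n",
]

result = filter_by_id(ids, data_lines)
for each in result:
    print(each)
```

An exact match on the first field also avoids accidental hits where one ID happens to be a substring of another.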

EDIT

Fixed it.

modified 4 weeks ago • written 4 weeks ago by jrj.healey3.4k
shoujun.gu310 wrote:

I believe both of your input files are effectively CSV files.

Thus, the most efficient way is to: 1) read these files into dataframes, 2) inner join on the column you want.
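
The same inner-join idea can be sketched without a dataframe library, using only the standard csv module and a dictionary instead. This is a swapped-in technique for illustration, not the dataframe approach itself. io.StringIO stands in for the two files, and the names id_to_gcf and joined are made up here.

```python
import csv
import io

# In-memory stand-ins for file 1 (one column) and file 2 (two columns).
file1 = io.StringIO("NP_208181.1\nNP_220259.1\nWP_232131\n")
file2 = io.StringIO(
    "NP_208181.1,GCF_000008525.1\n"
    "NP_212206.1,GCF_000008685.2\n"
    "NP_220259.1,GCF_000008725.1\n"
)

# Map the first column of file 2 to its second column.
id_to_gcf = {row[0]: row[1] for row in csv.reader(file2) if len(row) >= 2}

# Inner join: keep only the IDs from file 1 that have an entry in file 2.
joined = [
    (row[0], id_to_gcf[row[0]])
    for row in csv.reader(file1)
    if row and row[0] in id_to_gcf
]

for protein_id, gcf in joined:
    print(protein_id, gcf, sep=",")
```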

written 28 days ago by shoujun.gu310