Question: Extract rows from file two having the 4th column ( which contain /1 /2 signs ) identical to column 7 from file 1 ( which doesn't have /1 /2 signs )
0
gravatar for i.jabre26
10 days ago by
i.jabre2610
i.jabre2610 wrote:

File1

 chr  start1 end1 start2 end2 score Tag name
1     1000  2000  3000   4000   0  K00230:40:HNWJLBBXX:4:1101:1002:24155 
1     8000  9000  10000  11000  0  K00230:40:HNWJLBBXX:4:1101:1002:35936

File2

chr  start1 end1 score  Tag name 
1     8000  9000   0    K00230:40:HNWJLBBXX:4:1101:1002:35936/1 
1     10000 11000  0    K00230:40:HNWJLBBXX:4:1101:1002:35936/2
1     1000  2000   0    K00230:40:HNWJLBBXX:4:1101:1002:35900/1 
1     3000  4000   0    K00230:40:HNWJLBBXX:4:1101:1002:35900/2

expected output

  chr  start1 end1 score  Tag name  
  1    8000  9000   0    K00230:40:HNWJLBBXX:4:1101:1002:35936/1 
  1    10000 11000  0    K00230:40:HNWJLBBXX:4:1101:1002:35936/2

May I please have some help to get the desired output

Thanks

python • 122 views
ADD COMMENTlink modified 10 days ago by cpad011210k • written 10 days ago by i.jabre2610

what happened to your post here : Extract rows having the 11th column values lies between 2nd and 3nd of a second file if 1st column matches ?? i.jabre26

ADD REPLYlink modified 10 days ago • written 10 days ago by cpad011210k

I already worked that out :)

ADD REPLYlink written 10 days ago by i.jabre2610
2
gravatar for shawn.w.foley
10 days ago by
shawn.w.foley130
shawn.w.foley130 wrote:

You can do this pretty easily in python, you'll read through File 1 and save the 7th column in a list. Then read through File 2, delimit the 5th column on the slash ("/") and if the element before the slash is in your list then print the line. Example code would be:

inFile1 = open('file1','r')
inFile2 = open('file2','r')
printList = []

for line in inFile1:
    lineList = line.rstrip().split('\t') #Generate a list split on the tab
    printList.append(lineList[6])  #Append your tag to printList

header = inFile2.readline()  #Remove header
print(header.rstrip())

for line in inFile2:
    lineList = line.rstrip().split('\t') #Generate a list split on the tab
    (tag,junk) = lineList[4].split('/') #Split your tag on the slash
    if tag in printList:
        print(line.rstrip())

Personally, I'd recommend doing this in awk since you can get through it with just one line of code:

awk 'BEGIN{OFS="\t";FS="\t"} NR==FNR{tagArray[$7];next} NR>FNR{split($5,ARRAY,"/"); if (ARRAY[1] in tagArray) print $0}' file1 file2

This uses tabs as delimiters (BEGIN{OFS="\t";FS="\t}) then it reads in the first file and saves the 7th column in an array called tagArray (NR==FNR{tagArray[$7];next}). After it starts reading in the second file, it splits the 4th column on the slash, and if the first element of the 5th column is in tagArray is prints the line (NR>FNR{split($4,ARRAY,"/"); if (ARRAY[1] in tagArray) print $0}).

ADD COMMENTlink modified 10 days ago • written 10 days ago by shawn.w.foley130

Thank you for your codes. I have tried them by both are giving me an error. Did you try them on the test example I posted. Maybe the problem is from my files...

ADD REPLYlink written 10 days ago by i.jabre2610

Apologies, I edited the above code. The tag name is in column 5 of the file2, the original code assumed it was in column 4. I also added a line in the python script to remove the header from file2.

ADD REPLYlink written 10 days ago by shawn.w.foley130

Thank you for editing the codes :).. I'm still getting an error tho. Please see below the error I'm getting for the python code

Traceback (most recent call last): File "iNPS_tags/Trial.py", line 8, in <module> printList.append(lineList[6]) #Append your tag to printList IndexError: list index out of range

Concerning the awk code it is printing the following where I'm only intrested in printing only the tag ending with 35936 since they exist in both files 1 and 2. AS you can see tags ending with 35900 are not needed since they don;t exist in both files.. Thank you so much in advance !

chr  start1 end1 score  tag 
1     8000  9000   0    K00230:40:HNWJLBBXX:4:1101:1002:35936/1 
1     10000 11000  0    K00230:40:HNWJLBBXX:4:1101:1002:35936/2
1     1000  2000   0    K00230:40:HNWJLBBXX:4:1101:1002:35900/1 
1     3000  4000   0    K00230:40:HNWJLBBXX:4:1101:1002:35900/2
ADD REPLYlink modified 10 days ago • written 10 days ago by i.jabre2610

The error from the python code, printList.append(lineList[6]) #Append your tag to printList IndexError: list index out of range indicates that the script reaches a line where there are not seven columns in you file. Is this a typo in the file?

And I'm not sure why you're having an issue with the awk code, using the example you provided I see the expected output:

awk 'BEGIN{OFS="\t";FS="\t"} NR==FNR{tagArray[$7];next} NR>FNR{split($5,ARRAY,"/"); if (ARRAY[1] in tagArray) print $0}' file1 file2
1   8000    9000    0   K00230:40:HNWJLBBXX:4:1101:1002:35936/1 
1   10000   11000   0   K00230:40:HNWJLBBXX:4:1101:1002:35936/2

Am I misunderstanding your question or are you getting a different result?

ADD REPLYlink written 7 days ago by shawn.w.foley130

Thank you so much for your reply. No you didn't misunderstand my question at all. This is the desired output. you were very helpful. I guess I'm doing something wrong because I'm getting different result using the awk code

Thank you so much !

ADD REPLYlink written 7 days ago by i.jabre2610
1

The code worked perfectly fine ! thanks

ADD REPLYlink written 7 days ago by i.jabre2610

Hello again, I was wondering if for the python code or for the awk code if we can specify the core number., Many thanks :)

ADD REPLYlink written 1 day ago by i.jabre2610
0
gravatar for cpad0112
10 days ago by
cpad011210k
India
cpad011210k wrote:

output (add header to the output):

$ tail -n+2 test1.txt  | cut -f7 | grep -f - test2.txt 

1   8000    9000    0   K00230:40:HNWJLBBXX:4:1101:1002:35936/1 
1   10000   11000   0   K00230:40:HNWJLBBXX:4:1101:1002:35936/2

input;

$ tail -n+1 *.txt
==> test1.txt <==
chr start1  end1    start2  end2    score   Tag name
1   1000    2000    3000    4000    0   K00230:40:HNWJLBBXX:4:1101:1002:24155   
1   8000    9000    10000   11000   0   K00230:40:HNWJLBBXX:4:1101:1002:35936

==> test2.txt <==
chr start1  end1    score   Tag name    
1   8000    9000    0   K00230:40:HNWJLBBXX:4:1101:1002:35936/1 
1   10000   11000   0   K00230:40:HNWJLBBXX:4:1101:1002:35936/2
1   1000    2000    0   K00230:40:HNWJLBBXX:4:1101:1002:35900/1 
1   3000    4000    0   K00230:40:HNWJLBBXX:4:1101:1002:35900/2
ADD COMMENTlink modified 10 days ago • written 10 days ago by cpad011210k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1241 users visited in the last hour