Question: Compare and search a list of IDs in two text files
Manoj (Canada) wrote, 4.6 years ago:

Hi,

I have two text files, shown below. I want the output to contain only the lines whose IDs appear in both files. I tried some Perl scripts and the grep commands below, but I did not get the desired output.

$ grep -w -f file1.txt file2.txt >out.txt

$ grep -wFf file1.txt file2.txt >out.txt

Perl script:

use strict;
use warnings;
use autodie;

my $f1 = shift || "file1.txt";
my $f2 = shift || "file2.txt";
my %results;

# Record every line of the first file.
open my $file1, '<', $f1;
while (my $line = <$file1>) {
    $results{$line} = 1;
}

# Add one for every identical line in the second file (whole-line comparison).
open my $file2, '<', $f2;
while (my $line = <$file2>) {
    $results{$line}++;
}

# Print every line whose count exceeds 1, highest count first.
foreach my $line (sort { $results{$b} <=> $results{$a} } keys %results) {
    print "$results{$line} Match found:  ", $line if $results{$line} > 1;
}

file 1:

AT1G01020.2  89247399:89248747
AT1G01050.1  89271467:89272751
AT1G01060.1  89274076:89277002
AT1G01070.1  89278983:89280958
AT1G01073.1  34927896:34928000
AT1G01090.1  89287790:89289247
AT1G01100.1  89290369:89290713
AT1G01100.3  81592809:81592958
AT1G01130.1  89302125:89303893
...........

file 2:

AT1G01010.1  89243839-89245706
AT1G01020.1  89246997-89247311
AT1G01020.1  89248315-89248745
AT1G01030.1  89251946-89253019
AT1G01040.1  89263598-89270896
AT1G01050.1  89271464-89272749
AT1G01060.1  89274074-89276072
AT1G01060.1  89276890-89277000
AT1G01070.1  89278980-89280956
AT1G01090.1  89287787-89289245
AT1G01100.1  89290366-89290710
...........


What exactly do you need as output? A list of IDs that occur in both files or similar IDs (matching without the number after the '.')? Do you only need the IDs or do you need the second column too?

Reply written 4.6 years ago by nterhoeven
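If "similar IDs" is what is wanted, one way to ignore the number after the '.' is to strip it before comparing. The following is only a minimal sketch under that assumption, which the poster never confirmed. The sample rows are copied from the question, and the scratch directory and the variable name `similar` are there purely to keep the example self-contained.

```shell
# Work in a scratch directory so no real files are touched.
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# A few rows copied from the two listings in the question.
cat > file1.txt <<'EOF'
AT1G01020.2  89247399:89248747
AT1G01050.1  89271467:89272751
EOF
cat > file2.txt <<'EOF'
AT1G01020.1  89246997-89247311
AT1G01020.1  89248315-89248745
AT1G01030.1  89251946-89253019
AT1G01050.1  89271464-89272749
EOF

# For every line, copy column 1 and drop a trailing ".N" suffix.
# First file: remember the stripped ID. Second file: print the
# line when its stripped ID was seen in the first file.
similar=$(awk '
    { id = $1; sub(/\.[0-9]+$/, "", id) }
    NR == FNR { genes[id]; next }
    id in genes
' file1.txt file2.txt)

printf '%s\n' "$similar"
```

Here AT1G01020.1 in file 2 is reported even though file 1 only has AT1G01020.2, because both reduce to AT1G01020.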

I updated my script, which may be an improvement, but it compares the two files on whole lines.

What I need is a match on the first column only, regardless of the remaining columns, and then the results printed with some extra information, such as the total number of times each matching ID is repeated in both files.

Thanks

Reply written 4.6 years ago by Manoj

The following command works well, but I cannot get it to report extra information, such as the total number of times a particular ID is repeated in both files:

$ awk 'NR==FNR{tgts[$1]; next} $1 in tgts' file1 file2

Reply written 4.6 years ago by Manoj
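A likely reason the awk one-liner succeeds where the earlier grep attempts did not: `grep -f file1.txt` uses each whole line of file 1 (ID plus coordinates) as a pattern, and the coordinate columns differ between the two files. The sketch below illustrates this on a few rows from the question; the scratch directory, `ids.txt`, and the variable names are assumptions made for the example.

```shell
# Work in a scratch directory so no real files are touched.
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# A few rows copied from the two listings in the question.
cat > file1.txt <<'EOF'
AT1G01020.2  89247399:89248747
AT1G01050.1  89271467:89272751
AT1G01060.1  89274076:89277002
EOF
cat > file2.txt <<'EOF'
AT1G01020.1  89246997-89247311
AT1G01050.1  89271464-89272749
AT1G01060.1  89274074-89276072
AT1G01060.1  89276890-89277000
EOF

# Whole lines of file1.txt as patterns: no line of file2.txt
# contains "ID  start:end", so nothing is printed.
whole_line=$(grep -wFf file1.txt file2.txt || true)

# Only the first column as patterns: match on the ID alone.
awk '{ print $1 }' file1.txt > ids.txt
by_id=$(grep -wFf ids.txt file2.txt)

# The awk one-liner from the comment above, for comparison.
by_awk=$(awk 'NR==FNR{tgts[$1]; next} $1 in tgts' file1.txt file2.txt)

printf 'whole-line matches: [%s]\n' "$whole_line"
printf '%s\n' "$by_id"
```

One difference worth keeping in mind: grep looks for the ID anywhere on the line, whereas `$1 in tgts` only ever tests the first column.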

Once you have a file with the particular IDs you want, try the following command to get a count of each ID:

sort your_file.txt | uniq -c

Reply written 4.6 years ago by James Ashmore
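The two comments above can be combined. The sketch below counts, for each ID present in both files, how many lines carry it in each file. The output layout (ID, count in file 1, count in file 2) is an assumption about what "total repeat match" means; the sample rows come from the question, and the scratch directory and the variable name `counts` are only for illustration.

```shell
# Work in a scratch directory so no real files are touched.
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# A few rows copied from the two listings in the question.
cat > file1.txt <<'EOF'
AT1G01050.1  89271467:89272751
AT1G01060.1  89274076:89277002
AT1G01073.1  34927896:34928000
EOF
cat > file2.txt <<'EOF'
AT1G01040.1  89263598-89270896
AT1G01050.1  89271464-89272749
AT1G01060.1  89274074-89276072
AT1G01060.1  89276890-89277000
EOF

# First file (NR == FNR): count each ID in column 1.
# Second file: count each ID in column 1.
# END: report only the IDs seen in both files. The output is piped
# through sort because awk's for-in order is unspecified.
counts=$(awk '
    NR == FNR { n1[$1]++; next }
              { n2[$1]++ }
    END {
        for (id in n1)
            if (id in n2)
                print id, n1[id], n2[id]
    }
' file1.txt file2.txt | sort)

printf '%s\n' "$counts"

# The sort | uniq -c route, applied to the IDs of the file2.txt
# lines that also occur in file1.txt.
awk 'NR==FNR{tgts[$1]; next} $1 in tgts {print $1}' file1.txt file2.txt |
    sort | uniq -c
```

With these rows, AT1G01060.1 is reported once for file 1 and twice for file 2, while AT1G01073.1 and AT1G01040.1 are dropped because each occurs in only one file.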
James Ashmore (UK/Edinburgh/MRC Centre for Regenerative Medicine) wrote, 4.6 years ago:

You could also do this with the UNIX command join:

join -1 1 -2 1 file1.txt file2.txt | awk '{print $1}'

It attempts to join the two files together based on common entries in the first column in each file.


Note that both files should be sorted before using join.

Reply written 4.6 years ago by Alex Reynolds

It's also a good idea to specify the field separator, e.g. -t $'\t'

Reply written 4.6 years ago by 5heikki
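Taking both comments into account, a sorted, tab-delimited run of join might look like the sketch below. It assumes the columns are separated by a single tab (the listings in the question appear to use spaces, in which case the `-t` options should be dropped), and the `*.sorted.txt` names and the variable `common` are made up for the example. `LC_ALL=C` is set so that sort and join agree on the collation order.

```shell
# Work in a scratch directory so no real files are touched.
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Use one collation order for both sort and join.
export LC_ALL=C
TAB=$(printf '\t')

# Unsorted, tab-separated sample rows based on the question.
printf 'AT1G01060.1\t89274076:89277002\n'  > file1.txt
printf 'AT1G01050.1\t89271467:89272751\n' >> file1.txt
printf 'AT1G01073.1\t34927896:34928000\n' >> file1.txt

printf 'AT1G01060.1\t89274074-89276072\n'  > file2.txt
printf 'AT1G01060.1\t89276890-89277000\n' >> file2.txt
printf 'AT1G01040.1\t89263598-89270896\n' >> file2.txt
printf 'AT1G01050.1\t89271464-89272749\n' >> file2.txt

# join needs both inputs sorted on the join field (column 1).
sort -t "$TAB" -k1,1 file1.txt > file1.sorted.txt
sort -t "$TAB" -k1,1 file2.txt > file2.sorted.txt

# Join on column 1, keep only the ID, and collapse the repeats
# that arise when an ID occurs on several lines of one file.
common=$(join -t "$TAB" -1 1 -2 1 file1.sorted.txt file2.sorted.txt |
    cut -f1 | uniq)

printf '%s\n' "$common"
```

The `printf '\t'` assignment is used instead of `$'\t'` only so that the sketch also runs in shells without that bash extension.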

The result of this command shows the data from both files in the new file.

Thanks

Reply written 4.6 years ago by Manoj

I've updated my answer so that it only outputs the IDs, which are in the first column of the results.

Reply written 4.6 years ago by James Ashmore
Janake (United States) wrote, 4.6 years ago:

If you don't need to use Perl, you can achieve this pretty easily in R.

Read your file1.txt and file2.txt into R and label the gene-name column V1 and the second column V2 (or any other names you want).

Then use the merge function to merge the two files and get the common lines.

common_file <- merge(list2, list1, by = "V1")
common_file
         V1              V2.x              V2.y
1 AT1G01050.1 89271467:89272751 89271464-89272749
2 AT1G01060.1 89274076:89277002 89274074-89276072
3 AT1G01060.1 89274076:89277002 89276890-89277000
4 AT1G01070.1 89278983:89280958 89278980-89280956
5 AT1G01090.1 89287790:89289247 89287787-89289245
6 AT1G01100.1 89290369:89290713 89290366-89290710

I am a beginner with R. Will I need to install it on my Ubuntu machine, or can I just use the terminal for that?

Thanks

Reply written 4.6 years ago by Manoj

I simply typed the following at the R prompt:

> common_file <- merge(file2.txt file1.txt, by = "V1")

It shows these errors:

Error: unexpected symbol in "common_file <- merge(file2.txt file1.txt"
> common_file <- merge(list2 list1, by = "V1")
Error: unexpected symbol in "common_file <- merge(list2 list1"

Reply written 4.6 years ago by Manoj

OK, it is a little more complicated than that. Perhaps James's solution is easier for you, but for the sake of completing my answer, here is what you need to do:

If you have R installed on your computer, open a terminal in the directory where your data files are and type R (upper case) to start R. I am going to call the data files list1.txt and list2.txt, but you can use any names. Then do the following:

setwd("./")  # set the working directory to the current directory
list1 <- read.table("list1.txt")  # read list1.txt into a data frame; assumes the file has no header row
colnames(list1) <- c("V1", "V2")  # give column names
list2 <- read.table("list2.txt")  # read list2.txt the same way; assumes no header row
colnames(list2) <- c("V1", "V2")
common_file <- merge(list2, list1, by = "V1")  # keep the rows whose V1 occurs in both
head(common_file)  # show the first few lines of common_file

# Write the common_file R object to a file
write.table(common_file, "common_file.txt", quote = FALSE, sep = "\t", row.names = FALSE)

The file "common_file.txt" should now be in the current directory.

Reply written 4.6 years ago by Janake
Prakki Rama (Singapore) wrote, 4.6 years ago:

For this kind of ID comparison, I would usually use Venny or some other Venn diagram generator, which achieves my purpose.


It compares exactly identical lines; however, I need to compare on overlapping numbers. Do you have any other way to compare them? I have posted about this in the following thread:

C: compare three test files

Reply written 4.6 years ago by Manoj