Compare and search a list of IDs in two text files
8.6 years ago
Kumar ▴ 170

Hi,

I have two text files, shown below. I want the output to contain only the lines whose IDs occur in both files. I tried the grep commands and the Perl script below, but I did not get the desired output.

$ grep -w -f file1.txt file2.txt >out.txt
$ grep -wFf file1.txt file2.txt >out.txt
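
A likely reason these return nothing: with -f file1.txt, grep uses each whole line of file1.txt as a pattern, coordinates included, and the coordinates are never identical in the two files. A minimal sketch that passes only the first column to grep, assuming whitespace-separated columns and GNU grep (for reading patterns from standard input with -f -); the function name match_ids is made up for illustration:

```shell
# match_ids is a made-up helper name, not part of grep.
# Prints the lines of the second file that contain an ID from
# column 1 of the first file.
match_ids() {
    # awk emits column 1 of every non-blank line;
    # "-f -" tells GNU grep to read its patterns from standard input.
    awk 'NF { print $1 }' "$1" | grep -wFf - "$2"
}

# Usage: match_ids file1.txt file2.txt > out.txt
```

Note that grep looks for the ID anywhere on the line, not only in the first column, which is harmless here because the second column holds only coordinates.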

Perl script:

use strict;
use warnings;
use autodie;

my $f1 = shift || "file1.txt";
my $f2 = shift || "file2.txt";

my %results;
open my $file1, '<', $f1;

while (my $line = <$file1>)
{
    $results{$line} = 1;
}
open my $file2, '<', $f2;

while (my $line = <$file2>)
{
    $results{$line}++
}

foreach my $line (sort { $results{$b} <=> $results{$a} } keys %results)
{
    print "$results{$line} Match found: ", $line if $results{$line} > 1;
}

file 1:

AT1G01020.2  89247399:89248747
AT1G01050.1  89271467:89272751
AT1G01060.1  89274076:89277002
AT1G01070.1  89278983:89280958
AT1G01073.1  34927896:34928000
AT1G01090.1  89287790:89289247
AT1G01100.1  89290369:89290713
AT1G01100.3  81592809:81592958
AT1G01130.1  89302125:89303893
...........

file 2:

AT1G01010.1  89243839-89245706
AT1G01020.1  89246997-89247311
AT1G01020.1  89248315-89248745
AT1G01030.1  89251946-89253019
AT1G01040.1  89263598-89270896
AT1G01050.1  89271464-89272749
AT1G01060.1  89274074-89276072
AT1G01060.1  89276890-89277000
AT1G01070.1  89278980-89280956
AT1G01090.1  89287787-89289245
AT1G01100.1  89290366-89290710
...........
alignment • 6.6k views

What exactly do you need as output? A list of IDs that occur in both files or similar IDs (matching without the number after the '.')? Do you only need the IDs or do you need the second column too?


I updated my script, which may be an improvement, but it still compares the two files on whole lines.

However, I need the comparison to use only the first column and ignore the remaining columns, and then to print the matches along with some information, such as the total number of times each ID is repeated in both files.

Thanks


The following command works well, but I cannot get it to report extra information, such as the total number of repeated matches for particular IDs in both files.

$ awk 'NR==FNR{tgts[$1]; next} $1 in tgts' file1 file2

Once you have a file with the particular IDs you want, try the following command to get a count of each ID:

sort your_file.txt | uniq -c
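
If you also want to see how often each shared ID occurs in each of the two original files, a single awk pass might do it. This is only a sketch, assuming whitespace-separated columns with the ID in column 1; count_shared_ids is a made-up name:

```shell
# count_shared_ids is a made-up helper name.
# For every column-1 ID present in both files, prints:
#   ID  count_in_first_file  count_in_second_file
count_shared_ids() {
    awk '
        NF == 0   { next }                 # skip blank lines
        NR == FNR { n1[$1]++; next }       # first file: tally every ID
        $1 in n1  { n2[$1]++ }             # second file: tally shared IDs only
        END       { for (id in n2) print id, n1[id], n2[id] }
    ' "$1" "$2" | sort
}

# Usage: count_shared_ids file1.txt file2.txt
```

For the excerpt shown in the question, AT1G01060.1 would be reported with a count of 1 for file 1 and 2 for file 2.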
8.6 years ago
James Ashmore ★ 3.4k

You could also do this with the UNIX command join:

join -1 1 -2 1 file1.txt file2.txt | awk '{print $1}'

It attempts to join the two files together based on common entries in the first column in each file.


Note that both files should be sorted before using join.


It's also a good idea to specify the field separator, e.g. -t $'\t'
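
Putting this and the previous comment together, a hedged sketch: sort both files on the first column before calling join, using one locale so that sort and join agree on the ordering. The name shared_ids and the temporary files are illustrative assumptions. Fields are split on blanks here; a -t option would only be added for strictly tab-separated files.

```shell
# shared_ids is a made-up helper name.
# Prints each column-1 ID that occurs in both files, once.
shared_ids() {
    sorted1=$(mktemp) || return 1
    sorted2=$(mktemp) || return 1
    # join needs both inputs sorted on the join field, in the same collation order.
    LC_ALL=C sort -k1,1 "$1" > "$sorted1"
    LC_ALL=C sort -k1,1 "$2" > "$sorted2"
    # Default field separator: runs of blanks. uniq drops repeated IDs.
    LC_ALL=C join -1 1 -2 1 "$sorted1" "$sorted2" | awk '{print $1}' | uniq
    rm -f "$sorted1" "$sorted2"
}

# Usage: shared_ids file1.txt file2.txt
```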


The output of this command contains the data from both files.

Thanks


I've updated my answer so that it outputs only the IDs, which are in the first column of the results.

8.6 years ago
Janake ▴ 170

If you don't need to use Perl, you can achieve this pretty easily in R.

Read your file1.txt and file2.txt into R (here as the objects list1 and list2) and label the gene-name column V1 and the second column V2 (or any other names you want).

Then use the merge function to merge the two files and get the common lines.

common_file <- merge(list2, list1, by = "V1")
common_file

           V1              V2.x              V2.y
1 AT1G01050.1 89271467:89272751 89271464-89272749
2 AT1G01060.1 89274076:89277002 89274074-89276072
3 AT1G01060.1 89274076:89277002 89276890-89277000
4 AT1G01070.1 89278983:89280958 89278980-89280956
5 AT1G01090.1 89287790:89289247 89287787-89289245
6 AT1G01100.1 89290369:89290713 89290366-89290710

I am a beginner in R. Do I need to install it on my Ubuntu machine, or can I just use the terminal?

Thanks


I simply typed the following at the R prompt:

> common_file <- merge(file2.txt file1.txt, by = "V1")

It shows these errors:

Error: unexpected symbol in "common_file <- merge(file2.txt file1.txt"
> common_file <- merge(list2 list1, by = "V1")
Error: unexpected symbol in "common_file <- merge(list2 list1"

OK, it is a little more complicated than that. Perhaps James's solution is easier for you, but to complete my answer, here is what you need to do:

If you have R installed on your computer, open a terminal in the directory where your data files are and type R (upper case) to start R. I am going to call the data files list1.txt and list2.txt, but you can use any names. Then do the following:

setwd("./") #Set working directory to current directory
list1 <- read.table("list1.txt") # create R object from your list1.txt file and assumes no column names.
colnames(list1) <- c("V1", "V2") # give column names
list2 <- read.table("list2.txt") # create R object from your list2.txt file and assumes no column names.
colnames(list2) <- c("V1", "V2")
common_file <- merge(list2, list1, by = "V1")
head(common_file) #To see the first few lines of the common_file

#To write the common_file R object to a file
write.table(common_file, "common_file.txt", quote = FALSE, sep = "\t", row.names = FALSE)

The common_file.txt file should now be in the current directory.

8.6 years ago
Prakki Rama ★ 2.7k

Usually, for this kind of ID comparison, I would use Venny or some other Venn diagram generator, which serves the purpose.


It compares only exactly identical lines, whereas I need to compare on the basis of overlapping numbers. Do you have any other way to compare them? I have posted a question along those lines here:

compare three test files
