Question: comparing two lists using command line tools
3.3 years ago by Farbod (3.3k), Toronto
Farbod wrote:

Dear Friends, Hi (Sorry if this question is simple or duplicated)

I have two lists of IDs, A.txt and B.txt, where B is larger than A (A = the IDs of the BLAST results, B = the IDs from the original EST database that BLAST was run against).

I want to compare them and collect the IDs that are present in list B (the bigger list) but absent from list A (the ESTs for which BLAST could not find any hit).

Please help me do this on the Linux command line (no Perl or Python, please. Thanks).

NOTE1: I usually do it as below, but I need a more straightforward approach:

1- sort both lists
2- $ comm A.txt B.txt | cut -f2 > specific-to-B.txt
3- $ sed -i '/^\s*$/d' specific-to-B.txt (because this file contains many blank lines)
4- the list is ready

NOTE2: An example of the list:

EV825482.1
EV825573.1
EV825616.1
EV825623.1
EV825663.1
EV825667.1
EV825673.1
EV825677.1
EV825680.1

bash blast sequence • 1.7k views
modified 3.3 years ago • written 3.3 years ago by Farbod (3.3k)

I want to compare them and collect the IDs that are present in the list B (bigger list) but absent in the list A (the ESTs that blast can not find any hit for them).

from the comm manual :

   -1     suppress column 1 (lines unique to FILE1)

   -2     suppress column 2 (lines unique to FILE2)

   -3     suppress column 3 (lines that appear in both files)

because this file contain many blank lines

WHY ??
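
One likely explanation, sketched on toy data: with no options, comm prints three tab-indented columns, and lines common to both files land in column 3 behind two tabs. `cut -f2` then returns an empty second field for each of those lines, which would account for the blank lines. The file names and IDs below are made up for illustration.

```shell
# Toy data (hypothetical IDs); both files are already sorted.
tmp=$(mktemp -d)
printf 'EV1\nEV3\n' > "$tmp/A.txt"
printf 'EV1\nEV2\nEV3\nEV4\n' > "$tmp/B.txt"

# Default comm output: column 2 (only in B) is prefixed by one tab,
# column 3 (in both files) by two tabs.
comm "$tmp/A.txt" "$tmp/B.txt"

# cut -f2 keeps the B-only IDs but emits an empty line for every common ID.
blank=$(comm "$tmp/A.txt" "$tmp/B.txt" | cut -f2 | grep -c '^$')
echo "blank lines produced by cut -f2: $blank"

rm -rf "$tmp"
```

With the column selection done by comm itself (the -1, -2, -3 options), no tabs or empty fields are left to clean up afterwards.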

modified 3.3 years ago • written 3.3 years ago by Pierre Lindenbaum (126k)

Dear Pierre Lindenbaum, Hi

When I use comm -1 fileA fileB | wc -l, it returns the number of lines in fileB (6088).

When I use comm -2 fileA fileB | wc -l, it returns the number of lines in fileA (5699).

When I use comm -3 fileA fileB | wc -l, it returns the number I want (389).

So I think the description "suppress column 3 (lines that appear in both files)" should be changed to "the lines that are in B and not in A".

Is it correct ?

~Je vous remercie infiniment

modified 3.3 years ago • written 3.3 years ago by Farbod (3.3k)

Is it correct ?

No, the correct way to get 'present in the list B but absent in the list A' is to combine '-1' (remove lines unique to fileA) and '-3' (remove lines that appear in both A and B):

comm -13 fileA fileB
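
A minimal sketch of the whole workflow on toy data, assuming the ID lists are not yet sorted (comm expects sorted input). The file names and IDs are placeholders, not the real data from this thread.

```shell
# Toy, unsorted ID lists (hypothetical IDs).
tmp=$(mktemp -d)
printf 'EV3\nEV1\n' > "$tmp/fileA"
printf 'EV4\nEV1\nEV3\nEV2\n' > "$tmp/fileB"

# comm requires both inputs to be sorted in the same collation order.
sort "$tmp/fileA" > "$tmp/A.sorted"
sort "$tmp/fileB" > "$tmp/B.sorted"

# -1 drops lines unique to A, -3 drops lines common to both,
# so only the IDs specific to B remain, without any leading tab.
only_b=$(comm -13 "$tmp/A.sorted" "$tmp/B.sorted")
printf '%s\n' "$only_b"

rm -rf "$tmp"
```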
written 3.3 years ago by Pierre Lindenbaum (126k)

is it normal ?

Yes, if there is no line unique to fileA.

But again, the correct way is 'comm -13'
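
To illustrate on toy data (hypothetical IDs): when fileA is a subset of fileB, column 1 is empty, so `comm -3` and `comm -13` report the same number of lines. As soon as fileA holds an ID that is missing from fileB, `comm -3` prints it too and the counts diverge. Note also that `comm -3` still indents the fileB-only lines with a tab, while `comm -13` does not.

```shell
tmp=$(mktemp -d)
printf 'EV1\nEV2\nEV3\nEV4\n' > "$tmp/fileB"

# Case 1: fileA is a subset of fileB, so both commands list the same 2 IDs.
printf 'EV1\nEV3\n' > "$tmp/fileA"
n3=$(comm -3 "$tmp/fileA" "$tmp/fileB" | wc -l | tr -d ' ')
n13=$(comm -13 "$tmp/fileA" "$tmp/fileB" | wc -l | tr -d ' ')
echo "subset case:     -3 gives $n3 lines, -13 gives $n13 lines"

# Case 2: fileA gains an ID (EV9) that is absent from fileB.
# -3 now also reports that A-only line; -13 still lists only B-specific IDs.
printf 'EV1\nEV3\nEV9\n' > "$tmp/fileA"
m3=$(comm -3 "$tmp/fileA" "$tmp/fileB" | wc -l | tr -d ' ')
m13=$(comm -13 "$tmp/fileA" "$tmp/fileB" | wc -l | tr -d ' ')
echo "non-subset case: -3 gives $m3 lines, -13 gives $m13 lines"

rm -rf "$tmp"
```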

written 3.3 years ago by Pierre Lindenbaum (126k)

OK Pierre,

That was the problem (or better to say, the "reason"): every ID in fileA (the BLAST results) also exists in fileB (the BLAST database)!

You are right as usual, ;-)

~ bon courage

modified 3.3 years ago • written 3.3 years ago by Farbod (3.3k)

Thanks for your time and support,

when I run "comm -3 fileA fileB | wc -l" and "comm -13 fileA fileB | wc -l", both results have 389 lines,

is it normal ?

written 3.3 years ago by Farbod (3.3k)

Are those results identical?

written 3.3 years ago by genomax (78k)

Hi,

Good question

And the answer is positive.

written 3.3 years ago by Farbod (3.3k)

Hi Farbod,

I know it's not what you're asking. But I do this exact thing in Galaxy all the time. Galaxy wraps common command line tools, but you have a nice graphical interface to work in and the results are really easy to visualise. There are a couple of text manipulation tools in there that you could do this with.

Thought I'd put a comment here just in case you're interested :-)

written 3.3 years ago by ando.kelli (40)

Dear Ando.kelli,

Hi and thank you for your clever advice,

would you name some of the programs you have used for this purpose in Galaxy, please ?

written 3.3 years ago by Farbod (3.3k)

Hi Farbod,

Sorry for the slow reply, I didn't get a notification saying that you responded to my comment.

If you install a local instance of Galaxy there are many text manipulation tools that are automatically included, and many that can be downloaded.

A list of available tools can be found at the main Galaxy Toolshed: https://toolshed.g2.bx.psu.edu/

You can go to this site and browse tools. Alternatively, you can go to https://usegalaxy.org/ and browse the list of options down the left-hand side. I think the headings most relevant to you are: Text Manipulation, Convert Formats, Join Subtract Group, and Filter and Sort.

Hope that helps.

Kelli

modified 3.2 years ago • written 3.2 years ago by ando.kelli (40)

3.3 years ago by Benn (7.9k), Netherlands
Benn wrote:

Why don't you use uniq?

cat fileA fileB | sort | uniq -d
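
For comparison, a sketch on toy data (hypothetical IDs), assuming neither list contains duplicate IDs: `uniq -d` keeps lines that occur more than once in the combined stream, which is the overlap, while `uniq -u` keeps lines that occur exactly once, which are the IDs found in only one of the two files.

```shell
tmp=$(mktemp -d)
printf 'EV1\nEV3\n' > "$tmp/fileA"
printf 'EV1\nEV2\nEV3\nEV4\n' > "$tmp/fileB"

# IDs present in both files (repeated lines in the merged, sorted stream).
overlap=$(sort "$tmp/fileA" "$tmp/fileB" | uniq -d)
echo "in both files:"
printf '%s\n' "$overlap"

# IDs present in exactly one file; when fileA is a subset of fileB,
# these are the IDs specific to fileB.
one_file_only=$(sort "$tmp/fileA" "$tmp/fileB" | uniq -u)
echo "in one file only:"
printf '%s\n' "$one_file_only"

rm -rf "$tmp"
```

Unlike `comm -13`, `uniq -u` cannot tell which file a line came from, so it only answers the original question when every ID in fileA is also in fileB.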
written 3.3 years ago by Benn (7.9k)

Dear b.nota, Hi and thanks

With my script, wc -l returns 389, and with your script it shows 5699 (which is the number of lines in fileA) ;-)

written 3.3 years ago by Farbod (3.3k)

Sorry, you're right. I thought you wanted the overlap... My bad!

written 3.3 years ago by Benn (7.9k)

Hi,

Not at all, it was a useful script for me,

Thanks

written 3.3 years ago by Farbod (3.3k)