Question: How to compare several ASCII files in LINUX
7 months ago by
jomagrax0
Spain
jomagrax0 wrote:

Hi everyone, I have several ASCII files containing genes expressed in different experimental conditions (apple.conditionA, apple.conditionB, apple.conditionC). The first column of each file contains the gene name; the other columns hold information such as the chromosome the gene is on, its direction, etc. I need to extract into a .txt file (apple.genesA) the names of the genes that are expressed only in condition A, using Linux commands.

Thanks in advance

linux rna-seq • 442 views
ADD COMMENTlink modified 7 months ago • written 7 months ago by jomagrax0

Please be more specific. ASCII doesn’t narrow down what type of file you are trying to analyse, only the text encoding style.

ADD REPLYlink written 7 months ago by jrj.healey13k

OK, sorry. The files all have the same structure, with several columns. I need to compare the first column of all of them (where the gene names are) and then extract the genes that are unique to one specific file.

ADD REPLYlink written 7 months ago by jomagrax0

You still haven’t told us what the files are.

Edit your question to include some example input data and the kind of structure you would like output.

ADD REPLYlink written 7 months ago by jrj.healey13k

OK, I hope it's clear now. Thank you for your time.

ADD REPLYlink written 7 months ago by jomagrax0

It's not. Tell us how you obtained the files, which software was used and show an example.

ADD REPLYlink written 7 months ago by WouterDeCoster39k

We are not mind readers. How are we supposed to know what ‘condition A’ is, if you don’t show us the data? At the moment “is expressed in condition A” could be a Boolean, it could be some integer value, a floating point > some threshold?

If the data is confidential or something, you can make a mockup of the file which follows the same patterns with different context.

Please put in more effort, or else we will just close this post.

ADD REPLYlink written 7 months ago by jrj.healey13k

MDP0000303933 MDP0000303933 chr1 - 4276 5447

This is, for instance, the first line of the apple.conditionA file. The first column holds the gene name, the second column holds the RNA read that was sequenced and assigned to the gene in column one, and the remaining columns give the chromosome, the direction of the gene, and its coordinates.

All three files have the same structure. Using Linux command-line tools, is there a way to extract the genes that are unique to apple.conditionA, compared with apple.conditionB and apple.conditionC?

Sorry for the vagueness of my questions but this is all very new to me, once again thank you for your help

ADD REPLYlink modified 7 months ago • written 7 months ago by jomagrax0

So the question is you just want all the lines in the condition A files which are unique (i.e. not in file B and C), based on column 1?

ADD REPLYlink written 7 months ago by jrj.healey13k

Yes, exactly! Thanks

ADD REPLYlink written 7 months ago by jomagrax0

Please refer to my solution below. You can achieve this by grep-ing against A for all patterns not matching the cat-ed first columns of B and C, which are gotten by cut-ing files B and C.

ADD REPLYlink written 7 months ago by RamRS22k
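
The grep approach described in this comment could look roughly like the following. This is a minimal sketch, not taken from the thread: it assumes bash (for process substitution) and tab-separated files, and every gene ID except MDP0000303933 is made-up mock data. Note that grep looks for the names anywhere on the line, not only in column 1, so a name that also appears in another column would exclude that line too.

```shell
#!/usr/bin/env bash
# Mock data: only MDP0000303933 comes from the thread; the other IDs and
# coordinates are invented for illustration. Columns are tab-separated.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'MDP0000303933\tMDP0000303933\tchr1\t-\t4276\t5447\n'  > apple.conditionA
printf 'MDP0000000001\tMDP0000000001\tchr2\t+\t100\t900\n'   >> apple.conditionA
printf 'MDP0000000002\tMDP0000000002\tchr3\t+\t50\t700\n'    >> apple.conditionA
printf 'MDP0000000001\tMDP0000000001\tchr2\t+\t100\t900\n'    > apple.conditionB
printf 'MDP0000000002\tMDP0000000002\tchr3\t+\t50\t700\n'     > apple.conditionC

# -F: treat patterns as fixed strings, -w: match whole words only,
# -v: keep the lines that do NOT match, -f: read the patterns from a file
# (here, the de-duplicated first columns of B and C).
grep -v -w -F -f <(cut -f1 apple.conditionB apple.conditionC | sort -u) \
    apple.conditionA | cut -f1 | sort -u > apple.genesA

cat apple.genesA
```

With the mock data above, apple.genesA ends up containing only MDP0000303933, the one gene name that appears in neither B nor C.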

I think I have it:

$  cut -f1 apple.conditionA > compare | cut -f1 apple.conditionB apple.conditionC > tocompare
$ comm -12 compare tocompare

This way I need to create two files and to use two lines, but I can't think of anything else.

ADD REPLYlink modified 7 months ago by RamRS22k • written 7 months ago by jomagrax0

Temporary files work fine, but if you wish to not use files, check out process substitution:

sort -u <(cut -f1 file.txt)

is the same as

cut -f1 file.txt | sort -u

Also, please don't use the command >file | command2 syntax. It may work now because your shell doesn't have MULTIOS enabled, but with MULTIOS enabled it will send the output of cut -f1 apple.conditionA both to the file compare and downstream to cut -f1 apple.conditionB, mangling the output and introducing unpredictable bugs, bordering on file corruption, into your pipeline.

ADD REPLYlink modified 7 months ago • written 7 months ago by RamRS22k
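
For the temporary-file route, the two cut commands are independent of each other, so they can be joined with && (or simply written on separate lines) instead of a pipe. The sketch below is one possible way to write it, not the poster's exact commands: it assumes tab-separated files, uses invented gene IDs apart from MDP0000303933, sorts both lists because comm expects sorted input, and uses comm -23 so that only names found exclusively in the first list are kept.

```shell
# Mock data: only MDP0000303933 comes from the thread; the other IDs and
# coordinates are invented for illustration. Columns are tab-separated.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'MDP0000303933\tMDP0000303933\tchr1\t-\t4276\t5447\n'  > apple.conditionA
printf 'MDP0000000001\tMDP0000000001\tchr2\t+\t100\t900\n'   >> apple.conditionA
printf 'MDP0000000001\tMDP0000000001\tchr2\t+\t100\t900\n'    > apple.conditionB
printf 'MDP0000000002\tMDP0000000002\tchr3\t+\t50\t700\n'     > apple.conditionC

# Two independent commands joined with && rather than a pipe:
# each one writes its own sorted, de-duplicated list of gene names.
cut -f1 apple.conditionA | sort -u > compare &&
cut -f1 apple.conditionB apple.conditionC | sort -u > tocompare

# comm -23 suppresses columns 2 and 3, leaving lines found only in "compare".
comm -23 compare tocompare > apple.genesA

cat apple.genesA
```

The file names compare and tocompare are kept from the thread; apple.genesA is the output name from the original question.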

Ok, so finally I got

$ comm -12 <(cut -f1 apple.conditionA) <(cut -f1 apple.conditionB apple.conditionC)

I didn't know process substitution existed, thank you very, very much!!

ADD REPLYlink written 7 months ago by jomagrax0
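
A note on the comm options, since they are easy to mix up: comm -12 prints the lines common to both inputs, comm -23 prints the lines found only in the first input, and comm expects both inputs to be sorted. For the original goal (genes only in condition A), a variant along these lines may be closer to what is wanted. This is a hedged sketch assuming bash and tab-separated files; all gene IDs other than MDP0000303933 are invented mock data.

```shell
#!/usr/bin/env bash
# Mock data: only MDP0000303933 comes from the thread; the other IDs and
# coordinates are invented. File A is deliberately left unsorted.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'MDP0000303933\tMDP0000303933\tchr1\t-\t4276\t5447\n'  > apple.conditionA
printf 'MDP0000000001\tMDP0000000001\tchr2\t+\t100\t900\n'   >> apple.conditionA
printf 'MDP0000000001\tMDP0000000001\tchr2\t+\t100\t900\n'    > apple.conditionB
printf 'MDP0000000002\tMDP0000000002\tchr3\t+\t50\t700\n'     > apple.conditionC

# Sort (and de-duplicate) inside each process substitution, because comm
# compares its two inputs line by line and assumes they are already sorted.
only_in_A=$(comm -23 <(cut -f1 apple.conditionA | sort -u) \
                     <(cut -f1 apple.conditionB apple.conditionC | sort -u))
shared=$(comm -12 <(cut -f1 apple.conditionA | sort -u) \
                  <(cut -f1 apple.conditionB apple.conditionC | sort -u))

printf 'only in A: %s\n' "$only_in_A"
printf 'shared:    %s\n' "$shared"
printf '%s\n' "$only_in_A" > apple.genesA
```

With this mock data, the -23 call yields MDP0000303933 and the -12 call yields MDP0000000001, which shows the difference between the two option sets.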

Yep, it's as simple as that!

ADD REPLYlink written 7 months ago by jrj.healey13k
7 months ago by
RamRS22k
Houston, TX
RamRS22k wrote:

You could use a combination of cut and diff or cut and comm or cut and grep to get to your results. Of course, you can also substitute cut with awk or do the entire thing in python or R.

Given how vague and obfuscated your description is at the moment, this is all the help I can give you.

ADD COMMENTlink written 7 months ago by RamRS22k
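
As an illustration of the awk alternative mentioned in this answer, the whole comparison can be done in a single awk call. The following is only a sketch under stated assumptions, not a command from the thread: the files are assumed to be tab-separated, condition A must be passed as the last argument, and all gene IDs other than MDP0000303933 are invented mock data.

```shell
# Mock data: only MDP0000303933 comes from the thread; the other IDs and
# coordinates are invented. The first gene appears twice in A on purpose.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'MDP0000303933\tMDP0000303933\tchr1\t-\t4276\t5447\n'  > apple.conditionA
printf 'MDP0000000001\tMDP0000000001\tchr2\t+\t100\t900\n'   >> apple.conditionA
printf 'MDP0000303933\tMDP0000303933\tchr1\t-\t4276\t5447\n' >> apple.conditionA
printf 'MDP0000000001\tMDP0000000001\tchr2\t+\t100\t900\n'    > apple.conditionB
printf 'MDP0000000002\tMDP0000000002\tchr3\t+\t50\t700\n'     > apple.conditionC

awk -F'\t' '
    # Every file except the last argument: remember the gene name in column 1.
    FILENAME != ARGV[ARGC - 1] { seen[$1] = 1; next }
    # Last file (condition A): print each gene name not seen elsewhere, once.
    !($1 in seen) && !printed[$1]++ { print $1 }
' apple.conditionB apple.conditionC apple.conditionA > apple.genesA

cat apple.genesA
```

Unlike the comm-based approach, this needs no sorting and keeps the gene names in the order in which they first appear in apple.conditionA.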