How to retrieve specific rows from a data matrix based on Affymetrix ID in Linux
3
1
Entering edit mode
8.0 years ago
Mo ▴ 920

Hello,

I have a matrix from which I am trying to retrieve specific rows and save them to a text file.

An example matrix is

Affy ID       DDM1       DGKI2      FDGYYY1     GUHIL6
1438_at       0.0635     0.2065     -0.2112     0.0856
1487_at       0.071      -0.1315    0.0263      0.0198
1494_f_at     0.0045     -0.0237    0.0156      -0.1352
1598_g_at     -0.0541    0.0006     -0.1369     -0.0589
160020_at     -0.0925    0.2182     -0.1967     -0.0074
1729_at       -0.0017    -0.2209    -0.086      -0.0709
1773_at       -0.0273    -0.0181    0.1042      0.0136
177_at        -0.0276    -0.2563    0.3975      -0.0535
179_at        -0.0472    0.0979     -0.216      -0.2814
1861_at       0.0121     -0.4038    0.0016      0.0334
200000_s_at   -0.1021    -0.0887    -0.0452     0.0035


Let's say the matrix is in a file named M.txt and the IDs of the selected rows are listed in a file named mSelected.txt (containing 1438_at, 1729_at and 200000_s_at). My output should look like the following file:

Affy ID       DDM1       DGKI2      FDGYYY1     GUHIL6
1438_at       0.0635     0.2065     -0.2112     0.0856
1729_at       -0.0017    -0.2209    -0.086      -0.0709
200000_s_at   -0.1021    -0.0887    -0.0452     0.0035


Is there also any way to convert the Affy IDs to gene names?

$ head /Users/Desktop/mSelected.txt | cat -vet
1438_at^M1729_at^M200000_s_at
$ head /Users/Desktop/m.txt | cat -vet
Affy ID                             ^I DDM1               ^IDGKI2                ^IFDGYYY1       ^I GUHIL6^M1438_at^I0.0635^I0.2065^I-0.2112^I0.0856^M1487_at^I0.071^I-0.1315^I0.0263^I0.0198^M1494_f_at^I0.0045^I-0.0237^I0.0156^I-0.1352^M1598_g_at^I-0.0541^I0.0006^I-0.1369^I-0.0589^M160020_at^I-0.0925^I0.2182^I-0.1967^I-0.0074^M1729_at^I-0.0017^I-0.2209^I-0.086^I-0.0709^M1773_at^I-0.0273^I-0.0181^I0.1042^I0.0136^M177_at^I-0.0276^I-0.2563^I0.3975^I-0.0535^M179_at^I-0.0472^I0.0979^I-0.216^I-0.2814^M1861_at^I0.0121^I-0.4038^I0.0016^I0.0334^M200000_s_at^I-0.1021^I-0.0887^I-0.0452^I0.0035

affymetrix linux matrix • 2.7k views
2
Entering edit mode
8.0 years ago
5heikki 11k
awk 'FNR==NR{a[$0];next}($1 in a)' mSelected.txt M.txt


If the files are huge, join could be the fastest way:

 join -t "<TAB>*" -1 1 -2 1 -o 1.1,1.2,1.3,1.4,1.5 <(sort -k1,1 M.txt) <(sort -k1,1 mSelected.txt) > output


* <TAB> stands for a literal tab character, typed as Ctrl-V followed by Tab.
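Note that the awk one-liner prints only rows whose first field is in the ID list, so the header row is dropped. A minimal sketch that also keeps the header, assuming tab-separated input with LF line endings and the header on line 1 (the sample data and the array name keep are just for illustration):

```shell
#!/usr/bin/env bash
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Small tab-separated stand-in for M.txt (LF line endings).
printf 'Affy ID\tDDM1\tDGKI2\tFDGYYY1\tGUHIL6\n'           >  M.txt
printf '1438_at\t0.0635\t0.2065\t-0.2112\t0.0856\n'       >> M.txt
printf '1487_at\t0.071\t-0.1315\t0.0263\t0.0198\n'        >> M.txt
printf '1729_at\t-0.0017\t-0.2209\t-0.086\t-0.0709\n'     >> M.txt
printf '200000_s_at\t-0.1021\t-0.0887\t-0.0452\t0.0035\n' >> M.txt

# One probe ID per line, as in mSelected.txt.
printf '1438_at\n1729_at\n200000_s_at\n' > mSelected.txt

# First file: remember every ID. Second file: print the header (FNR==1)
# and any row whose first tab-separated field is a remembered ID.
awk -F'\t' 'FNR==NR {keep[$1]; next} FNR==1 || ($1 in keep)' \
    mSelected.txt M.txt > output.txt

cat output.txt
```

Splitting on tabs with -F'\t' keeps "Affy ID" together as one field; the default whitespace splitting would also work for the ID lookup itself, since the probe IDs contain no spaces.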

1
Entering edit mode

Thanks. The file is indeed huge, and the awk command runs without errors, but I don't see any output. Do you have any idea where the output is saved? Note that I first run cat on both files, as follows:

cat mSelected.txt M.txt


then I run your awk line

0
Entering edit mode

You don't want to use cat when the awk command is designed to read from the files itself. And the output is stored in the file that follows the > operator in any UNIX command; without a redirect it is printed to the terminal.

1
Entering edit mode

I used the following command but the output is empty.

awk 'FNR==NR{a[$0];next}($1 in a)' /User/Mohammad/Desktop/mSelected.txt /User/Mohammad/Desktop/M.txt > output.txt


Do you know where the problem could be?

0
Entering edit mode

Could you give us the output of:

head /User/Mohammad/Desktop/mSelected.txt | cat -A


and

head /User/Mohammad/Desktop/M.txt | cat -A

0
Entering edit mode

head /User/Mohammad/Desktop/mSelected.txt | cat -vet

1
Entering edit mode

Is that a Mac cat? I knew there was a reason I always used cat -te and not cat -A
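For reference, a small sketch of what cat -vet shows, which should behave the same with GNU and BSD cat: tabs appear as ^I, carriage returns as ^M, and each LF-terminated line ends in $. The file names demo.txt and shown.txt are made up for the example.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# One row with a tab, ending in CRLF (Windows style).
printf 'a\tb\r\n' > demo.txt

# -v makes control characters visible, -e marks line ends with $,
# -t shows tabs as ^I.
cat -vet demo.txt > shown.txt
cat shown.txt
```

This should print a^Ib^M$. A file whose rows are separated by CR alone shows ^M between the rows and no $ until the very end, which is the pattern in the question's output.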

2
Entering edit mode

brew install coreutils makes life with Mac OS X so much easier..

1
Entering edit mode

Yep. That's the reason I don't remember BSD-specific syntax. I have Homebrew managing GNU coreutils with bash on my Mac.

0
Entering edit mode

I do use TextWrangler but my real data is HUGE and I am afraid I won't be able to paste or even open it. I am searching for a way to make it right

I used

sed 's/^M$//' M.txt > MCorrected.txt
sed 's/^M$//' mSelected.txt > mSelectedCorrected.txt


then I used

awk 'FNR==NR{a[$0];next}($1 in a)' /User/Mohammad/Desktop/mSelectedCorrected.txt /User/Mohammad/Desktop/MCorrected.txt > output.txt


The same as before, I get an empty output

1
Entering edit mode

You can use pastebin or GitHub gist

1
Entering edit mode

You'd need to use extended sed with the regular expression. Also, your regex cannot be ^M$ if you wish to match a ^M character at the end of the line, because ^ is a meta for the beginning of the line. Use

sed -e 's/\r//' input >output

Also, you should not need to open the entire file to just pick the top 10 lines.

0
Entering edit mode

It did not work RamRS, still getting an empty output

1
Entering edit mode

Just post the whole file somewhere (dropbox, google drive, etc.) so someone can just directly determine what you need to do to clean it. The amount of effort that others have put into doing this remotely is a bit excessive.

1
Entering edit mode

I found where the problem was and I solved it! Thank you very much

0
Entering edit mode

0
Entering edit mode

I'd missed that in the deluge of comments :P

0
Entering edit mode

I think everyone did. I really think we need to add instructions on Add Comment/Answer/Post to use Gist or Pastebin to add files that are too large, not optimal for the viewer here. Plus, they also have better formatting, syntax highlighting and line numbering, so I'd prefer that any day over pasting long code here.

0
Entering edit mode

Did you try to convert the files by any other way? You could e.g. try the "tr method" from my link. You are doing science, show some initiative.. Then when you do the head command on the corrected M.txt, it should look like:

value^Ivalue^Ivalue^Ivalue^Ivalue$
value^Ivalue^Ivalue^Ivalue^Ivalue$
value^Ivalue^Ivalue^Ivalue^Ivalue$


While the corrected mSelected.txt should look like:

value$
value$


And then if the awk command still returns empty then it means that the IDs from mSelected.txt do not exist in the first column of M.txt.
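A rough sketch of how that last check could be done, assuming both files are already LF-terminated and M.txt is tab-separated. The ID 9999_at and the file missing.txt are invented for the example.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'Affy ID\tDDM1\n1438_at\t0.0635\n1729_at\t-0.0017\n' > M.txt
# "9999_at" is a made-up ID that is deliberately absent from M.txt.
printf '1438_at\n9999_at\n' > mSelected.txt

# With proper LF endings this reports one line per row; a file whose
# rows are separated by CR only reports 0 or 1.
wc -l < M.txt

# Print every ID from mSelected.txt that never appears in column 1 of
# M.txt. Note the reversed file order: M.txt is read first this time.
awk -F'\t' 'FNR==NR {seen[$1]; next} !($1 in seen)' \
    M.txt mSelected.txt > missing.txt

cat missing.txt
```

If missing.txt lists every ID, the two files most likely still disagree on line endings or stray whitespace rather than on the IDs themselves.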

0
Entering edit mode

Your files do not appear to have LF end of line markers. So here, e.g. awk sees only one line in your mSelected.txt file that contains all three patterns, when they should be on separate lines. Here are some ways for converting your files, or if you have e.g. TextWrangler installed you can do it there..

sed 's/^M$//' M.txt > MCorrected.txt
sed 's/^M$//' mSelected.txt > mSelectedCorrected.txt


And then use the corrected files for the awk command. Should work..
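If the rows really are separated by CR alone, as the cat -vet output in the question suggests, deleting a trailing CR cannot split them into lines. A possible alternative is the tr approach mentioned elsewhere in this thread, sketched here on made-up miniature files; it assumes there are no LF characters in the input (on CRLF files it would leave an empty line after every row).

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Recreate the symptom: rows separated by CR (\r) only, no LF at all.
printf '1438_at\r1729_at\r200000_s_at' > mSelected.txt
printf 'Affy ID\tDDM1\r1438_at\t0.0635\r1487_at\t0.071\r1729_at\t-0.0017' > M.txt

# Turn every CR into LF so each row becomes its own line.
tr '\r' '\n' < mSelected.txt > mSelectedCorrected.txt
tr '\r' '\n' < M.txt         > MCorrected.txt

# Same awk filter as in the answer, now run on the corrected files.
awk 'FNR==NR{a[$0];next}($1 in a)' \
    mSelectedCorrected.txt MCorrected.txt > output.txt

cat output.txt
```

tr streams the file rather than loading it, so this should also be workable on a very large matrix.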

0
Entering edit mode

That might be my mistake (during moderation copy-paste). I'll change that now.

Nope, OP's file looks like it has ^M characters plus a mix of tabs and spaces. Will def need a bit of cleaning.

0
Entering edit mode

I don't think you can replace ^M with a literal ^M. You might have to use \r

0
Entering edit mode
8.0 years ago

While you can just use grep to get the lines you want from the file, to convert the probe IDs to something useful you'll want to (1) load the file into R, (2) install and load the appropriate annotation package from Bioconductor, and (3) use the select() function followed by either merge() or left_join() from dplyr to actually annotate things.

0
Entering edit mode

Thanks for your message, I am trying to check them out!

0
Entering edit mode
8.0 years ago
Ram 37k
grep -f mSelected.txt M.txt >output.txt


Note: This is not the working solution. It's a lead. I should probably just have said "use grep -few"

2
Entering edit mode

Potential problems with this one if column 1 values can be found elsewhere. Also, I would include -w in case some ids are sub-strings of other ids..
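To illustrate the sub-string issue, a small sketch with an invented ID, 2177_at, whose name contains 177_at. The -F flag (treat patterns as fixed strings) is an addition here and not part of the original one-liner.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# "2177_at" is a made-up ID that contains "177_at" as a sub-string.
printf '177_at\t-0.0276\n2177_at\t0.0121\n1729_at\t-0.0017\n' > M.txt
printf '177_at\n' > mSelected.txt

# Plain sub-string matching also picks up 2177_at.
grep -F -f mSelected.txt M.txt > loose.txt

# -w only accepts matches that are whole words, so 2177_at is dropped.
grep -w -F -f mSelected.txt M.txt > strict.txt

cat loose.txt
cat strict.txt
```

Even with -w, grep searches the whole line, so an ID that happened to appear in another column would still match; the awk approach, which tests only the first field, avoids that.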

0
Entering edit mode

If I had a nickel for every time I've mucked something up by not using -w with grep...

0
Entering edit mode

I rarely use grep when the match is a \bPATTERN\b. For some reason, rarely do I get some perfect scenarios to work with.

0
Entering edit mode

Thanks, but with this line I only get an empty output

0
Entering edit mode

You'll have to tweak it with -w and such modifiers. You're better off using cat and xargs with the pattern ^<line>\b. This one liner was just supposed to be a lead to a solution, not the solution itself.
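One way the xargs idea might look, as a sketch only: it assumes LF-terminated, tab-separated files and IDs without regex metacharacters, and it uses a literal tab after the ID in place of \b, since \b is not supported by every grep. The sample data, including 2177_at, is invented.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '177_at\t-0.0276\n2177_at\t0.0121\n1729_at\t-0.0017\n' > M.txt
printf '1729_at\n177_at\n' > mSelected.txt

# Run one grep per ID. ^ pins the ID to the start of the line and the
# tab right after it keeps 177_at from matching a longer ID.
tab=$(printf '\t')
xargs -I{} grep "^{}${tab}" M.txt < mSelected.txt > output.txt

cat output.txt
```

This starts one grep per ID and rereads M.txt each time, so rows come out in the order of the ID list and it will be much slower than the awk version on a long list.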