extract columns from file based on header selected from 2nd file
3
1
Entering edit mode
5.9 years ago

I have one large text table, tab separated. The first row is a header. I then have a second text file that contains a subset of the headers in the first file. I want to extract all the columns of the first file whose header is contained in the list given in the second file. Here an example of the inputs and the desired output:

DATA.TXT

   head0 head1 head2 head3 head4
1 25 1364 22 13
2 10 215 1 22  

LIST.TXT

head0
head4  

Desired output:

head0 head4
1 13
2 22
extract columns ID • 10k views
0
Entering edit mode

Can we have a few actual examples of the header names? It's easy if they do actually end 0, 1, 2, 3 ... but I assume they don't?

0
Entering edit mode

How can i do it in R? Thanks

0
Entering edit mode

How can i do it in R? Thanks

4
Entering edit mode
5.9 years ago
iraun ★ 3.8k

Just with a simple google search a hundred of questions/answers related to this topic can be easily found, specially, in stackoverflow: http://unix.stackexchange.com/questions/114829/display-column-that-has-a-specific-string

I'd suggest you to try to solve this kind of problems searching on internet and trying yourself before asking. In this case, you can get what you want with a little modification to the solution proposed on the link:

awk -F'\t' 'NR==FNR{arr[$1]++;next}{for(i=1; i<=NF; i++) if ($i in arr){a[i]++;}} { for (i in a) printf "%s\t", $i; printf "\n"}' file2 file1  ADD COMMENT 0 Entering edit mode This works perfectly. However, if LIST.TXT contains headers which do not exist in DATA.TXT, say headX and headY, and their order is "head0, headX, head4, headY", can the output include null values for headX and headY columns and also print output in the same order as in LIST.TXT file? ADD REPLY 1 Entering edit mode 5.9 years ago george.ry ★ 1.1k Ok, regardless of my comment above ... two bash commands should see you done ... paste <(seq 1$(head -1 DATA.TXT | wc -w))  <(head -1 DATA.TXT | tr '\t' '\n') > TMP
cat DATA.TXT | cut -f$(grep -wFf LIST.TXT TMP | cut -f1 | sed ':a;N;$!ba;s/\n/,/g') > OUTPUT


It's the sort of thing that's easier in R or something, though!

// EDIT // On a Mac (useless BSD rather than GNU tools!) apparently you'd have to alter the sed command there to ...

sed -e ':a' -e 'N' -e '\$!ba' -e 's/\n/,/g'

0
Entering edit mode

Thanks George, but it outputs this error message:

cut: option requires an argument -- 'f'

0
Entering edit mode
5.9 years ago

if you know the columns in advance, and depending upon the delimitator, you can simply use a cut command. If tabs are used, in this case you select the first and fifth column

cat DATA.TXT | cut -f 1,5 > output.file


This requires you to reveal the desired columns. If this is easy to check, see how easy the code is

If the delimiting is different to a tab, such as an empty space

cat DATA.TXT | cut -d " " -f 1,5 > output.file


I dare to put this here because of the simplicity of this code