awk column number is a variable
2
1
Entering edit mode
3.8 years ago
DBScan ▴ 300

I am trying to filter GTEx for liver tissue. I was able to extract the sample names from the "SamplesAtrributesDS.txt" file and find the corresponding column number in the "transcript_expected_count.gct" file for each sample. I then want to use awk to extract the counts from the column number, but I always get this error:

awk: fatal: cannot open file `{print $myvar}' for reading (No such file or directory)

I use the following code:

#!/bin/bash
Genes=$(awk -F "\t" '{if ($6=="Liver") print $1}' GTEx_Analysis_v8_Annotations_SampleAttributesDS.txt)
for gene in $Genes
do
        colnr=$(head -n 3 GTEx_Analysis_2017-06-05_v8_RSEMv1.3.0_transcript_expected_count.gct | tail -n 1 | cut -f3- | tr '\t' '\n' | cat -n | grep $gene | cut -f 1)
        echo $colnr
        if [ -z "$colnr" ]
        then
                echo "\$colnr is empty"
        else
                echo $colnr
                awk -v myvar=$colnr '{print $myvar}' GTEx_Analysis_2017-06-05_v8_RSEMv1.3.0_transcript_expected_count.gct
        fi
done

I would be glad if someone can point out what I am doing wrong.

RNA-Seq awk bash • 2.5k views
ADD COMMENT
0
Entering edit mode

what is the output of

echo $colnr

before the error ? Furthermore, can you please use

echo "($colnr)"

to show if there is any space.

ADD REPLY
0
Entering edit mode

The output of echo "($colnr)" is

(   281)

Why is it that way?

ADD REPLY
0
Entering edit mode

because cat -n | insert some spaces before the line number.

So the command :

awk -v myvar=$colnr '{print $myvar}' file

becomes:

   awk -v myvar=  281  '{print $myvar}' file

where myvar is empty , 281 is the awk script and '{print $myvar}' and file are the filenames.

ADD REPLY
0
Entering edit mode

Remove $, try:

awk -v myvar=$colnr '{print myvar}'
ADD REPLY
2
Entering edit mode

no I think OP wants to print the myvar-th column

ADD REPLY
2
Entering edit mode
3.8 years ago

I would do something like this, in python, to filter for the columns of interest.

#!/usr/bin/env python3

path_ec = "~/Downloads/GTEx_Analysis_2017-06-05_v8_RSEMv1.3.0_transcript_expected_count.gct.gz"
path_sa = "~/Downloads/GTEx_Analysis_v8_Annotations_SampleAttributesDS.txt"

import pandas as pd
from itertools import count

SKIP_LINES_GTEx = 2
SEP = '\t'

df = pd.read_csv(path_sa, sep=SEP)
ids = df.SAMPID[df.SMTS == "Liver"]

all_cols = pd.read_csv(path_ec, sep=SEP, header=SKIP_LINES_GTEx, nrows=0).columns

# Not all SAMPIDs are in the transcript_expected_count file
ids = ids[ids.isin(all_cols)]

# Columns to keep
cols = ['transcript_id', 'gene_id'] + list(ids)

from tempfile import NamedTemporaryFile as TempFile
with TempFile(mode='w') as out_fd:

# OR:
# with open(output_filename, mode='w') as out_fd:

    print(F"Writing to temporary file {out_fd.name}")

    # Lines per chunk
    CHUNKSIZE = 1000

    # Read and write chunkwise
    for (df, n) in zip(pd.read_csv(path_ec, sep=SEP, header=SKIP_LINES_GTEx, chunksize=CHUNKSIZE, usecols=cols), count()):
        df.to_csv(out_fd, header=(n == 0), sep=SEP, mode='a', index=None)

    input("Done. Press ENTER")
ADD COMMENT
1
Entering edit mode
3.8 years ago

how about this ? (not verified)

$ wget -O - -q "https://storage.googleapis.com/gtex_analysis_v8/rna_seq_data/GTEx_Analysis_2017-06-05_v8_RSEMv1.3.0_transcript_expected_count.gct.gz" | gunzip -c | tail -n+3 | awk '(NR==1) {split($0,header);next;} {for(i=3;i<=NR;i++) printf("%s\t%s\t%s\t%s\n",$1,$2,header[i],$i);}' | grep -F -f <(wget -O - -q "https://storage.googleapis.com/gtex_analysis_v8/annotations/GTEx_Analysis_v8_Annotations_SampleAttributesDS.txt" | awk -F '\t' '($6=="Liver") {print $1;}' )
ENST00000481305.1   ENSG00000002933.7   GTEX-11DXY-0526-SM-5EGGQ    86.44
ENST00000484928.5   ENSG00000002933.7   GTEX-11DXY-0526-SM-5EGGQ    2781.95
ENST00000494349.5   ENSG00000002933.7   GTEX-11DXY-0526-SM-5EGGQ    1195.63
ENST00000000412.7   ENSG00000003056.7   GTEX-11DXY-0526-SM-5EGGQ    379.42
ENST00000536844.5   ENSG00000003056.7   GTEX-11DXY-0526-SM-5EGGQ    0
ENST00000537621.1   ENSG00000003056.7   GTEX-11DXY-0526-SM-5EGGQ    22.62
ENST00000539143.5   ENSG00000003056.7   GTEX-11DXY-0526-SM-5EGGQ    94.39
ENST00000540837.1   ENSG00000003056.7   GTEX-11DXY-0526-SM-5EGGQ    2.81
ENST00000541507.5   ENSG00000003056.7   GTEX-11DXY-0526-SM-5EGGQ    457.08
ENST00000543159.1   ENSG00000003056.7   GTEX-11DXY-0526-SM-5EGGQ    23.87
ENST00000543258.1   ENSG00000003056.7   GTEX-11DXY-0526-SM-5EGGQ    2.26
ENST00000543704.5   ENSG00000003056.7   GTEX-11DXY-0526-SM-5EGGQ    0
ENST00000543834.1   ENSG00000003056.7   GTEX-11DXY-0526-SM-5EGGQ    0
ENST00000543845.1   ENSG00000003056.7   GTEX-11DXY-0526-SM-5EGGQ    0
ENST00000544193.1   ENSG00000003056.7   GTEX-11DXY-0526-SM-5EGGQ    0
ENST00000544245.1   ENSG00000003056.7   GTEX-11DXY-0526-SM-5EGGQ    0
ENST00000262820.7   ENSG00000003096.14  GTEX-11DXY-0526-SM-5EGGQ    8.07
ENST00000262820.7   ENSG00000003096.14  GTEX-11DXZ-0126-SM-5EGGY    4.05
ENST00000371876.5   ENSG00000003096.14  GTEX-11DXY-0526-SM-5EGGQ    2.24
ENST00000371876.5   ENSG00000003096.14  GTEX-11DXZ-0126-SM-5EGGY    0
(...)
ADD COMMENT

Login before adding your answer.

Traffic: 1943 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6