Question

How to extract specific rows based on row number from a file

5.9 years ago

vinayjrao ▴ 250

I am working on an RNA-Seq data set of around 24,000 rows (genes) and 1,100 columns (samples), stored as a tab-separated file. For the analysis, I need to pick a specific gene set. Is there a way to extract rows by row number? That would be easier for me than selecting by gene name.

Below is a 4x4 example of the data:

gene    Sample1    Sample2    Sample3
A1BG       5658    5897      6064
AURKA    3656    3484      3415
AURKB    9479    10542    9895

From this I want, for example, rows 1, 3 and 4; the row numbers follow no particular pattern.

Thanks.

P.S. I first asked this question on stackoverflow.com, as this is very urgent.

shell • 16k views

ADD COMMENT • link updated 13 months ago by Ram 43k • written 5.9 years ago by vinayjrao ▴ 250

If this was urgent then you have received multiple answers here and over at Stackoverflow.

Please test all the solutions and validate them. If they all work then you can select multiple answers as accepted.

Please do the same for your previous posts as well.

ADD REPLY • link 5.9 years ago by GenoMax 141k

score 3 · Answer 1 · 2018-05-16

cat fn0

1
3
4

cat fn1

gene    Sample1    Sample2    Sample3
A1BG       5658    5897      6064
AURKA    3656    3484      3415
AURKB    9479    10542    9895

awk 'NR==FNR{data[$1]; next}{if (FNR in data) print}' fn0 fn1

Updated, shorter version:

awk 'NR==FNR{data[$1]; next}FNR in data' fn0 fn1

gene    Sample1    Sample2    Sample3
AURKA    3656    3484      3415
AURKB    9479    10542    9895

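FNR is the record (line) number within the current input file and restarts at 1 for each file, whereas NR keeps counting across all files. NR==FNR is therefore true only while awk reads the first file (fn0), which is what lets a single awk command handle two files.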
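
To make the NR/FNR behaviour concrete, here is a minimal sketch that rebuilds fn0 and fn1 as toy files in a scratch directory (the `tmp` directory, the two-column toy data and the variable names `counters`, `selected` and `want` are assumptions for illustration) and prints both counters before running the selection:

```shell
# Work in a scratch directory so no real files are touched.
tmp=$(mktemp -d)

# fn0: wanted row numbers, one per line; fn1: tab-separated toy data.
printf '1\n3\n4\n' > "$tmp/fn0"
printf 'gene\tSample1\nA1BG\t5658\nAURKA\t3656\nAURKB\t9479\n' > "$tmp/fn1"

# Print NR and FNR for every record: they agree for fn0 and diverge
# as soon as awk starts reading fn1 (FNR restarts at 1, NR does not).
counters=$(awk '{print NR, FNR}' "$tmp/fn0" "$tmp/fn1")
echo "$counters"

# First file: remember each row number as an array key.
# Second file: print a line only if its FNR is one of those keys.
selected=$(awk 'NR==FNR {want[$1]; next} FNR in want' "$tmp/fn0" "$tmp/fn1")
echo "$selected"

rm -r "$tmp"
```

The fourth line of the counter listing is where the two values split, because that is the first record of fn1. Row 1 is the header here, so counting starts at the header line.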

score 3 · Answer 2 · 2018-05-16

As an alternative to fetching rows by line number: if you have the list of genes in a file, you can use the 'grep' command with the '-f' option.

For example, if the names of the genes of interest are in the file "interested_genes.txt", you can use:

grep -w -f interested_genes.txt complete_gene_set.txt >interested_genes_detail.txt

Here:

-w makes grep match whole words only, so a gene name is not matched inside a longer name in complete_gene_set.txt
-f makes grep read its patterns from interested_genes.txt (one per line) and search for them in complete_gene_set.txt
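
One caveat with `grep -w` is that characters such as `-` count as word boundaries, and the pattern is searched in every column. The sketch below swaps in a different technique, an exact awk comparison on the first column, and contrasts it with the grep command. The toy file contents and the gene name AURKA-AS1 are made up for illustration, and `FNR==1` is only there on the assumption that the header line should be kept:

```shell
tmp=$(mktemp -d)

# Toy inputs: one gene of interest, and a table that also contains a
# hyphenated name starting with the same word (hypothetical example).
printf 'AURKA\n' > "$tmp/interested_genes.txt"
printf 'gene\tSample1\nA1BG\t5658\nAURKA\t3656\nAURKA-AS1\t12\nAURKB\t9479\n' \
    > "$tmp/complete_gene_set.txt"

# grep -w: "-" ends a word, so AURKA also matches the AURKA-AS1 line.
grep_hits=$(grep -w -f "$tmp/interested_genes.txt" "$tmp/complete_gene_set.txt")
echo "$grep_hits"

# awk: exact string comparison on column 1 only; FNR==1 keeps the header.
awk_hits=$(awk -F'\t' 'NR==FNR {want[$1]; next} FNR==1 || ($1 in want)' \
    "$tmp/interested_genes.txt" "$tmp/complete_gene_set.txt")
echo "$awk_hits"

rm -r "$tmp"
```

With real data the grep command is often good enough; the awk form is mainly worth it when gene names contain punctuation or when count columns could coincide with a pattern.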

score 2 · Answer 3 · 2018-05-16

5.9 years ago

lieven.sterck 15k

How many rows do you need to select?

If not too many, you could give awk a try:

awk 'NR==1 || NR==3 || NR==4' <file>

ADD COMMENT • link 5.9 years ago by lieven.sterck 15k

I already ditched this method, because I need to select 224 rows :(

ADD REPLY • link 5.9 years ago by vinayjrao ▴ 250

If you have the line numbers in a file or similar, you can simply loop over them and select the lines using the syntax below:

for i in <list>; do
 awk -v lnr="$i" 'NR==lnr' <file>
done

This passes each line number to awk as a variable (lnr), and awk prints the line whose number matches it.
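
As a runnable sketch of this loop, assuming the line numbers sit one per line in a file (called `n.txt` here, with `data.tsv` as toy data; both names and the scratch directory are placeholders), together with a single-pass sed variant for comparison:

```shell
tmp=$(mktemp -d)
printf '1\n3\n4\n' > "$tmp/n.txt"
printf 'gene\tSample1\nA1BG\t5658\nAURKA\t3656\nAURKB\t9479\n' > "$tmp/data.tsv"

# Loop form: one awk run per requested line number, so the data file
# is re-read once for every number in the list.
looped=$(while read -r i; do
    awk -v lnr="$i" 'NR==lnr' "$tmp/data.tsv"
done < "$tmp/n.txt")
echo "$looped"

# Single pass: turn each number N into the sed command "Np" and run
# all of them in one sed invocation (assumes no blank lines in n.txt).
single=$(sed -n "$(sed 's/$/p/' "$tmp/n.txt")" "$tmp/data.tsv")
echo "$single"

rm -r "$tmp"
```

The two forms differ in ordering: the loop prints lines in the order the numbers are listed, while sed prints them in file order. For a few hundred numbers against a large file, the single pass should be noticeably faster.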

ADD REPLY • link 5.9 years ago by lieven.sterck 15k

score 2 · Answer 4 · 2018-05-16

I would use cat -n to add line numbers, then filter with grep -w -f.

$ cat -n data.tsv | sed -r 's/^\s+//'
1       gene    Sample1 Sample2 Sample3
2       A1BG    5658    5897    6064
3       AURKA   3656    3484    3415
4       AURKB   9479    10542   9895

$ cat -n data.tsv | sed -r 's/^\s+//' | grep -w -f n.txt 
1       gene    Sample1 Sample2 Sample3
3       AURKA   3656    3484    3415
4       AURKB   9479    10542   9895

$ cat -n data.tsv | sed -r 's/^\s+//' | grep -w -f n.txt  | cut -f 2-
gene    Sample1 Sample2 Sample3
AURKA   3656    3484    3415
AURKB   9479    10542   9895

But there is a potential bug: because grep searches the whole row, it may print false positives when another column contains a value that is also a row number in n.txt, e.g.,

$ cat -n data.tsv | sed -r 's/^\s+//' | grep -w -f n.txt 
1       gene    Sample1 Sample2 Sample3
2       A1BG    5658    5897    3
3       AURKA   3656    3484    3415
4       AURKB   9479    10542   9895

To avoid this, you may use @Zhilong Jia's solution.
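
If you prefer to keep the number, filter, cut pipeline, one possible fix (a sketch, not tested on the real data set) is to compare the wanted numbers against the first field only, using awk in place of grep. The toy `data.tsv` below deliberately ends row 2 with the value 3 to reproduce the false positive; the scratch directory and the names `picked` and `want` are assumptions:

```shell
tmp=$(mktemp -d)
printf '1\n3\n4\n' > "$tmp/n.txt"

# Row 2 has "3" in its last column, which tripped up grep -w -f.
printf 'gene\tSample1\tSample2\tSample3\n' > "$tmp/data.tsv"
printf 'A1BG\t5658\t5897\t3\n' >> "$tmp/data.tsv"
printf 'AURKA\t3656\t3484\t3415\n' >> "$tmp/data.tsv"
printf 'AURKB\t9479\t10542\t9895\n' >> "$tmp/data.tsv"

# cat -n prefixes each line with its number and a tab. awk reads n.txt
# first, then the numbered stream from stdin ("-"), and keeps a line
# only when its first field is a wanted number. cut drops the number.
picked=$(cat -n "$tmp/data.tsv" \
    | awk 'NR==FNR {want[$1]; next} ($1 in want)' "$tmp/n.txt" - \
    | cut -f 2-)
echo "$picked"

rm -r "$tmp"
```

Because only the line-number field is compared, values in the sample columns can no longer trigger a match.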

The csvtk way (somewhat verbose, but accurate -_-)

$ csvtk sample -H -t -n -p 1 data.tsv | csvtk grep -H -t -f 1 -P n.txt | csvtk cut -t -f -1
gene    Sample1 Sample2 Sample3
AURKA   3656    3484    3415
AURKB   9479    10542   9895

BTW, if the row numbers should be counted from the 2nd line (ignoring the header line):

$ csvtk sample -t -n -p 1 data.tsv | csvtk grep -t -f 1 -P n.txt | csvtk cut -t -f -1
gene    Sample1 Sample2 Sample3
A1BG    5658    5897    6064
AURKB   9479    10542   9895