Question: How to extract specific rows based on row number from a file
0
11 days ago by
vinayjrao100
JNCASR, India
vinayjrao100 wrote:

I am working on an RNA-Seq data set of around 24,000 rows (genes) and 1,100 columns (samples), stored as a tab-separated file. For the analysis, I need to select a specific gene set. Is there a method to extract rows by row number? That would be easier for me than selecting by gene name.

Below is an example of the data (4 x 4):

gene     Sample1    Sample2    Sample3
A1BG     5658       5897       6064
AURKA    3656       3484       3415
AURKB    9479       10542      9895

From this, for example, I want rows 1, 3 and 4. The rows I need do not follow any specific pattern.

Thanks.

P.S. I first asked this question on stackoverflow.com, as this is very urgent.

file-handling shell extract • 134 views
modified 10 days ago by shenwei3563.6k • written 11 days ago by vinayjrao100

If this was urgent then you have received multiple answers here and over at Stackoverflow.

Please test all the solutions and validate them. If they all work then you can select multiple answers as accepted.


Please do the same for your previous posts as well.

modified 10 days ago • written 10 days ago by genomax48k
3
10 days ago by
toralmanvar300
toralmanvar300 wrote:

As an alternative to fetching rows by line number: if you have the list of genes in a file, you can use the 'grep' command with the '-f' option.

For example, if the names of the genes of interest are in the file "interested_genes.txt", you can use:

grep -w -f interested_genes.txt complete_gene_set.txt >interested_genes_detail.txt

Here:

  1. -w matches whole words only, so each pattern has to match a complete word in the complete_gene_set.txt file
  2. -f reads the patterns to search for from interested_genes.txt (one per line) and greps for them in the complete_gene_set.txt file
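
One caveat with grep -w: characters such as "-" count as word boundaries, so a pattern like A1BG would also match a row for a gene named A1BG-AS1, or any row where the name appears in another column. A minimal sketch of a stricter variant, assuming the gene name is always in the first tab-separated column and the first line is a header; the sample files below are made up for illustration:

```shell
# Hypothetical sample files, built in a temporary directory for illustration.
tmp=$(mktemp -d)
printf 'gene\tSample1\tSample2\tSample3\nA1BG\t5658\t5897\t6064\nAURKA\t3656\t3484\t3415\nAURKB\t9479\t10542\t9895\n' > "$tmp/complete_gene_set.txt"
printf 'A1BG\nAURKB\n' > "$tmp/interested_genes.txt"

# First file: remember each wanted gene name.
# Second file: print the header (FNR==1) and every row whose
# first column is exactly one of the wanted names.
by_name=$(awk -F'\t' 'NR==FNR {want[$1]; next} FNR==1 || ($1 in want)' \
    "$tmp/interested_genes.txt" "$tmp/complete_gene_set.txt")
printf '%s\n' "$by_name"
```

This compares whole field values rather than searching for patterns, so partial or cross-column matches cannot occur.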
written 10 days ago by toralmanvar300
2
11 days ago by
Zhilong Jia1.3k
London
Zhilong Jia1.3k wrote:

cat fn0

1
3
4

cat fn1

gene    Sample1    Sample2    Sample3
A1BG       5658    5897      6064
AURKA    3656    3484      3415
AURKB    9479    10542    9895

awk 'NR==FNR{data[$1]; next}{if (FNR in data) print}' fn0 fn1

Updated, shorter version: awk 'NR==FNR{data[$1]; next}FNR in data' fn0 fn1

gene    Sample1    Sample2    Sample3
AURKA    3656    3484      3415
AURKB    9479    10542    9895

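FNR is the record (line) number within the current file: it restarts at 1 for each input file, while NR keeps counting across all files. So NR==FNR is true only while the first file (fn0) is being read, which is what makes it possible to handle two files in one awk command.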
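
To see how the two counters behave, here is a small sketch that prints NR, FNR and the line content for two tiny made-up files (the file contents are only for illustration):

```shell
# Two small hypothetical input files.
tmp=$(mktemp -d)
printf '1\n3\n' > "$tmp/fn0"
printf 'a\nb\nc\n' > "$tmp/fn1"

# NR counts records across all files; FNR restarts at 1 for each file.
counters=$(awk '{print NR, FNR, $0}' "$tmp/fn0" "$tmp/fn1")
printf '%s\n' "$counters"
# 1 1 1
# 2 2 3
# 3 1 a
# 4 2 b
# 5 3 c
```

NR and FNR are equal only on the first two lines, i.e. while the first file is being read.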

modified 10 days ago • written 11 days ago by Zhilong Jia1.3k
1
11 days ago by
lieven.sterck1.4k
Belgium, Ghent, VIB
lieven.sterck1.4k wrote:

How many rows do you need to select?

If not too many, you could give awk a try:

awk 'NR==1 || NR==3 || NR==4' <file>
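
For a handful of rows, sed can do the same thing. A small sketch, assuming the example table from the question is saved as a tab-separated file (the file name data.tsv is just a placeholder here):

```shell
# Hypothetical copy of the example table from the question.
tmp=$(mktemp -d)
printf 'gene\tSample1\tSample2\tSample3\nA1BG\t5658\t5897\t6064\nAURKA\t3656\t3484\t3415\nAURKB\t9479\t10542\t9895\n' > "$tmp/data.tsv"

# -n suppresses the default printing; "Np" prints only line N.
picked=$(sed -n '1p;3p;4p' "$tmp/data.tsv")
printf '%s\n' "$picked"
```

Like the awk one-liner, this gets unwieldy once the list of row numbers grows.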
modified 11 days ago • written 11 days ago by lieven.sterck1.4k

I already ditched this method, because I need to select 224 rows :(

written 11 days ago by vinayjrao100
1

If you have the row numbers in a file or similar, you can simply loop over them and select the lines using the syntax below:

for i in <list>; do
 awk -v lnr="$i" 'NR==lnr' <file>
done

This passes each line number to awk as a variable, and awk prints the line with that number.
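
A concrete sketch of that loop, assuming the row numbers are stored one per line in a file (rows.txt and data.tsv are placeholder names, and the contents are made up for illustration):

```shell
# Hypothetical data file and list of wanted row numbers.
tmp=$(mktemp -d)
printf 'gene\tSample1\tSample2\tSample3\nA1BG\t5658\t5897\t6064\nAURKA\t3656\t3484\t3415\nAURKB\t9479\t10542\t9895\n' > "$tmp/data.tsv"
printf '4\n1\n' > "$tmp/rows.txt"   # deliberately not in file order

# One awk run per row number: each run scans data.tsv and prints line lnr.
looped=$(while read -r i; do
    awk -v lnr="$i" 'NR==lnr' "$tmp/data.tsv"
done < "$tmp/rows.txt")
printf '%s\n' "$looped"
```

Two things to keep in mind: the rows come out in the order of the list (here row 4 before row 1), not in file order, and the data file is read once per row number, so this gets slow for long lists on a large file.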

modified 11 days ago • written 11 days ago by lieven.sterck1.4k
1
10 days ago by
shenwei3563.6k
China
shenwei3563.6k wrote:

One approach is to use cat -n to prepend the line numbers, then grep -w -f to select the wanted ones.

$ cat -n data.tsv | sed -r 's/^\s+//'
1       gene    Sample1 Sample2 Sample3
2       A1BG    5658    5897    6064
3       AURKA   3656    3484    3415
4       AURKB   9479    10542   9895

$ cat -n data.tsv | sed -r 's/^\s+//' | grep -w -f n.txt 
1       gene    Sample1 Sample2 Sample3
3       AURKA   3656    3484    3415
4       AURKB   9479    10542   9895

$ cat -n data.tsv | sed -r 's/^\s+//' | grep -w -f n.txt  | cut -f 2-
gene    Sample1 Sample2 Sample3
AURKA   3656    3484    3415
AURKB   9479    10542   9895

But there is a potential bug: because grep searches the whole row, it may print false positives when another column contains a row number listed in n.txt, e.g.,

$ cat -n data.tsv | sed -r 's/^\s+//' | grep -w -f n.txt 
1       gene    Sample1 Sample2 Sample3
2       A1BG    5658    5897    3
3       AURKA   3656    3484    3415
4       AURKB   9479    10542   9895

To avoid this, you may use @Zhilong Jia's solution.
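
If you prefer to stay with grep, one possible workaround is to anchor each row number to the start of the line, so that only the number column added by cat -n can match. This is a sketch under the assumption that n.txt holds one plain integer per line; the file anchored.txt is a helper introduced here for illustration:

```shell
# Hypothetical data with a "3" in the last column of row 2,
# which triggers the false positive with plain grep -w -f.
tmp=$(mktemp -d)
printf 'gene\tSample1\tSample2\tSample3\nA1BG\t5658\t5897\t3\nAURKA\t3656\t3484\t3415\nAURKB\t9479\t10542\t9895\n' > "$tmp/data.tsv"
printf '1\n3\n4\n' > "$tmp/n.txt"

# Turn each row number N into the pattern "^N[[:space:]]":
# N at the start of the line, followed by whitespace (the tab from cat -n).
sed 's/.*/^&[[:space:]]/' "$tmp/n.txt" > "$tmp/anchored.txt"

anchored=$(cat -n "$tmp/data.tsv" | sed 's/^ *//' \
    | grep -f "$tmp/anchored.txt" | cut -f 2-)
printf '%s\n' "$anchored"
```

With the anchored patterns, row 2 is no longer selected even though its last column contains 3.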


The csvtk way (somewhat verbose, but accurate -_-)

$ csvtk sample -H -t -n -p 1 data.tsv | csvtk grep -H -t -f 1 -P n.txt | csvtk cut -t -f -1
gene    Sample1 Sample2 Sample3
AURKA   3656    3484    3415
AURKB   9479    10542   9895

BTW, if the row numbering should start from the 2nd line (ignoring the header line):

$ csvtk sample -t -n -p 1 data.tsv | csvtk grep -t -f 1 -P n.txt | csvtk cut -t -f -1
gene    Sample1 Sample2 Sample3
A1BG    5658    5897    6064
AURKB   9479    10542   9895
modified 10 days ago • written 10 days ago by shenwei3563.6k