Question

Extract only rows with main chromosomes (1-22, X, Y) in the first column?

4

5.7 years ago

star ▴ 350

I have a table like the one below. It is a BED file of genome coordinates, and I would like to keep only the rows with numbers.

Input:

1           141009669   141009952
9           141016322   141016973
GL000195.1  81719   82468
GL000195.1  142613  142923
GL000220.1  119445  119746
HG115_PATCH 101957832   101958132
HG1308_PATCH 130205069  130205369
HG1308_PATCH 130205406  130205773
HG748_PATCH  77577953   77578264
X            200983 202660
y         205180    205702

output:

1           141009669   141009952
9           141016322   141016973
X            200983 202660
y            205180 205702

Thanks in advance!

linux grep R script • 6.8k views

ADD COMMENT • link updated 5.7 years ago by benformatics 3.9k • written 5.7 years ago by star ▴ 350

0

x and y are not numbers but I get what you are asking for. You only want to keep main chromosomes?

ADD REPLY • link 5.7 years ago by GenoMax 141k

0

yes, exactly, I want just main chromosomes.

ADD REPLY • link 5.7 years ago by star ▴ 350

1

Tell us what you have tried so far, if you are interested in fixing your own attempt.

ADD REPLY • link 5.7 years ago by GenoMax 141k

0

5.7 years ago

benformatics 3.9k

You can almost do this in one line with R and GenomicFeatures + rtracklayer

library(GenomicFeatures)
library(rtracklayer)

## read your table as a random text file
keepStandardChromosomes(GRanges(read.table('your_table.bed',col.names=c('chr','start','stop'))),pruning.mode='coarse')

## read your table in as a BED file; if it really is a bona fide .bed, you can simplify a bit
keepStandardChromosomes(import.bed('your_table.bed'),pruning.mode='coarse')

## do the same thing as above but simultaneously save it as a new bed file named "your_table_subset.bed"
export.bed(keepStandardChromosomes(import.bed('your_table.bed'),pruning.mode='coarse'),file='your_table_subset.bed')

ADD COMMENT • link 5.7 years ago by benformatics 3.9k

score 5 · Accepted Answer · 2018-08-06

5

5.7 years ago

Friederike 8.9k

command line:

grep -E -i "^([0-9]+|[XY])[[:space:]]" file

EDIT: originally, this used a regex ending in $ (^[0-9XY]$), as in the R-based example below, which tests only the chromosome column. That does not work here because grep matches against the full line, which contains more characters (such as the coordinates). Thanks to Alex for pointing this out.

R (because this post is tagged with it):

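# assuming your data frame with the coordinates looks like this
df <- data.frame(chr = c("1", "2", "X", "Y", "GL000220.1"),
                 start = c(1, 20, 30, 40, 50),
                 end = c(11, 21, 31, 41, 51))

# match the whole chr value: one or two digits, or X/Y (case-insensitive)
subset(df, grepl("^([0-9]{1,2}|[XY])$", chr, ignore.case = TRUE))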

ADD COMMENT • link 5.7 years ago by Friederike 8.9k

0

I don't think this works? For example:

$ echo -e '1\n11\n22\n2' | grep -E "^[0-9XY]$"
1
2

Perhaps you might want the following:

$ echo -e '1\n11\n22\nP\n\XY\n2\nZ\nX' | grep -E "^[0-9]{1,2}$|^[XY]$"
1
11
22
2
X

ADD REPLY • link 5.7 years ago by Alex Reynolds 35k

0

yes, you're right, the dollar sign in the original command I posted was silly (only makes sense for the R-based command)

ADD REPLY • link 5.7 years ago by Friederike 8.9k
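
One way to sidestep the anchoring problem discussed above is to test only the first field rather than the whole line, so that both ^ and $ can be used. A minimal sketch with awk, assuming a tab-separated file; the temporary file and its rows (including chromosomes 11 and 22) are made up for illustration:

```shell
# Build a small tab-separated sample; these rows are illustrative assumptions.
bed=$(mktemp)
printf '1\t141009669\t141009952\n11\t100\t200\n22\t300\t400\nGL000220.1\t119445\t119746\nX\t200983\t202660\ny\t205180\t205702\n' > "$bed"

# Upper-case the first field, then require it to be exactly 1-22, X, or Y.
main_rows=$(awk 'toupper($1) ~ /^([1-9]|1[0-9]|2[0-2]|X|Y)$/' "$bed")
printf '%s\n' "$main_rows"

rm -f "$bed"
```

Because the comparison is made on the first field alone, a name such as 1_random or a chromosome 23 would not slip through, and a lowercase y is still kept.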

score 5 · Accepted Answer · 2018-08-06

$ sed -n '/^[0-9XY]/Ip' test.txt

or

$ sed -n '/^[^A-WZ]/Ip' test.txt

1   141009669   141009952
9   141016322   141016973
X   200983  202660
y   205180  205702

with tsv-utils:

$  tsv-filter  --iregex  '1:^[^A-WZ]' test.txt

1   141009669   141009952
9   141016322   141016973
X   200983  202660
y   205180  205702
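
One caveat with the negated class is that it keeps any line that does not start with a letter in A-W or Z, which includes header or comment lines beginning with #. A small sketch of the difference, assuming a tab-separated file with a hypothetical #chrom header line; the stricter pattern is offered as an alternative and is not part of the answer above:

```shell
# Sample with a hypothetical header line; file contents are assumptions.
bed=$(mktemp)
printf '#chrom\tstart\tend\n1\t141009669\t141009952\nGL000195.1\t81719\t82468\nX\t200983\t202660\ny\t205180\t205702\n' > "$bed"

# Negated class, written out for both cases so no case-insensitive flag is needed.
loose=$(LC_ALL=C grep '^[^A-WZa-wz]' "$bed")

# Stricter: digits, X, or Y (any case) followed directly by whitespace.
strict=$(grep -E -i '^([0-9]+|X|Y)[[:space:]]' "$bed")

printf 'negated class:\n%s\n\nstricter pattern:\n%s\n' "$loose" "$strict"
rm -f "$bed"
```

The negated class lets the #chrom line through, while the stricter pattern keeps only the 1, X and y rows.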

score 4 · Accepted Answer · 2018-08-06

A straightforward way:

mainChr = c(as.character(1:22), 'x', 'X', 'y', 'Y')
data = read.delim('your.bed', stringsAsFactors = FALSE, header = FALSE)
data_with_mainChr = data[data$V1 %in% mainChr, ]

If you want to use readr and dplyr, which are more efficient when dealing with big files:

library(dplyr); library(readr)
mainChr = c(as.character(1:22), 'x', 'X', 'y', 'Y')
data = read_tsv('your_bed_file', col_names = FALSE)
data_with_mainChr = dplyr::filter(data, X1 %in% mainChr)
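
The same whitelist idea (an explicit set of allowed names rather than a regex) can also be sketched on the command line. This is an awk version, assuming a tab- or space-separated file; the temporary file and the array name keep are illustrative, and it is not meant as a replacement for the R code above:

```shell
# Sample rows taken from the question, written tab-separated to a temp file.
bed=$(mktemp)
printf '1\t141009669\t141009952\n9\t141016322\t141016973\nGL000195.1\t81719\t82468\nHG115_PATCH\t101957832\t101958132\nX\t200983\t202660\ny\t205180\t205702\n' > "$bed"

# Build the set of allowed names once, then keep rows whose
# upper-cased first field is a member of that set.
kept=$(awk 'BEGIN { for (i = 1; i <= 22; i++) keep[i]; keep["X"]; keep["Y"] }
            (toupper($1) in keep)' "$bed")
printf '%s\n' "$kept"

rm -f "$bed"
```

Upper-casing the first field plays the role of listing both 'x' and 'X' in mainChr.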

score 2 · Accepted Answer · 2018-08-06

2

5.7 years ago

marina.v.yurieva ▴ 570

You can do the opposite: list the distinct values in the first column

awk '{print $1}' file | sort | uniq

to find the names you want to exclude, and then exclude those rows with grep

grep -v -e 'pattern1' -e 'pattern2' file

ADD COMMENT • link 5.7 years ago by marina.v.yurieva ▴ 570
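
The two steps above can be sketched end to end on the sample from the question. This assumes a tab-separated file; the GL and HG prefixes are simply what the listing shows for this sample and would need adjusting for other data:

```shell
# Sample rows from the question, written tab-separated to a temp file.
bed=$(mktemp)
printf '1\t141009669\t141009952\n9\t141016322\t141016973\nGL000195.1\t81719\t82468\nHG115_PATCH\t101957832\t101958132\nX\t200983\t202660\ny\t205180\t205702\n' > "$bed"

# Step 1: list the distinct names in the first column.
names=$(cut -f1 "$bed" | LC_ALL=C sort -u)
printf '%s\n' "$names"

# Step 2: drop rows whose name starts with an unwanted prefix.
# GL and HG are read off the listing above (assumption for this sample).
kept=$(grep -v -E '^(GL|HG)' "$bed")
printf '%s\n' "$kept"

rm -f "$bed"
```

Anchoring the exclusion patterns with ^ keeps them from matching digits or letters that happen to appear in the coordinate columns.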