Question: Extract only rows with main chromosomes (1-22, X, Y) in the first column?
star wrote:

I have a table like the one below. It is a BED file of genome coordinates, and I would like to keep only the rows with numbers.

Input:

1           141009669   141009952
9           141016322   141016973
GL000195.1  81719   82468
GL000195.1  142613  142923
GL000220.1  119445  119746
HG115_PATCH 101957832   101958132
HG1308_PATCH 130205069  130205369
HG1308_PATCH 130205406  130205773
HG748_PATCH  77577953   77578264
X            200983 202660
y         205180    205702

Desired output:

1           141009669   141009952
9           141016322   141016973
X            200983 202660
y            205180 205702

Thanks in advance!

Tags: linux, script, grep, R
Question written 14 months ago by star (thread modified 13 months ago by benformatics)

X and Y are not numbers, but I get what you are asking for. You only want to keep the main chromosomes?

Reply written 14 months ago by genomax

Yes, exactly. I want just the main chromosomes.

Reply written 14 months ago by star

Tell us what you have tried so far, if you are interested in fixing your own attempt.

Reply written 14 months ago by genomax (modified 14 months ago)

Friederike wrote:

Command line:

grep -Ei "^[0-9XY]" file

EDIT: originally, this used the same kind of end-anchored regex as the R-based example (^[0-9XY]$). That won't work here, because grep matches against the full line of the text file, which contains more characters (such as the coordinates). Thanks to Alex for pointing this out.

R (because this post is tagged with it):

# assuming your data frame with the coordinates looks like this
df <- data.frame(chr = c("1","2","X","Y", "GL000220.1"),
                 start = c(1, 20, 30, 40, 50),
                 end = c(11, 21, 31, 41, 51)
)

# the regex is applied to the chr column only, so the $ anchor is safe here;
# the alternation keeps 1-22, X and Y (a bare ^[0-9XY]$ would drop 10-22)
subset(df, grepl("^([1-9]|1[0-9]|2[0-2]|X|Y)$", chr))
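
The point made in the EDIT above (an end-anchored pattern only works when it is matched against the chromosome column, not the whole line) can also be applied on the command line by testing the first field with awk. This is only a minimal sketch, assuming a whitespace-delimited file; the file name sample.bed, the variable names, and the two-digit row are made up for illustration:

```shell
#!/bin/sh
# Hypothetical sample input, modelled on the table in the question.
tmpdir=$(mktemp -d)
{
  printf '1\t141009669\t141009952\n'
  printf '11\t5000\t6000\n'            # made-up two-digit chromosome row
  printf 'GL000195.1\t81719\t82468\n'
  printf 'HG115_PATCH\t101957832\t101958132\n'
  printf 'X\t200983\t202660\n'
  printf 'y\t205180\t205702\n'
} > "$tmpdir/sample.bed"

# Whole-line match: no line consists of a single character,
# so the ^...$ pattern matches nothing.
whole_line=$(grep -E '^[0-9XY]$' "$tmpdir/sample.bed" || true)

# Field match: an anchored pattern applied to column 1 only.
# toupper() lets the lowercase "y" row count as Y.
first_field=$(awk 'toupper($1) ~ /^([1-9]|1[0-9]|2[0-2]|X|Y)$/' "$tmpdir/sample.bed")

printf '%s\n' "$first_field"
rm -r "$tmpdir"
```

Matching on the field rather than the line also avoids keeping any name that merely starts with a digit, X, or Y.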

Answer written 14 months ago by Friederike (modified 14 months ago)

I don't think this works? For example:

$ echo -e '1\n11\n22\n2' | grep -E "^[0-9XY]$"
1
2

Perhaps you might want the following:

$ echo -e '1\n11\n22\nP\nXY\n2\nZ\nX' | grep -E "^[0-9]{1,2}$|^[XY]$"
1
11
22
2
X
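
The pattern above is tested on single-column input. For a BED-like file, one way to keep the same idea is to replace the $ anchor with a whitespace character, so that the match has to end exactly where the first column ends. A sketch under that assumption follows; the sample rows and the name sample.bed are illustrative (22_random is a made-up contig name), and -i is added only because the question's sample contains a lowercase y:

```shell
#!/bin/sh
# Hypothetical sample input; 22_random is an invented name that merely
# starts like a main chromosome.
tmpdir=$(mktemp -d)
{
  printf '9\t141016322\t141016973\n'
  printf '11\t5000\t6000\n'
  printf '22_random\t100\t200\n'
  printf 'GL000195.1\t81719\t82468\n'
  printf 'HG115_PATCH\t101957832\t101958132\n'
  printf 'X\t200983\t202660\n'
  printf 'y\t205180\t205702\n'
} > "$tmpdir/sample.bed"

# One or two digits, or X/Y, followed immediately by whitespace
# (the column separator) instead of the end of the line.
kept=$(grep -Ei '^([0-9]{1,2}|[XY])[[:space:]]' "$tmpdir/sample.bed")

printf '%s\n' "$kept"
rm -r "$tmpdir"
```

Like the original pattern, [0-9]{1,2} still accepts values such as 0 or 99; a stricter alternation would be needed if such names can occur.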

Reply written 14 months ago by Alex Reynolds (modified 14 months ago)

Yes, you're right: the dollar sign in the original command I posted was silly (it only makes sense for the R-based command).

Reply written 14 months ago by Friederike

cpad0112 wrote:

$ sed -n '/^[0-9XY]/Ip' test.txt

or

$ sed -n '/^[^A-WZ]/Ip' test.txt

1   141009669   141009952
9   141016322   141016973
X   200983  202660
y   205180  205702

with tsv-utils:

$  tsv-filter  --iregex  '1:^[^A-WZ]' test.txt

1   141009669   141009952
9   141016322   141016973
X   200983  202660
y   205180  205702

Answer written 14 months ago by cpad0112 (modified 14 months ago)

marina.v.yurieva wrote:

You can do the opposite: list the distinct values in the first column

awk '{print $1}' file | sort | uniq

to find the names you want to exclude, and then exclude the matching rows with grep:

grep -v -e 'pattern1' -e 'pattern2' file
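
As a rough illustration of this two-step approach on rows like those in the question: the prefixes ^GL and ^HG below are read off this particular sample and are an assumption, not a general rule, and the file and variable names are made up:

```shell
#!/bin/sh
# Hypothetical sample input built from rows in the question.
tmpdir=$(mktemp -d)
{
  printf '1\t141009669\t141009952\n'
  printf '9\t141016322\t141016973\n'
  printf 'GL000195.1\t81719\t82468\n'
  printf 'GL000195.1\t142613\t142923\n'
  printf 'GL000220.1\t119445\t119746\n'
  printf 'HG115_PATCH\t101957832\t101958132\n'
  printf 'X\t200983\t202660\n'
  printf 'y\t205180\t205702\n'
} > "$tmpdir/sample.bed"

# Step 1: inspect the distinct names in column 1, with row counts.
names=$(awk '{print $1}' "$tmpdir/sample.bed" | sort | uniq -c)
printf '%s\n' "$names"

# Step 2: drop rows whose name starts with an unwanted prefix.
# The ^ anchor keeps the patterns from matching inside other columns.
kept=$(grep -v -e '^GL' -e '^HG' "$tmpdir/sample.bed")
printf '%s\n' "$kept"

rm -r "$tmpdir"
```

Because the exclusion list comes from inspecting the data, it has to be rechecked for every new file or assembly.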

Answer written 14 months ago by marina.v.yurieva

ewre wrote:

A straightforward way:

mainChr = c(as.character(1:22), 'x', 'X', 'y', 'Y')
data = read.delim('your.bed', stringsAsFactors = FALSE, header = FALSE)
data_with_mainChr = data[data$V1 %in% mainChr, ]

If you want to use readr and dplyr, which are more efficient when dealing with big files:

library(dplyr); library(readr)
mainChr = c(as.character(1:22), 'x', 'X', 'y', 'Y')
data = read_tsv('your_bed_file', col_names = FALSE)
data_with_mainChr = dplyr::filter(data, X1 %in% mainChr)
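
For comparison, the same exact-membership test (the %in% lookup against a fixed list of names) can be sketched in awk without loading the file into R. This is only an illustration; the file name, variable names, and the row for chromosome 22 are assumptions:

```shell
#!/bin/sh
# Hypothetical sample input, modelled on the table in the question.
tmpdir=$(mktemp -d)
{
  printf '1\t141009669\t141009952\n'
  printf '22\t5000\t6000\n'            # made-up row for chromosome 22
  printf 'GL000220.1\t119445\t119746\n'
  printf 'HG748_PATCH\t77577953\t77578264\n'
  printf 'X\t200983\t202660\n'
  printf 'y\t205180\t205702\n'
} > "$tmpdir/sample.bed"

# Build the set of allowed names once, then keep rows whose first
# field is a member of that set (exact match, no regex involved).
kept=$(awk 'BEGIN {
              for (i = 1; i <= 22; i++) main[i] = 1
              main["X"] = main["x"] = main["Y"] = main["y"] = 1
            }
            ($1 in main)' "$tmpdir/sample.bed")

printf '%s\n' "$kept"
rm -r "$tmpdir"
```

An explicit set is stricter than a prefix regex: a name is kept only if it equals one of the listed values.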

Answer written 14 months ago by ewre

benformatics wrote:

You can almost do this in one line with R and GenomicFeatures + rtracklayer

library(GenomicFeatures)
library(rtracklayer)

## read your table as a random text file
keepStandardChromosomes(GRanges(read.table('your_table.bed',col.names=c('chr','start','stop'))),pruning.mode='coarse')

## read your table in as a BED file - if it really is a bona fide .bed then you can simplify a bit
keepStandardChromosomes(import.bed('your_table.bed'),pruning.mode='coarse')

## do the same thing as above but simultaneously save it as a new bed file named "your_table_subset.bed"
export.bed(keepStandardChromosomes(import.bed('your_table.bed'),pruning.mode='coarse'),file='your_table_subset.bed')

Answer written 13 months ago by benformatics (modified 13 months ago)