Question: Replace <*> in a column with the nucleotide, in Python, R, or shell
1
2.7 years ago by
Kritika260
India
Kritika260 wrote:

I have a file with 55,000,000 rows. I want to replace <*> in the 4th column with the nucleotide from the preceding (3rd) column.

Format of data is

12  1109770 C   <*>
12  1109771 T   <*>
12  1109772 T   <*>
12  1109773 T   <*>
12  1109774 C   <*>
12  1109775 C   <*>
12  1109776 C   A,C,C,<*>

Output

12  1109770 C   C
12  1109771 T   T
12  1109772 T   T
12  1109773 T   T
12  1109774 C   C
12  1109775 C   C
12  1109776 C   A,C,C,C
R shell replace python nucleotide • 929 views
ADD COMMENTlink modified 2.7 years ago by Bastien Hervé4.9k • written 2.7 years ago by Kritika260
1

Can you post something you have already tried?

ADD REPLYlink written 2.7 years ago by Sej Modha4.7k

Is your example correct for the last line?

12 1109776 C A,C,C,<*>

This should turn into:

12 1109776 C A,C,C,C

not

12 1109776 C C,A,C,C

Am I correct?

ADD REPLYlink written 2.7 years ago by Bastien Hervé4.9k

Yes, it should turn into A,C,C,C.

ADD REPLYlink written 2.7 years ago by Kritika260

I have updated the question with the actual required output.

ADD REPLYlink written 2.7 years ago by Kritika260

Always add some detail on the effort you put into solving your problem.

ADD REPLYlink written 2.7 years ago by _r_am30k
3
2.7 years ago by
venu6.7k
Germany
venu6.7k wrote:

Assuming your data has exactly the format you posted here, the following one-liner should work:

Input.txt

2       1109770 C       <*>
12      1109771 T       <*>
12      1109772 T       <*>
12      1109773 T       <*>
12      1109774 C       <*>
12      1109775 C       <*>
12      1109776 C       <*>,A,C,C

Oneliner

cat input.txt | sed '/^$/d' | sed -e 's/<//' -e 's/>//' -e 's/\*/X/' -e 's/,/\t/' | awk '{print $1 "\t" $2 "\t" $3 "\t" $3 ","$5}' | sed 's/,$//'

Output

2       1109770 C       C
12      1109771 T       T
12      1109772 T       T
12      1109773 T       T
12      1109774 C       C
12      1109775 C       C
12      1109776 C       C,A,C,C

P.S.: This will not work as intended if <*> is not at the beginning of the (tab-separated) 4th column.

ADD COMMENTlink written 2.7 years ago by venu6.7k
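To make the caveat about the position of <*> concrete, here is a small Python sketch (Python being one of the languages the question asks for). It contrasts the "strip the placeholder and prefix the reference base" idea with substituting the placeholder where it stands. The function names are made up for illustration, prefix_ref only roughly mimics the prefix idea rather than reproducing the one-liner exactly, and the input is assumed to be tab-separated with exactly four columns.

```python
PLACEHOLDER = "<*>"

def prefix_ref(line):
    """Roughly mimic the prefix idea: drop "<*>" from column 4, then put column 3 in front."""
    chrom, pos, ref, alt = line.rstrip("\n").split("\t")
    rest = alt.replace(PLACEHOLDER, "").strip(",")
    alt = ref + ("," + rest if rest else "")
    return "\t".join([chrom, pos, ref, alt])

def substitute_ref(line):
    """Replace "<*>" wherever it occurs in column 4 with the base from column 3."""
    chrom, pos, ref, alt = line.rstrip("\n").split("\t")
    return "\t".join([chrom, pos, ref, alt.replace(PLACEHOLDER, ref)])

if __name__ == "__main__":
    # Same row, with the placeholder first and then last in column 4.
    for row in ["12\t1109776\tC\t<*>,A,C,C", "12\t1109776\tC\tA,C,C,<*>"]:
        print(prefix_ref(row), "|", substitute_ref(row))
```

Both functions agree when <*> comes first, but only the in-place substitution keeps the order of the alleles when <*> comes last.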
3
2.7 years ago by
Bastien Hervé4.9k
Karolinska Institutet, Sweden
Bastien Hervé4.9k wrote:
awk 'BEGIN{OFS=FS="\t"}{gsub(/<\*>/,$3); print $0}' input.txt > output.txt
ADD COMMENTlink modified 2.7 years ago • written 2.7 years ago by Bastien Hervé4.9k
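Since the question also asks for Python, the awk command above could be approximated as follows. This is only a sketch under the assumption that the input is tab-separated and that the nucleotide sits in the 3rd column; fill_placeholder and convert are hypothetical names, not part of any library.

```python
def fill_placeholder(line, placeholder="<*>"):
    """Replace the placeholder anywhere in the line with the 3rd tab-separated field."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        # Blank or short lines have no 3rd column to copy from; pass them through.
        return "\t".join(fields)
    return "\t".join(field.replace(placeholder, fields[2]) for field in fields)

def convert(lines):
    """Lazily yield converted lines, one at a time, each ending in a newline."""
    for line in lines:
        yield fill_placeholder(line) + "\n"

if __name__ == "__main__":
    sample = ["12\t1109775\tC\t<*>\n", "12\t1109776\tC\tA,C,C,<*>\n"]
    for converted in convert(sample):
        print(converted, end="")
```

Because convert is a generator, it can be fed any iterable of lines, for example an open file handle or sys.stdin, without holding the whole input in memory.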
2

Your solution looks like something I would have written a few years ago. The cat is almost always unnecessary, and we don't need sed here :)

awk 'BEGIN{OFS=FS="\t"}{$4=$3$4; gsub("[<*>]",""); print $0}' input > output
ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by 5heikki9.0k
$ awk 'BEGIN{FS=OFS="\t"} {gsub("[<*>]","");$4= $3$4}1' test.txt 
12  1109770 C   C
12  1109771 T   T
12  1109772 T   T
12  1109773 T   T
12  1109774 C   C
12  1109775 C   C
12  1109776 C   C,A,C,C

in bash:

$ paste  <(cut -f1-3 test.txt) <(paste -d "" <(cut -f3 test.txt) <(cut -f4 test.txt | cut --complement -c -3))
12  1109770 C   C
12  1109771 T   T
12  1109772 T   T
12  1109773 T   T
12  1109774 C   C
12  1109775 C   C
12  1109776 C   C,A,C,C
ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by cpad011214k

Maybe because I am a few years younger. Thanks for the tips and the gsub function :)

ADD REPLYlink written 2.7 years ago by Bastien Hervé4.9k

For the last line, this gives me the output: 12 1109776 C CACC,

ADD REPLYlink written 2.7 years ago by Kritika260

For one of the lines:

12 975013 C T,A,<*>

the output is:

12 975013 C CT,

ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by Kritika260
2

Try this :

awk 'BEGIN{OFS=FS="\t"}{gsub(/<\*>/,$3); print $0}' input.txt > output.txt
ADD REPLYlink written 2.7 years ago by Bastien Hervé4.9k

Yes, it works now. Thank you so much!

ADD REPLYlink written 2.7 years ago by Kritika260

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can (and should) test all answers posted here, and accept more than one if they work.


ADD REPLYlink written 2.7 years ago by genomax92k

Notice how your example data did not include any lines with such a format.

ADD REPLYlink written 2.7 years ago by 5heikki9.0k

Sorry, I have updated the post.

ADD REPLYlink written 2.7 years ago by Kritika260

in awk:

$ awk 'BEGIN{FS="\t"} {$4=$3","$4; gsub(/,<\*>/,"")}1' test.txt

in bash:

$ paste  <(cut -f1-3 test.txt) <( paste -d "," <(cut -f3 test.txt) <(cut -f4 test.txt) |  rev| cut -c 1-4 --complement | rev)

output:

12 1109770 C C
12 1109771 T T
12 1109772 T T
12 1109773 T T
12 1109774 C C
12 1109775 C C
12 1109776 C C,A,C,C

input

$ cat test.txt 
12  1109770 C   <*>
12  1109771 T   <*>
12  1109772 T   <*>
12  1109773 T   <*>
12  1109774 C   <*>
12  1109775 C   <*>
12  1109776 C   A,C,C,<*>
ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by cpad011214k
1
2.7 years ago by
zx87549.7k
London
zx87549.7k wrote:

Using R, data.table package for fast read and write:

library(data.table)

# fast read using data.table package
dt1 <- fread("input.txt")

# dt1
#      V1      V2 V3        V4
#   1:  2 1109770  C       <*>
#   2: 12 1109771  T       <*>
#   3: 12 1109772  T       <*>
#   4: 12 1109773  T       <*>
#   5: 12 1109774  C       <*>
#   6: 12 1109775  C       <*>
#   7: 12 1109776  C <*>,A,C,C

# update V4, remove "<*>", prefix with V3
dt1[ , V4 := paste0(V3, gsub("<*>", "", V4, fixed = TRUE)) ]

# dt1
#    V1      V2 V3      V4
# 1:  2 1109770  C       C
# 2: 12 1109771  T       T
# 3: 12 1109772  T       T
# 4: 12 1109773  T       T
# 5: 12 1109774  C       C
# 6: 12 1109775  C       C
# 7: 12 1109776  C C,A,C,C

# fast write, without names, quotes
fwrite(dt1, file = "output.txt", sep = "\t",
       col.names = FALSE, row.names = FALSE, quote = FALSE)
ADD COMMENTlink written 2.7 years ago by zx87549.7k

I can't do it in R; the file is very large.

ADD REPLYlink written 2.7 years ago by Kritika260

Using data.table package it should work. Also, your question title mentions R.

ADD REPLYlink written 2.7 years ago by zx87549.7k
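On the file-size concern: the substitution only ever needs one row at a time, so in principle it does not require loading the whole table. A minimal Python sketch under that assumption is shown below; stream_replace and demo are illustrative names, the file names are placeholders, and the input is assumed to be tab-separated with the placeholder in the 4th column.

```python
import os
import tempfile

def stream_replace(src_path, dst_path, placeholder="<*>"):
    """Copy src to dst line by line, filling the placeholder in column 4.

    Returns the number of lines written.
    """
    written = 0
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 4:
                fields[3] = fields[3].replace(placeholder, fields[2])
            dst.write("\t".join(fields) + "\n")
            written += 1
    return written

def demo():
    """Round-trip two sample rows through temporary files; return (count, output text)."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "input.txt")
        dst = os.path.join(tmp, "output.txt")
        with open(src, "w") as handle:
            handle.write("12\t1109770\tC\t<*>\n12\t975013\tC\tT,A,<*>\n")
        count = stream_replace(src, dst)
        with open(dst) as handle:
            return count, handle.read()

if __name__ == "__main__":
    print(demo())
```

Only the current line is held in memory, so memory use should stay roughly constant regardless of the number of rows; run time on 55 million rows has not been measured here.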

With data frame and stringr:

library(stringr)
df <- read.csv("test.txt", stringsAsFactors = FALSE, sep = "\t", header = FALSE)
df$V4 <- str_replace_all(df$V4, "<\\*>", df$V3)

df

output in R:

   > df
      V1      V2 V3      V4
    1 12 1109770  C       C
    2 12 1109771  T       T
    3 12 1109772  T       T
    4 12 1109773  T       T
    5 12 1109774  C       C
    6 12 1109775  C       C
    7 12 1109776  C A,C,C,C

input in R:

   V1      V2 V3        V4
1 12 1109770  C       <*>
2 12 1109771  T       <*>
3 12 1109772  T       <*>
4 12 1109773  T       <*>
5 12 1109774  C       <*>
6 12 1109775  C       <*>
7 12 1109776  C A,C,C,<*>
ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by cpad011214k