Question: Replace <*> column with nucleotide in python or R or Shell
1
21 months ago by
Kritika260
India
Kritika260 wrote:

I have data with 55,000,000 rows. I want to replace <*> in the 4th column with the nucleotide from the preceding (3rd) column.

Format of data is

12  1109770 C   <*>
12  1109771 T   <*>
12  1109772 T   <*>
12  1109773 T   <*>
12  1109774 C   <*>
12  1109775 C   <*>
12  1109776 C   A,C,C,<*>

Output

12  1109770 C   C
12  1109771 T   T
12  1109772 T   T
12  1109773 T   T
12  1109774 C   C
12  1109775 C   C
12  1109776 C   A,C,C,C
R shell replace python nucleotide • 776 views
ADD COMMENTlink modified 21 months ago by Bastien Hervé4.5k • written 21 months ago by Kritika260
1

Can you post something you have already tried?

ADD REPLYlink written 21 months ago by Sej Modha4.5k

Is your example correct for the last line?

12 1109776 C A,C,C,<*>

This should turn into:

12 1109776 C A,C,C,C

not

12 1109776 C C,A,C,C

Am I correct?

ADD REPLYlink written 21 months ago by Bastien Hervé4.5k

Yes, it should turn into A,C,C,C.

ADD REPLYlink written 21 months ago by Kritika260

I have updated the post again with the actual required output.

ADD REPLYlink written 21 months ago by Kritika260

Always add some detail on the effort you put into solving your problem.

ADD REPLYlink written 21 months ago by RamRS25k
3
21 months ago by
venu6.3k
Germany
venu6.3k wrote:

Assuming your data is in exactly the format you posted here, the following one-liner should work:

Input.txt

2       1109770 C       <*>
12      1109771 T       <*>
12      1109772 T       <*>
12      1109773 T       <*>
12      1109774 C       <*>
12      1109775 C       <*>
12      1109776 C       <*>,A,C,C

One-liner

cat input.txt | sed '/^$/d' | sed -e 's/<//' -e 's/>//' -e 's/\*/X/' -e 's/,/\t/' | awk '{print $1 "\t" $2 "\t" $3 "\t" $3 ","$5}' | sed 's/,$//'

Output

2       1109770 C       C
12      1109771 T       T
12      1109772 T       T
12      1109773 T       T
12      1109774 C       C
12      1109775 C       C
12      1109776 C       C,A,C,C

P.S: This might not work if <*> is not at the beginning of the 4th column (tab separated).

ADD COMMENTlink written 21 months ago by venu6.3k
3
21 months ago by
Bastien Hervé4.5k
Limoges, CBRS, France
Bastien Hervé4.5k wrote:
awk 'BEGIN{OFS=FS="\t"}{gsub(/<\*>/,$3); print $0}' input.txt > output.txt
ADD COMMENTlink modified 21 months ago • written 21 months ago by Bastien Hervé4.5k
2

Your solution looks like something I would have written a few years ago. The cat is almost always useless, and we don't need sed here either :)

awk 'BEGIN{OFS=FS="\t"}{$4=$3$4; gsub("[<*>]",""); print $0}' input > output
ADD REPLYlink modified 21 months ago • written 21 months ago by 5heikki8.6k
$ awk 'BEGIN{FS=OFS="\t"} {gsub("[<*>]",""); $4=$3$4}1' test.txt
12  1109770 C   C
12  1109771 T   T
12  1109772 T   T
12  1109773 T   T
12  1109774 C   C
12  1109775 C   C
12  1109776 C   C,A,C,C

in bash:

$ paste  <(cut -f1-3 test.txt) <(paste -d "" <(cut -f3 test.txt) <(cut -f4 test.txt | cut --complement -c -3))
12  1109770 C   C
12  1109771 T   T
12  1109772 T   T
12  1109773 T   T
12  1109774 C   C
12  1109775 C   C
12  1109776 C   C,A,C,C
ADD REPLYlink modified 21 months ago • written 21 months ago by cpad011212k

Maybe because I am a few years younger. Thanks for the tips and the gsub function :)

ADD REPLYlink written 21 months ago by Bastien Hervé4.5k

This gives me the following output for the last line: 12 1109776 C CACC,

ADD REPLYlink written 21 months ago by Kritika260

For one of the lines:

12 975013 C T,A,<*>

output

12 975013 C CT,

ADD REPLYlink modified 21 months ago • written 21 months ago by Kritika260
2

Try this :

awk 'BEGIN{OFS=FS="\t"}{gsub(/<\*>/,$3); print $0}' input.txt > output.txt
ADD REPLYlink written 21 months ago by Bastien Hervé4.5k

Yes, it works now. Thank you so much!

ADD REPLYlink written 21 months ago by Kritika260

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can (and should) test all answers posted here, and accept more than one if they work.

Upvote|Bookmark|Accept

ADD REPLYlink written 21 months ago by genomax75k

Notice how your example data did not include any lines in that format.

ADD REPLYlink written 21 months ago by 5heikki8.6k

Sorry, I have updated the post.

ADD REPLYlink written 21 months ago by Kritika260
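
To make the difference between the two awk approaches above concrete, here is a minimal Python sketch. The function names `prefix_and_strip` and `substitute_in_place` are hypothetical, and the functions only approximate the awk commands: they operate on the 3rd and 4th fields alone, whereas the awk versions apply gsub to the whole line. It suggests why prepending the 3rd column and stripping `<`, `*` and `>` only works when `<*>` comes first in the 4th column, while a literal substitution works wherever `<*>` appears.

```python
import re


def prefix_and_strip(ref, alts):
    # Roughly mirrors $4=$3$4 followed by gsub("[<*>]",""):
    # prepend the reference base, then delete every <, * and > character.
    return re.sub(r"[<*>]", "", ref + alts)


def substitute_in_place(ref, alts):
    # Roughly mirrors gsub(/<\*>/,$3):
    # swap each literal <*> for the reference base where it stands.
    return re.sub(r"<\*>", ref, alts)


print(prefix_and_strip("C", "<*>,A,C,C"))   # C,A,C,C
print(prefix_and_strip("C", "T,A,<*>"))     # CT,A,
print(substitute_in_place("C", "T,A,<*>"))  # T,A,C
```

With `<*>` at the end of the field, the first approach leaves the reference base glued to the front and a trailing comma, whereas the substitution yields the requested form.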

in awk:

$ awk 'BEGIN{FS="\t"} {$4=$3","$4; gsub(/,<\*>/,"")}1' test.txt

in bash:

$ paste  <(cut -f1-3 test.txt) <( paste -d "," <(cut -f3 test.txt) <(cut -f4 test.txt) |  rev| cut -c 1-4 --complement | rev)

output:

12 1109770 C C
12 1109771 T T
12 1109772 T T
12 1109773 T T
12 1109774 C C
12 1109775 C C
12 1109776 C C,A,C,C

input

$ cat test.txt 
12  1109770 C   <*>
12  1109771 T   <*>
12  1109772 T   <*>
12  1109773 T   <*>
12  1109774 C   <*>
12  1109775 C   <*>
12  1109776 C   A,C,C,<*>
ADD REPLYlink modified 21 months ago • written 21 months ago by cpad011212k
1
21 months ago by
zx87548.8k
London
zx87548.8k wrote:

Using R, with the data.table package for fast read and write:

library(data.table)

# fast read using data.table package
dt1 <- fread("input.txt")

# dt1
#      V1      V2 V3        V4
#   1:  2 1109770  C       <*>
#   2: 12 1109771  T       <*>
#   3: 12 1109772  T       <*>
#   4: 12 1109773  T       <*>
#   5: 12 1109774  C       <*>
#   6: 12 1109775  C       <*>
#   7: 12 1109776  C <*>,A,C,C

# update V4, remove "<*>", prefix with V3
dt1[ , V4 := paste0(V3, gsub("<*>", "", V4, fixed = TRUE)) ]

# dt1
#    V1      V2 V3      V4
# 1:  2 1109770  C       C
# 2: 12 1109771  T       T
# 3: 12 1109772  T       T
# 4: 12 1109773  T       T
# 5: 12 1109774  C       C
# 6: 12 1109775  C       C
# 7: 12 1109776  C C,A,C,C

# fast write, without names, quotes
fwrite(dt1, file = "output.txt", sep = "\t",
       col.names = FALSE, row.names = FALSE, quote = FALSE)
ADD COMMENTlink written 21 months ago by zx87548.8k

I can't do it in R; the file is very large.

ADD REPLYlink written 21 months ago by Kritika260

Using data.table package it should work. Also, your question title mentions R.

ADD REPLYlink written 21 months ago by zx87548.8k
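
Since the question title also mentions Python and the file size is a concern, here is a minimal sketch of a streaming approach that reads and writes one line at a time, so memory use should not grow with the number of rows. The function name `convert` is hypothetical, and the sketch assumes tab-separated input with at least four columns per data line; shorter lines are passed through unchanged.

```python
import io


def convert(src, dst, placeholder="<*>"):
    """Copy src to dst, replacing the placeholder in column 4 with column 3."""
    for line in src:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 4:
            # Replace <*> wherever it occurs in the 4th column.
            fields[3] = fields[3].replace(placeholder, fields[2])
        dst.write("\t".join(fields) + "\n")


# Small in-memory demonstration using two rows from the question.
sample = io.StringIO("12\t1109775\tC\t<*>\n12\t1109776\tC\tA,C,C,<*>\n")
out = io.StringIO()
convert(sample, out)
print(out.getvalue(), end="")
```

For real files, the same function could be called as `with open("input.txt") as src, open("output.txt", "w") as dst: convert(src, dst)`, with the file names taken from the awk answer above.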

With a data frame and stringr:

library(stringr)
df=read.csv("test.txt", stringsAsFactors = F, sep = "\t", header = F)
df$V4=str_replace_all(df$V4,"<\\*>", df$V3)

df

output in R:

   > df
      V1      V2 V3      V4
    1 12 1109770  C       C
    2 12 1109771  T       T
    3 12 1109772  T       T
    4 12 1109773  T       T
    5 12 1109774  C       C
    6 12 1109775  C       C
    7 12 1109776  C A,C,C,C

input in R:

   V1      V2 V3        V4
1 12 1109770  C       <*>
2 12 1109771  T       <*>
3 12 1109772  T       <*>
4 12 1109773  T       <*>
5 12 1109774  C       <*>
6 12 1109775  C       <*>
7 12 1109776  C A,C,C,<*>
ADD REPLYlink modified 21 months ago • written 21 months ago by cpad011212k