Question: How to sequentially replace a string with numerical value?
1
gravatar for pedroamandrade
3 months ago by
Portugal
pedroamandrade20 wrote:

Hi everyone,

I have a list of snps with string chromosome names that I want to replace with a sequential numerical value (preparing a qqman input file). For example, turn this:

chrA
chrA
chrA
chrB
chrB
chrC
chrC
chrC
chrC
...
chrZ
chrZ

into this

1
1
1
2
2
3
3
3
3
..
26
26

At the moment I am doing this with a script, using something similar to this:

sed 's/chrA/1/g'
sed 's/chrB/2/g'
sed 's/chrC/3/g'
..
sed 's/chrZ/26/g'

However, my actual genome has a large number of contigs, so this command takes a lot of time to run. Is there a more efficient way to do this?

qqman genome • 328 views
ADD COMMENTlink modified 3 months ago by Malcolm.Cook720 • written 3 months ago by pedroamandrade20
5
gravatar for shenwei356
3 months ago by
shenwei3562.8k
China
shenwei3562.8k wrote:

Here's a solution with replace subcommand (usage) of csvtk, just download the .tar.gz file, decompress and you can run :)

First you have to prepare a mapping file, which is a plain tab-separated text file, you can easily use a spreadsheet software to create and export.

$ more mapping.tsv
chrA    1
chrB    2
chrC    3

I guess the SNP data file should be a tab-delimited file too. Here's a dummy one:

$ more data.tsv 
chrA    A       z 
chrA    A       x 
chrA    A       c 
chrB    B       v 
chrB    B       d 
chrC    C       tx
chrC    C       t
chrC    C       x
chrC    C       z

Then use csvtk to edit the SNP data file:

$ ./csvtk -H -t replace -f 1 -p '(.+)' -r '{kv}' -k mapping.tsv data.tsv 
[INFO] read key-value file: mapping.tsv
[INFO] 3 pairs of key-value loaded
1       A       z
1       A       x
1       A       c
2       B       v
2       B       d
3       C       tx
3       C       t
3       C       x
3       C       z

The long-option version would be easier to understand:

./csvtk --no-header-row --tabs replace --fields 1 --pattern '(.+)' --replacement '{kv}' --kv-file mapping.tsv data.tsv

PS: this is a general method not limited to this case. sed is good for single replacement, csvtk replace -k can handle multiple replacements well, which is written in Go with good performance.

PS2: seqkit has exactly the same function to handle FASTA/Q files.

ADD COMMENTlink modified 3 months ago • written 3 months ago by shenwei3562.8k

This solution works perfectly, and it is extremely quick, thank you very much!

ADD REPLYlink written 3 months ago by pedroamandrade20
4
gravatar for Charles Plessy
3 months ago by
Charles Plessy1.6k
Japan
Charles Plessy1.6k wrote:

Perhaps you can do it in R by converting to factor coercing it to numerical values ?

> l <- c("chrA", "chrA", "chrA", "chrB", "chrB", "chrC", "chrC", "chrC", "chrC")
> as.numeric(factor(l))
[1] 1 1 1 2 2 3 3 3 3
ADD COMMENTlink written 3 months ago by Charles Plessy1.6k

now that I've taken a look at what is qqman, I like this approach better than any other, including mine, since qqman is an R pipeline, and this is an R solution. I'd go with something like:

gwasResults$CHR=as.numeric(factor(gwasResults$CHR))
ADD REPLYlink written 3 months ago by Malcolm.Cook720
1
gravatar for bruce.moran
3 months ago by
bruce.moran180
Ireland
bruce.moran180 wrote:

Hi,

I would use command line Perl for this, specifically for the 'ord' function:

perl -ane '$F[0]=~s/chr//;$o=ord($F[0])-64;print "$o\n";' <input.file>

You still have to sed the "chr", then find the ord of that, then subtract 64 as the ord returns the ASCII numeric value.

Hope that helps.

ADD COMMENTlink written 3 months ago by bruce.moran180
1
gravatar for biocyberman
3 months ago by
biocyberman620
Denmark
biocyberman620 wrote:

Bash way if you prefer. Here are some cool bash tricks for you.

chr=( chr{A..Z} ) # make an array for chromosomes.
# now loop through the array to build sed command
sedcmd=""
for i in  {1..26}; do
 sedcmd=$sedcmd"s/${chr[$i]}/$i/;" # build up 26 sed replacement command :)
done
# run sed
sed -e "$sedcmd" inputfile > outputfile

Because sed works line by line and you may want to replace only the first match in the line so no need for the g option.

ADD COMMENTlink modified 3 months ago • written 3 months ago by biocyberman620
1
gravatar for Malcolm.Cook
3 months ago by
Malcolm.Cook720
kansas, usa
Malcolm.Cook720 wrote:

I'd say the problem is underspecified....

  • it looks like the chromosomes are all in order in the example data, but we are not told that this is guaranteed. Can we really depend on this? For instance, might "chrA" pop up again after seeing "chrB"?
  • it looks like they all begin with "chr" followed by a single letter, but is that really true of the actual data? The OP writes that there "a large number of contigs". I'm guess this is more than the 26 in the example data.
  • it looks like the list of chromosomes is "known" up front. It it really?
  • it looks like the input data has only a single column that needs to be mapped into integers. Really?!?

Here is a perl one-liner that only makes the last assumption... it builds the name-to-integer mapping "on the fly" in a data driven way.

perl -lpe '$_ = $x{$_} //= keys(%x)'  chom_names.txt > chrom_nums.txt

It uses a few "tricky" aspects of Perl, but they are worth studying and knowing.

The following modification avoids making that last assumption. It uses a few more "tricky" aspects. You can provide which column contains the chromosome in switch -col, and provide an output delimiter (tab, in this case) in switch -delim

 perl -slape 'BEGIN{$" = $delim} $F[$col] = $x{$F[$col]} ||= keys(%x); $_ = "@F"'  -- -col=2 -delim=$'\t' chom_names.txt > chrom_nums.txt

Good luck!

ADD COMMENTlink modified 3 months ago • written 3 months ago by Malcolm.Cook720

Nice ! Maybe perl -nE 'say $x{$_} //= keys(%x)' would be slightly more modern than perl -lpe '$_=$x{$_}||=keys(%x)'.

ADD REPLYlink modified 3 months ago • written 3 months ago by Charles Plessy1.6k

Erhm, you caught me! I've not yet made the jump to Perl 6. Do you advise it? Most of what I used to do in Perl I now do in either bash, R, or python anyway... hmmm...

ADD REPLYlink written 3 months ago by Malcolm.Cook720

Actually, the "modern" version above is plain perl5 in which all the Perl features have been enabled by using -E instead of -e. Here, it is used for the say function. In Perl 6, the one-liner would look like perl6 -ne 'state %x ; say %x{$_} //= %x.elems + 1'. I am still very much a beginner in Perl 6, but it looks interesting.

ADD REPLYlink written 3 months ago by Charles Plessy1.6k

Thanks for the education. And, I'm editing my old-school answer to adopt your use of //= instead of ||=, which arguably is preferable.

ADD REPLYlink written 3 months ago by Malcolm.Cook720

Despite of Perl6, th best improvement for Perl is Python :) Perl6 is a late "make up" of an old time champion. Just personal opinion.

ADD REPLYlink written 9 weeks ago by biocyberman620
0
gravatar for pedroamandrade
3 months ago by
Portugal
pedroamandrade20 wrote:

Thank you for all the quick responses. The solution by shenwei356 already worked, but I'll try the other ones and give feedback

ADD COMMENTlink written 3 months ago by pedroamandrade20
0
gravatar for Pierre Lindenbaum
3 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum92k wrote:

sed can read a file of multiple patterns:

sed -f pattern in.txt > out.txt

https://linux.die.net/man/1/sed


   -f script-file, --file=script-file
     add the contents of script-file to the commands to be executed
ADD COMMENTlink written 3 months ago by Pierre Lindenbaum92k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 915 users visited in the last hour