Question: How to sequentially replace a string with numerical value?
1
gravatar for pedroamandrade
10 months ago by
Portugal
pedroamandrade40 wrote:

Hi everyone,

I have a list of snps with string chromosome names that I want to replace with a sequential numerical value (preparing a qqman input file). For example, turn this:

chrA
chrA
chrA
chrB
chrB
chrC
chrC
chrC
chrC
...
chrZ
chrZ

into this

1
1
1
2
2
3
3
3
3
..
26
26

At the moment I am doing this with a script, using something similar to this:

sed 's/chrA/1/g'
sed 's/chrB/2/g'
sed 's/chrC/3/g'
..
sed 's/chrZ/26/g'

However, my actual genome has a large number of contigs, so this command takes a lot of time to run. Is there a more efficient way to do this?

qqman genome • 479 views
ADD COMMENTlink modified 10 months ago by Malcolm.Cook770 • written 10 months ago by pedroamandrade40
5
gravatar for shenwei356
10 months ago by
shenwei3563.4k
China
shenwei3563.4k wrote:

Here's a solution with replace subcommand (usage) of csvtk, just download the .tar.gz file, decompress and you can run :)

First you have to prepare a mapping file, which is a plain tab-separated text file, you can easily use a spreadsheet software to create and export.

$ more mapping.tsv
chrA    1
chrB    2
chrC    3

I guess the SNP data file should be a tab-delimited file too. Here's a dummy one:

$ more data.tsv 
chrA    A       z 
chrA    A       x 
chrA    A       c 
chrB    B       v 
chrB    B       d 
chrC    C       tx
chrC    C       t
chrC    C       x
chrC    C       z

Then use csvtk to edit the SNP data file:

$ ./csvtk -H -t replace -f 1 -p '(.+)' -r '{kv}' -k mapping.tsv data.tsv 
[INFO] read key-value file: mapping.tsv
[INFO] 3 pairs of key-value loaded
1       A       z
1       A       x
1       A       c
2       B       v
2       B       d
3       C       tx
3       C       t
3       C       x
3       C       z

The long-option version would be easier to understand:

./csvtk --no-header-row --tabs replace --fields 1 --pattern '(.+)' --replacement '{kv}' --kv-file mapping.tsv data.tsv

PS: this is a general method not limited to this case. sed is good for single replacement, csvtk replace -k can handle multiple replacements well, which is written in Go with good performance.

PS2: seqkit has exactly the same function to handle FASTA/Q files.

ADD COMMENTlink modified 10 months ago • written 10 months ago by shenwei3563.4k

This solution works perfectly, and it is extremely quick, thank you very much!

ADD REPLYlink written 10 months ago by pedroamandrade40
4
gravatar for Charles Plessy
10 months ago by
Charles Plessy2.3k
Japan
Charles Plessy2.3k wrote:

Perhaps you can do it in R by converting to factor coercing it to numerical values ?

> l <- c("chrA", "chrA", "chrA", "chrB", "chrB", "chrC", "chrC", "chrC", "chrC")
> as.numeric(factor(l))
[1] 1 1 1 2 2 3 3 3 3
ADD COMMENTlink written 10 months ago by Charles Plessy2.3k

now that I've taken a look at what is qqman, I like this approach better than any other, including mine, since qqman is an R pipeline, and this is an R solution. I'd go with something like:

gwasResults$CHR=as.numeric(factor(gwasResults$CHR))
ADD REPLYlink written 10 months ago by Malcolm.Cook770
1
gravatar for bruce.moran
10 months ago by
bruce.moran350
Ireland
bruce.moran350 wrote:

Hi,

I would use command line Perl for this, specifically for the 'ord' function:

perl -ane '$F[0]=~s/chr//;$o=ord($F[0])-64;print "$o\n";' <input.file>

You still have to sed the "chr", then find the ord of that, then subtract 64 as the ord returns the ASCII numeric value.

Hope that helps.

ADD COMMENTlink written 10 months ago by bruce.moran350
1
gravatar for biocyberman
10 months ago by
biocyberman640
Denmark
biocyberman640 wrote:

Bash way if you prefer. Here are some cool bash tricks for you.

chr=( chr{A..Z} ) # make an array for chromosomes.
# now loop through the array to build sed command
sedcmd=""
for i in  {1..26}; do
 sedcmd=$sedcmd"s/${chr[$i]}/$i/;" # build up 26 sed replacement command :)
done
# run sed
sed -e "$sedcmd" inputfile > outputfile

Because sed works line by line and you may want to replace only the first match in the line so no need for the g option.

ADD COMMENTlink modified 10 months ago • written 10 months ago by biocyberman640
1
gravatar for Malcolm.Cook
10 months ago by
Malcolm.Cook770
kansas, usa
Malcolm.Cook770 wrote:

I'd say the problem is underspecified....

  • it looks like the chromosomes are all in order in the example data, but we are not told that this is guaranteed. Can we really depend on this? For instance, might "chrA" pop up again after seeing "chrB"?
  • it looks like they all begin with "chr" followed by a single letter, but is that really true of the actual data? The OP writes that there "a large number of contigs". I'm guess this is more than the 26 in the example data.
  • it looks like the list of chromosomes is "known" up front. It it really?
  • it looks like the input data has only a single column that needs to be mapped into integers. Really?!?

Here is a perl one-liner that only makes the last assumption... it builds the name-to-integer mapping "on the fly" in a data driven way.

perl -lpe '$_ = $x{$_} //= keys(%x)'  chom_names.txt > chrom_nums.txt

It uses a few "tricky" aspects of Perl, but they are worth studying and knowing.

The following modification avoids making that last assumption. It uses a few more "tricky" aspects. You can provide which column contains the chromosome in switch -col, and provide an output delimiter (tab, in this case) in switch -delim

 perl -slape 'BEGIN{$" = $delim} $F[$col] = $x{$F[$col]} ||= keys(%x); $_ = "@F"'  -- -col=2 -delim=$'\t' chom_names.txt > chrom_nums.txt

Good luck!

ADD COMMENTlink modified 10 months ago • written 10 months ago by Malcolm.Cook770

Nice ! Maybe perl -nE 'say $x{$_} //= keys(%x)' would be slightly more modern than perl -lpe '$_=$x{$_}||=keys(%x)'.

ADD REPLYlink modified 10 months ago • written 10 months ago by Charles Plessy2.3k

Erhm, you caught me! I've not yet made the jump to Perl 6. Do you advise it? Most of what I used to do in Perl I now do in either bash, R, or python anyway... hmmm...

ADD REPLYlink written 10 months ago by Malcolm.Cook770

Actually, the "modern" version above is plain perl5 in which all the Perl features have been enabled by using -E instead of -e. Here, it is used for the say function. In Perl 6, the one-liner would look like perl6 -ne 'state %x ; say %x{$_} //= %x.elems + 1'. I am still very much a beginner in Perl 6, but it looks interesting.

ADD REPLYlink written 10 months ago by Charles Plessy2.3k

Thanks for the education. And, I'm editing my old-school answer to adopt your use of //= instead of ||=, which arguably is preferable.

ADD REPLYlink written 10 months ago by Malcolm.Cook770

Despite of Perl6, th best improvement for Perl is Python :) Perl6 is a late "make up" of an old time champion. Just personal opinion.

ADD REPLYlink written 9 months ago by biocyberman640
0
gravatar for pedroamandrade
10 months ago by
Portugal
pedroamandrade40 wrote:

Thank you for all the quick responses. The solution by shenwei356 already worked, but I'll try the other ones and give feedback

ADD COMMENTlink written 10 months ago by pedroamandrade40
0
gravatar for Pierre Lindenbaum
10 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum101k wrote:

sed can read a file of multiple patterns:

sed -f pattern in.txt > out.txt

https://linux.die.net/man/1/sed


   -f script-file, --file=script-file
     add the contents of script-file to the commands to be executed
ADD COMMENTlink written 10 months ago by Pierre Lindenbaum101k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 798 users visited in the last hour