conditional replacing rows with 9
7.1 years ago
Ana ▴ 200

I have a directory containing nearly 11 million small SNP files, named like this:

wa_filtering_DP15_good_pops_snps_file_1
wa_filtering_DP15_good_pops_snps_file_2
.
.
.
wa_filtering_DP15_good_pops_snps_file_11232111

Each file has only 2 rows (the first row is the allele count for the wild allele and the second row is the allele count for the alternative allele) and 315 columns, and looks like this:

1   0   0   0   0   0   0   0   0   0   1   2   1   
0   0   0   0   0   0   0   0   0   0   0   0   0

I want to go through each file and, wherever both rows of a column are 0, replace both values with 9, to get something like this:

1   9   9   9   9   9   9   9   9   9   1   2   1   
0   9   9   9   9   9   9   9   9   9   0   0   0

Can someone help me figure out how to do that? Thanks

bash text-processing • 1.5k views
7.1 years ago
find . -type f -name "wa_filtering_DP15_good_pops_snps_fi*" ! -name "*.new" | while IFS= read -r F
do
    # Row 1 is kept in a[], row 2 in b[]; a column is printed as 9 in both rows when both counts are 0.
    awk 'NR==1 { split($0,a); next }
         NR==2 { split($0,b)
                 for(i=1;i<=NF;i++) printf("%s%s",(i==1?"":"\t"),(a[i]==0 && b[i]==0)?9:a[i]); printf("\n")
                 for(i=1;i<=NF;i++) printf("%s%s",(i==1?"":"\t"),(a[i]==0 && b[i]==0)?9:b[i]); printf("\n") }' "$F" > "$F.new"
done

Thanks so much @Pierre Lindenbaum, your code worked as well

7.1 years ago
st.ph.n ★ 2.7k
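#!/usr/bin/env python
import sys

# Progress indicator: the file number is the suffix after the last underscore.
sys.stdout.write('File Number: ' + sys.argv[1].split('_')[-1] + '\r')
sys.stdout.flush()

# Each input file holds exactly two rows of allele counts.
with open(sys.argv[1], 'r') as f:
    x = next(f).strip().split('\t')
    y = next(f).strip().split('\t')

with open(sys.argv[1] + '.w9', 'w') as out:
    # Where both alleles have a count of 0, set the column to 9 in both rows.
    for n in range(len(x)):
        if x[n] == '0' and y[n] == '0':
            x[n] = '9'
            y[n] = '9'
    out.write('\t'.join(x))
    out.write('\n')
    out.write('\t'.join(y))
    # End the last row with a newline so that .w9 files can be concatenated safely.
    out.write('\n')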

Output

1       9       9       9       9       9       9       9       9       9       1       2       1
0       9       9       9       9       9       9       9       9       9       0       0       0

Save as replace_w_9.py, then run: for file in wa_filtering_DP15_good_pops_snps_file_*; do python replace_w_9.py "$file"; done
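
With roughly 11 million input files, starting one Python interpreter per file may take up a large share of the run time. Below is a minimal sketch that handles every matching file in a single process, assuming the same two-row, whitespace-delimited layout. The names `mask_pair` and `process_directory` are illustrative and not part of the script above; the `.w9` suffix is reused from it.

```python
import os


def mask_pair(row1, row2):
    """Return copies of both rows with '9' wherever both counts are '0'."""
    out1, out2 = [], []
    for a, b in zip(row1, row2):
        if a == '0' and b == '0':
            a = b = '9'
        out1.append(a)
        out2.append(b)
    return out1, out2


def process_directory(directory, prefix='wa_filtering_DP15_good_pops_snps_file_'):
    """Write '<name>.w9' next to every two-row input file; return how many were processed."""
    done = 0
    # os.scandir returns an iterator over directory entries.
    with os.scandir(directory) as entries:
        for entry in entries:
            name = entry.name
            # Skip unrelated files and any .w9 outputs (from this or an earlier run).
            if not name.startswith(prefix) or name.endswith('.w9') or not entry.is_file():
                continue
            with open(entry.path) as f:
                row1 = next(f).split()
                row2 = next(f).split()
            new1, new2 = mask_pair(row1, row2)
            with open(entry.path + '.w9', 'w') as out:
                out.write('\t'.join(new1) + '\n')
                out.write('\t'.join(new2) + '\n')
            done += 1
    return done
```

Calling `process_directory('.')` from the directory that holds the SNP files would then replace the shell loop.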


Thanks so much @st.ph.n, yes, it worked. I have an additional question. I actually have a single SNP file in which the allele counts of each SNP across populations are represented by two lines, with the counts of the first allele on the first line and the counts of the second allele on the second. The example I showed above is the allele count of the first SNP (lines 1 and 2). At first I thought I could split the file into one file per SNP and run your Python code on each, but I think that will be very complicated. Is there any way to apply your Python code to the entire SNP file directly, instead of splitting it into small per-SNP files and running the script on each one? Thanks so much


OK, I found a solution for that! I just wrote this little bash script that splits the snpfile, runs your python script for each file, merges them together as a single file and deletes split files in the end:

#!/bin/bash

## directories
ROOT_DIR=/data/sh/H/lfmm/10K_random_SNPs_good_pop_LFMM_format/
FILE_DIR=${ROOT_DIR}/prep.lfmm.geno.file.test1

## locate files
INPUT=${FILE_DIR}/BayEnv_SNPSfile_random_1.tab.table
SCRIPT=${FILE_DIR}/replace.py
OUTPUT=${FILE_DIR}/lfmm.part1

## split the SNP file into 2-line chunks (one SNP per chunk)
split -l 2 -a 5 -d "${INPUT}" snp_batch_

## run the Python script on each chunk
for file in snp_batch_*;
do python "${SCRIPT}" "$file"
done

## merge the results; awk 1 prints every line and adds a final newline to any chunk that lacks one
awk 1 snp_batch_*.w9 >> "${OUTPUT}"

## delete the chunks and their .w9 outputs
rm -f snp_batch_*
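
Splitting is not strictly required: since every SNP occupies two consecutive lines, the file can be read in pairs and written straight to the output. Below is a minimal sketch, assuming a whitespace-delimited file with an even number of lines. `mask_snp_file` is a hypothetical name, and the input and output paths are whatever the caller supplies.

```python
def mask_snp_file(in_path, out_path):
    """Copy a file of two-line SNP records, writing 9 where both allele counts are 0.

    Returns the number of SNPs (line pairs) written.
    """
    n_snps = 0
    with open(in_path) as src, open(out_path, 'w') as dst:
        # zip(src, src) draws two consecutive lines per iteration from the same
        # file iterator; a trailing unpaired line, if any, is silently dropped.
        for line1, line2 in zip(src, src):
            row1, row2 = line1.split(), line2.split()
            for i, (a, b) in enumerate(zip(row1, row2)):
                if a == '0' and b == '0':
                    row1[i] = row2[i] = '9'
            dst.write('\t'.join(row1) + '\n')
            dst.write('\t'.join(row2) + '\n')
            n_snps += 1
    return n_snps
```

Because only one pair of lines is held in memory at a time, this should scale to the full SNP file without creating intermediate files.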

Can you post an example with multiple snps, from the file?
