Question: how can I keep only those SNP in file "a" which match to SNPs in file "b"?
0
gravatar for Ana
16 months ago by
Ana110
Ana110 wrote:

I have a file that I have filtered my SNPs for LD (in the example below;my.filtered.snp.id). I want to keep only these SNPs in my genotype matrix (geno_snp), I figured how to do it in R but the file is so big (~7GB) can not load it in R. I want to keep those lines (the whole line including snp.id and genotype information) in the genotype matrix where snp.id matches with snp.id in my my.filtered.snp.id and delete those that are not match. Can anyone help me to know how can I do it in Perl or python?

head my.filtered.snp.id)
 Chr10_31458
 Chr10_31524
 Chr10_45901
 Chr10_102754
 Chr10_102828
 Chr10_103480

head (geno_snp)
XRQChr10_103805 NA NA NA 0 NA 0 NA NA NA NA NA 0 0
XRQChr10_103937 NA NA NA 0 NA 1 NA NA NA NA NA 0 2
XRQChr10_103990 NA NA NA 0 NA 0 NA NA NA NA NA 0 NA
genotype file row • 497 views
ADD COMMENTlink modified 16 months ago by WouterDeCoster34k • written 16 months ago by Ana110
4
gravatar for WouterDeCoster
16 months ago by
Belgium
WouterDeCoster34k wrote:

I would suggest using

grep -f my.filtered.snp.id geno_snp >tadaam.filtered.txt
ADD COMMENTlink modified 16 months ago • written 16 months ago by WouterDeCoster34k

Yes, grep can easily do the job for me. I did not try Python scripts though .... thanks guys

ADD REPLYlink written 16 months ago by Ana110
0
gravatar for DVA
16 months ago by
DVA480
United States
DVA480 wrote:

You could use dictionary in python.

#create dictionary using your filtered.id file
idDict={}
with open("my.filtered.snp.id", "r") as input:
    for line in input:
        #Assume your line ends with "\n" - if it ends with "\r\n" please change the code below accordingly.
        key = line.split("\n")[0]
        if key not in idDict:
            idDict[key]=0

#for your matrix, compare against the dictionary
with open("geno_snp","r") as matrix, open ("output","w") as matrix_output:
    for line in matrix:
        #assume your deliminator is "\t"
        SNPID = line.split("\t")[0]
        #to exclude "XRQ" --- extract the string from the 4th character
        SNPID = SNPID[3:]

        if SNPID in idDict:
            matrix_output.write(line)
ADD COMMENTlink modified 16 months ago • written 16 months ago by DVA480
1

Good to see some Python code here :) Thanks for contributing!
But what's the added value of using a dictionary? Why not just a list?
Also notice that you don't have to check if key in idDdict: duplicates don't make a difference.

My adaptation of your code:

#create list using your filtered.id file
with open("my.filtered.snp.id", "r") as input:
    idlist = [line.strip() for line in input]

#for your matrix, compare against the list
with open("geno_snp","r") as matrix, open ("output","w") as matrix_output:
    for line in matrix:
        if line.split(" ")[0][3:] in idlist:
            matrix_output.write(line)

Also, this: SNPID = line.split("\n")[0] is a bit odd. Why would you split on newlines? I don't think this line works as you think it does. You need to split on the field separator (which is looks like a space in OPs example).

ADD REPLYlink modified 16 months ago • written 16 months ago by WouterDeCoster34k

You are right about the line. I edited my answer. As for dictionary, I'm used to it to avoid any duplicated record in filtered.id file. That said, I guess even there are duplicated records, it does not matter in this case. Thank you for your reply.

ADD REPLYlink written 16 months ago by DVA480
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 642 users visited in the last hour