Question: Replacing the chromosome names and position notation in a VCF
umermehar10 wrote (2.6 years ago):

Dear all, I have a combined VCF file for a few individuals. Instead of the usual chr1, chr2 notation for chromosomes, this VCF gives the chromosome information as

gi|996703411|ref|NW_015379183.1|, gi|996703411|ref|NW_015379175.1

In this notation, NW_015379183.1 corresponds to a specific chromosome. The same is true for its positions. If I have the chromosome numbers for all of the gi|996703411|ref|NW_015379183.1| style names, how can I replace them with the original chromosome names?


Pierre Lindenbaum commented (2.6 years ago):

See the related post "VCF files: Change Chromosome Notation".
Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote (2.6 years ago):

Use bcftools annotate (https://samtools.github.io/bcftools/bcftools.html) with the option

--rename-chrs file: rename chromosomes according to the map in file, with "old_name new_name\n" pairs separated by whitespaces, each on a separate line.
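As a rough sketch of how this option could be driven from a script, the Python below writes the map file and assembles the bcftools call. The file names (chr_map.txt and the VCF paths) and the helper names are placeholders chosen for illustration, not anything from the bcftools documentation; only `annotate`, `--rename-chrs` and `-o` are bcftools options.

```python
import shutil
import subprocess


def write_rename_map(pairs, path):
    """Write one "old_name new_name" pair per line, the layout --rename-chrs reads."""
    with open(path, "w") as handle:
        for old_name, new_name in pairs:
            handle.write(old_name + "\t" + new_name + "\n")


def build_command(map_path, in_vcf, out_vcf):
    """Return the bcftools call as an argument list.

    Passing a list (no shell) means the '|' characters in the gi names
    never need quoting.
    """
    return ["bcftools", "annotate", "--rename-chrs", map_path, "-o", out_vcf, in_vcf]


def rename_with_bcftools(pairs, in_vcf, out_vcf, map_path="chr_map.txt"):
    """Hypothetical wrapper: write the map, then run bcftools if it is installed."""
    write_rename_map(pairs, map_path)
    if shutil.which("bcftools") is None:
        raise RuntimeError("bcftools not found on PATH")
    subprocess.run(build_command(map_path, in_vcf, out_vcf), check=True)
```

For example, `rename_with_bcftools([("gi|996703411|ref|NW_015379183.1|", "chr1")], "in.vcf", "out.vcf")` would write a one-line map and run `bcftools annotate --rename-chrs chr_map.txt -o out.vcf in.vcf`.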

Bastien Hervé (Karolinska Institutet, Sweden) wrote (2.6 years ago):

As said above, bcftools is the best way to do it. If you want a code version, here is mine.

Assuming that you have a matching_table.txt file like this (tab-separated):

gi|996703411|ref|NW_015379183.1|	chr1
gi|996703411|ref|NW_015379184.1|	chr2
gi|996703411|ref|NW_015379185.1|	chr3
gi|996703411|ref|NW_015379186.1|	chr4
gi|996703411|ref|NW_015379187.1|	chr5

Python version:

### Create a dictionary from matching_table.txt (gi notation -> chromosome name)
match_dict = {}
### Open your match table
with open("matching_table.txt", "r") as match_f:
    ### For each line, create a key/value item in the dictionary
    for line in match_f:
        ### Skip blank lines, which would otherwise raise an IndexError below
        if not line.strip():
            continue
        fields = line.rstrip("\r\n").split("\t")
        gi_notation, chr_notation = fields[0], fields[1]
        ### Check that the key does not already exist in the dictionary
        if gi_notation not in match_dict:
            match_dict[gi_notation] = chr_notation
        else:
            print("Care, duplicate in matching_table.txt, on : " + str(gi_notation))

### Open your VCF file for reading and the new one for writing
### ("w" instead of "a", so that running the script twice does not append to an old result)
with open("your_vcf_file.vcf", "r") as vcf_f, open("your_new_vcf_file.vcf", "w") as new_vcf_file:
    ### Read it line by line
    for line in vcf_f:
        if line.startswith("##contig"):
            ### Header such as ##contig=<ID=name,length=...>: get the chromosome name
            headers_chromosome = line.split("=")[2].split(",")[0].rstrip(">\r\n")
            ### Replace the name in this header line only, and only if it is in the dictionary
            if headers_chromosome in match_dict:
                line = line.replace(headers_chromosome, match_dict[headers_chromosome])
            new_vcf_file.write(line)
        elif line.startswith("#"):
            ### Write the other metadata lines unchanged
            new_vcf_file.write(line)
        else:
            ### Data line: the chromosome is the first tab-separated column
            current_chromosome = line.split("\t", 1)[0]
            if current_chromosome in match_dict:
                ### Change the chromosome and keep the rest of the line as it is
                new_vcf_file.write(match_dict[current_chromosome] + line[len(current_chromosome):])
            else:
                ### Not in the dictionary (I write it as it is but you can do something else...)
                print("This chromosome is not in my matching_table.txt : " + str(current_chromosome))
                new_vcf_file.write(line)

umermehar10 wrote (2.6 years ago):

I am trying the script you generously wrote for me, but I get the following error: IndentationError: unexpected indent

    Chr1 = line.rstrip().split("\t")[1]

although I am trying to keep the correct indentation in the code:

for line in match_f:
    gi_notation = line.rstrip().split("\t")[0]
    chr_notation = line.rstrip().split("\t")[1]

Bastien Hervé wrote (2.6 years ago):

I found it, and it works well for me now. Note that it is easier to find an error when you provide the error message and the line involved. As Ram said, if you want to go deeper into file manipulation, you have to try coding on your own. That is why I left some comments in the code (the lines starting with '###') to give a better understanding of the process. This code only does what you asked for, but you can reuse some lines to write a new script, such as opening a file (with open("your_vcf_file.vcf", 'r') as vcf_f) or looping over its lines (for line in match_f).
RamRS (Baylor College of Medicine, Houston, TX) wrote (2.6 years ago):

Create a file with two columns: the gi| notation and the notation you want.

Read this file into a dictionary. Then read the VCF file in a streaming fashion and substitute each occurrence of the old notation with the new one.

This is the strategy. I'd recommend coding this yourself so you can use it as a learning experience.
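For readers who want to check their own attempt against something, here is one possible sketch of that strategy, written as a line-by-line filter so the whole VCF is never held in memory. The function names are made up for this example, and it assumes that only the CHROM column and the ##contig header IDs need renaming (names that appear elsewhere, for instance in breakend ALT fields, are left untouched).

```python
import sys


def load_mapping(lines):
    """Build {old_name: new_name} from two-column, whitespace-separated lines."""
    mapping = {}
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:  # skips blank or incomplete lines
            mapping[fields[0]] = fields[1]
    return mapping


def rename_stream(vcf_lines, mapping):
    """Yield VCF lines one at a time with chromosome names substituted."""
    prefix = "##contig=<ID="
    for line in vcf_lines:
        if line.startswith(prefix):
            rest = line[len(prefix):]
            # The ID ends at the first comma or closing bracket, whichever comes first.
            end = min(i for i in (rest.find(","), rest.find(">"), len(rest)) if i >= 0)
            old = rest[:end]
            yield prefix + mapping.get(old, old) + rest[end:]
        elif line.startswith("#"):
            yield line  # other metadata passes through unchanged
        else:
            chrom, sep, tail = line.partition("\t")
            yield mapping.get(chrom, chrom) + sep + tail


def main(table_path):
    """Hypothetical entry point: rename a VCF read on stdin and write it to stdout."""
    with open(table_path) as table:
        mapping = load_mapping(table)
    sys.stdout.writelines(rename_stream(sys.stdin, mapping))
```

Calling `main("matching_table.txt")` from a script run as `python rename_chroms.py < in.vcf > out.vcf` (the script name is only an example) would stream the renamed file; names missing from the table are passed through as they are.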
