Question: pulling out a single pseudochromosome from my vcf file
1
gravatar for truebeliever24
3 months ago by
truebeliever2420 wrote:

Hi everyone,

I am trying to extract a single chromosome from my dataset so that I can run analyses using only this sex chromosome and compare to results of similar analyses on the entire genome. However, all of my chromosomes are labeled as "Pseudochromosomes", i.e., the chromosome I want to extract and make it's own vcf file is "Pseudochromosome_Z". I read that the following command should do the trick for regularly named chromosomes:

 grep -w '^#\|^#CHROM\|^chr[Z]' selas.dp1maxmiss50maf05f_subset_.012.vcf > Z.vcf

However, how do I do this for pseudochromosomes? Thanks for bearing with my ignorance.

genome • 188 views
ADD COMMENTlink modified 3 months ago by genomax71k • written 3 months ago by truebeliever2420

grep -v will work to exclude things specified. So try adding -v to your command above.

ADD REPLYlink written 3 months ago by genomax71k

Thanks for the reply. Unfortunately, that is not working for me. When I add "-v", the new file is the same size as the original vcf. I want a new vcf file that will contain only the Z chromosome. When I try without "v", the new file is only 10kb, tiny and containing no information besides a chromosome list.

ADD REPLYlink written 3 months ago by truebeliever2420

Is your regular expression correct?

ADD REPLYlink written 3 months ago by genomax71k

I'm not entirely sure. I don't have a vcf file with chromosomes labeled that would fit as "chr". Only the vcf file with Pseudochromosomes, so I can't really test whether the original line works. I got the line from here: How to extract specific chromosome from vcf file

I just tried this: grep -v -w '^#\|^#CHROM\|^chr[Z]' selas.dp1maxmiss50maf05f_subset_.012.vcf > Z.vcf

Any idea what I'm doing wrong? I'm guessing it probably has no idea what "Z" is since I believe it's looking for just a chromosome number and not a pseudochromosome. Thanks!

ADD REPLYlink written 3 months ago by truebeliever2420

You have to use the actual names that are in your file in expression. Show the header of your vcf file and a few lines. Someone can correct the expression.

ADD REPLYlink written 3 months ago by genomax71k

I've supplied what I think is the correct information below. Would you mind helping me correcting my expression? I believe it is something close to the following:

 grep -v -w '^#\|^#CHROM\|^chr[Z]' selas.dp1maxmiss50maf05f_subset_.012.vcf > Z.vcf

Thanks!

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##bcftoolsVersion=1.8+htslib-1.8
##bcftoolsCommand=mpileup -Ou -q 20 -a DP -r Pseudochromosome_33 -f /rhome/abrelsford/shared/bwaDb/bCalAnn1_v1.p.fasta -b bamlist.txt
##reference=file:///rhome/abrelsford/shared/bwaDb/bCalAnn1_v1.p.fasta
##contig=<ID=Pseudochromosome_1,length=197551010>
##contig=<ID=Pseudochromosome_2,length=151342139>
##contig=<ID=Pseudochromosome_3,length=114810999>
##contig=<ID=Pseudochromosome_4,length=18597117>
##contig=<ID=Pseudochromosome_4A,length=44745344>
##contig=<ID=Pseudochromosome_4B,length=23291945>
##contig=<ID=Pseudochromosome_5,length=16645885>
##contig=<ID=Pseudochromosome_5A,length=43880846>
##contig=<ID=Pseudochromosome_6,length=35401958>
##contig=<ID=Pseudochromosome_7,length=39139214>
##contig=<ID=Pseudochromosome_8,length=31090148>
##contig=<ID=Pseudochromosome_9,length=25686456>
##contig=<ID=Pseudochromosome_10,length=22664390>
##contig=<ID=Pseudochromosome_11,length=20302349>
##contig=<ID=Pseudochromosome_12,length=21352500>
##contig=<ID=Pseudochromosome_13,length=17696115>
##contig=<ID=Pseudochromosome_14,length=15497061>
##contig=<ID=Pseudochromosome_15,length=13887164>
##contig=<ID=Pseudochromosome_17,length=10473655>
##contig=<ID=Pseudochromosome_18,length=11617720>
##contig=<ID=Pseudochromosome_19,length=11166003>
##contig=<ID=Pseudochromosome_20,length=15116875>
##contig=<ID=Pseudochromosome_21,length=7710055>
##contig=<ID=Pseudochromosome_22,length=5198135>
##contig=<ID=Pseudochromosome_23,length=6423903>
##contig=<ID=Pseudochromosome_24,length=6461084>
##contig=<ID=Pseudochromosome_25,length=2161911>
##contig=<ID=Pseudochromosome_26,length=6306732>
##contig=<ID=Pseudochromosome_27,length=5749075>
##contig=<ID=Pseudochromosome_28,length=5713987>
##contig=<ID=Pseudochromosome_33,length=1888140>
##contig=<ID=Pseudochromosome_Z,length=74081004>
ADD REPLYlink modified 3 months ago by genomax71k • written 3 months ago by truebeliever2420
3
gravatar for genomax
3 months ago by
genomax71k
United States
genomax71k wrote:

While following may work,

grep -w 'Pseudochromosome_Z' your.vcf > Z.vcf

you should ideally use bcftools for this since it will preserve the headers etc. You will need to add these to VCF yourself with method above.

bcftools view input.vcf.gz --regions Pseudochromosome_Z

You will need to sort and tabix compress the vcf file. (Ref: How to extract specific chromosome from vcf file )

ADD COMMENTlink modified 3 months ago • written 3 months ago by genomax71k

Thanks! I've made some progress now! I was able to compress and index the vcf file, and this command (bcftools view input.vcf.gz --regions Pseudochromosome_Z) worked, allowing me to view the Z chromosome. How can I save a separate vcf file with only the information from the Z chromosome?

ADD REPLYlink written 3 months ago by truebeliever2420
bcftools view input.vcf.gz --regions Pseudochromosome_Z > Z.vcf
ADD REPLYlink written 3 months ago by genomax71k

Thank you for your help! This resolved the issue.

ADD REPLYlink written 3 months ago by truebeliever2420

You can accept the parent answer to provide closure to this thread.

ADD REPLYlink written 3 months ago by genomax71k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 572 users visited in the last hour