Question: pulling out a single pseudochromosome from my vcf file
1
gravatar for offspring18213
12 days ago by
offspring1821310 wrote:

Hi everyone,

I am trying to extract a single chromosome from my dataset so that I can run analyses using only this sex chromosome and compare to results of similar analyses on the entire genome. However, all of my chromosomes are labeled as "Pseudochromosomes", i.e., the chromosome I want to extract and make it's own vcf file is "Pseudochromosome_Z". I read that the following command should do the trick for regularly named chromosomes:

 grep -w '^#\|^#CHROM\|^chr[Z]' selas.dp1maxmiss50maf05f_subset_.012.vcf > Z.vcf

However, how do I do this for pseudochromosomes? Thanks for bearing with my ignorance.

genome • 109 views
ADD COMMENTlink modified 12 days ago by genomax68k • written 12 days ago by offspring1821310

grep -v will work to exclude things specified. So try adding -v to your command above.

ADD REPLYlink written 12 days ago by genomax68k

Thanks for the reply. Unfortunately, that is not working for me. When I add "-v", the new file is the same size as the original vcf. I want a new vcf file that will contain only the Z chromosome. When I try without "v", the new file is only 10kb, tiny and containing no information besides a chromosome list.

ADD REPLYlink written 12 days ago by offspring1821310

Is your regular expression correct?

ADD REPLYlink written 12 days ago by genomax68k

I'm not entirely sure. I don't have a vcf file with chromosomes labeled that would fit as "chr". Only the vcf file with Pseudochromosomes, so I can't really test whether the original line works. I got the line from here: How to extract specific chromosome from vcf file

I just tried this: grep -v -w '^#\|^#CHROM\|^chr[Z]' selas.dp1maxmiss50maf05f_subset_.012.vcf > Z.vcf

Any idea what I'm doing wrong? I'm guessing it probably has no idea what "Z" is since I believe it's looking for just a chromosome number and not a pseudochromosome. Thanks!

ADD REPLYlink written 12 days ago by offspring1821310

You have to use the actual names that are in your file in expression. Show the header of your vcf file and a few lines. Someone can correct the expression.

ADD REPLYlink written 11 days ago by genomax68k

I've supplied what I think is the correct information below. Would you mind helping me correcting my expression? I believe it is something close to the following:

 grep -v -w '^#\|^#CHROM\|^chr[Z]' selas.dp1maxmiss50maf05f_subset_.012.vcf > Z.vcf

Thanks!

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##bcftoolsVersion=1.8+htslib-1.8
##bcftoolsCommand=mpileup -Ou -q 20 -a DP -r Pseudochromosome_33 -f /rhome/abrelsford/shared/bwaDb/bCalAnn1_v1.p.fasta -b bamlist.txt
##reference=file:///rhome/abrelsford/shared/bwaDb/bCalAnn1_v1.p.fasta
##contig=<ID=Pseudochromosome_1,length=197551010>
##contig=<ID=Pseudochromosome_2,length=151342139>
##contig=<ID=Pseudochromosome_3,length=114810999>
##contig=<ID=Pseudochromosome_4,length=18597117>
##contig=<ID=Pseudochromosome_4A,length=44745344>
##contig=<ID=Pseudochromosome_4B,length=23291945>
##contig=<ID=Pseudochromosome_5,length=16645885>
##contig=<ID=Pseudochromosome_5A,length=43880846>
##contig=<ID=Pseudochromosome_6,length=35401958>
##contig=<ID=Pseudochromosome_7,length=39139214>
##contig=<ID=Pseudochromosome_8,length=31090148>
##contig=<ID=Pseudochromosome_9,length=25686456>
##contig=<ID=Pseudochromosome_10,length=22664390>
##contig=<ID=Pseudochromosome_11,length=20302349>
##contig=<ID=Pseudochromosome_12,length=21352500>
##contig=<ID=Pseudochromosome_13,length=17696115>
##contig=<ID=Pseudochromosome_14,length=15497061>
##contig=<ID=Pseudochromosome_15,length=13887164>
##contig=<ID=Pseudochromosome_17,length=10473655>
##contig=<ID=Pseudochromosome_18,length=11617720>
##contig=<ID=Pseudochromosome_19,length=11166003>
##contig=<ID=Pseudochromosome_20,length=15116875>
##contig=<ID=Pseudochromosome_21,length=7710055>
##contig=<ID=Pseudochromosome_22,length=5198135>
##contig=<ID=Pseudochromosome_23,length=6423903>
##contig=<ID=Pseudochromosome_24,length=6461084>
##contig=<ID=Pseudochromosome_25,length=2161911>
##contig=<ID=Pseudochromosome_26,length=6306732>
##contig=<ID=Pseudochromosome_27,length=5749075>
##contig=<ID=Pseudochromosome_28,length=5713987>
##contig=<ID=Pseudochromosome_33,length=1888140>
##contig=<ID=Pseudochromosome_Z,length=74081004>
ADD REPLYlink modified 11 days ago by genomax68k • written 11 days ago by offspring1821310
3
gravatar for genomax
11 days ago by
genomax68k
United States
genomax68k wrote:

While following may work,

grep -w 'Pseudochromosome_Z' your.vcf > Z.vcf

you should ideally use bcftools for this since it will preserve the headers etc. You will need to add these to VCF yourself with method above.

bcftools view input.vcf.gz --regions Pseudochromosome_Z

You will need to sort and tabix compress the vcf file. (Ref: How to extract specific chromosome from vcf file )

ADD COMMENTlink modified 11 days ago • written 11 days ago by genomax68k

Thanks! I've made some progress now! I was able to compress and index the vcf file, and this command (bcftools view input.vcf.gz --regions Pseudochromosome_Z) worked, allowing me to view the Z chromosome. How can I save a separate vcf file with only the information from the Z chromosome?

ADD REPLYlink written 11 days ago by offspring1821310
bcftools view input.vcf.gz --regions Pseudochromosome_Z > Z.vcf
ADD REPLYlink written 11 days ago by genomax68k

Thank you for your help! This resolved the issue.

ADD REPLYlink written 5 days ago by offspring1821310

You can accept the parent answer to provide closure to this thread.

ADD REPLYlink written 5 days ago by genomax68k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1836 users visited in the last hour