Question

Contig name difference due to reference genome

4.8 years ago

nuketbilgen ▴ 40

Hi everyone,

I have VCF files for 4 feline genomes, but the VCF headers contain different contig names. I checked the reference line in each header; the two values are shown below.

reference=file:///ifswh1/BC_COM_P1/F18FTSEUHT0898/CATsxlR/analysis/index/GCF_000181335.3_Felis_catus_9.0_genomic.fa
reference=file:///ifshk5/BC_AS/BC_COM_P0/F19FTSEUHT0354/CATbelR/2016/result/index/felCat9.fa

Two of my genomes were aligned to the first reference and the other two to the second. I want to merge these VCFs and run an LD analysis, but I cannot.

How can I solve this? Thanks...

next-gen genome alignment • 1.3k views

ADD COMMENT • link 4.8 years ago by nuketbilgen ▴ 40

Are they the same genome builds?

ADD REPLY • link 4.8 years ago by GenoMax 141k

A quick Google search shows that felCat9.fa comes from the UCSC Genome Browser and GCF_000181335.3_Felis_catus_9.0_genomic.fa from NCBI.

ADD REPLY • link 4.8 years ago by jean.elbers ★ 1.7k

Yes, exactly. When I split the VCF files by chromosome with the SnpSift split command, I got 40 files for the files aligned to felCat9.fa and 426 files for the ones aligned to the NCBI reference. I am worried about losing important variants...
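To put a number on the worry about losing variants, one option is to count how many records in each file sit on contigs that the other file does not have. The following is only a minimal sketch: the function names and the tiny example records are made up for illustration, and it assumes uncompressed, tab-delimited VCF text.

```python
from collections import Counter

def count_records_per_contig(lines):
    """Count VCF data records per CHROM value (column 1), skipping header lines."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        counts[line.split("\t", 1)[0]] += 1
    return counts

def records_on_unshared_contigs(counts_a, counts_b):
    """For each input, count the records on contigs that the other input lacks."""
    only_a = sum(n for contig, n in counts_a.items() if contig not in counts_b)
    only_b = sum(n for contig, n in counts_b.items() if contig not in counts_a)
    return only_a, only_b

# Made-up records for illustration; real input would be the lines of each VCF.
HEADER = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"
vcf_one = [
    HEADER,
    "chrA1\t100\t.\tA\tG\t50\tPASS\t.",
    "chrA1_random\t200\t.\tC\tT\t50\tPASS\t.",
]
vcf_two = [
    HEADER,
    "chrA1\t100\t.\tA\tG\t50\tPASS\t.",
    "chrA1_NW_019365239v1_random\t150\t.\tC\tT\t50\tPASS\t.",
    "chrA1_NW_019365240v1_random\t75\t.\tG\tA\t50\tPASS\t.",
]

counts_one = count_records_per_contig(vcf_one)
counts_two = count_records_per_contig(vcf_two)
print(records_on_unshared_contigs(counts_one, counts_two))  # (1, 2)
```

If the counts on unshared contigs are small compared with the main chromosomes, restricting the LD analysis to the shared contigs may be an acceptable compromise.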

ADD REPLY • link 4.8 years ago by nuketbilgen ▴ 40

I think the Biostars community needs more information about your data to help you, such as how the VCF files were produced. If the only difference is in naming, then a quick regular expression or search-and-replace command can change the column 1 value from the old, undesired name to the new, desired name.

perl -pe "s/oldname/newname/g" input.vcf > output.vcf

Note that the command above assumes that oldname occurs only in column 1 of the VCF file.
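If oldname might also appear elsewhere (for example in an INFO field), the replacement can be restricted to column 1 and to the ID of the ##contig header lines. Here is a minimal sketch of that idea; `rename_contigs` and the example mapping are hypothetical names, and it assumes plain tab-delimited VCF text. It is only appropriate when the two names refer to the same sequence with the same coordinates.

```python
import re

# Matches the start of a contig header line up to and including the ID value.
CONTIG_ID = re.compile(r"^(##contig=<ID=)([^,>]+)")

def rename_contigs(lines, mapping):
    """Yield VCF lines with contig names replaced according to `mapping`.

    Only the ID of ##contig header lines and column 1 (CHROM) of data
    records are changed; all other text passes through untouched.
    """
    for line in lines:
        if line.startswith("##contig="):
            match = CONTIG_ID.match(line)
            if match and match.group(2) in mapping:
                line = match.group(1) + mapping[match.group(2)] + line[match.end():]
        elif not line.startswith("#"):
            chrom, sep, rest = line.partition("\t")
            line = mapping.get(chrom, chrom) + sep + rest
        yield line

# Made-up example: "oldname" also occurs in INFO and must stay there.
example = [
    "##contig=<ID=oldname,length=1000>\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "oldname\t10\t.\tA\tG\t50\tPASS\tNOTE=oldname\n",
]
renamed = list(rename_contigs(example, {"oldname": "newname"}))
print("".join(renamed), end="")
```

This sketch does not touch contig names inside ALT breakend notation or other header fields, so files with structural variants would need more care.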

ADD REPLY • link 4.8 years ago by jean.elbers ★ 1.7k

Hi again. The VCF files were generated with the GATK HaplotypeCaller walker, using this haplotype calling command:

java -jar GenomeAnalysisTK-3.3-0/GenomeAnalysisTK.jar -T HaplotypeCaller -R all.chrs.con.fa -L TEST_Chr01 -I aligned_reads.sorted.dedup.bam --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -o TEST_Chr01.gvcf

You can find examples of the contig lines below. These contigs also carry variants: if one file has a variant on "contig=<ID=chrA1_NW_019365239v1_random,length=46965>", the same variant is located on "contig=<ID=chrA1_random,length=415283>" in the other two files. So the chromosome names differ even for SNPs at the same position...

Contig examples from the first two files:

contig=<ID=chrA1,length=242100913>
contig=<ID=chrA1_random,length=415283>
contig=<ID=chrA2,length=171471747>
contig=<ID=chrA2_random,length=1187422>

Contig examples from the other two files:

contig=<ID=chrA1,length=242100913>
contig=<ID=chrA1_NW_019365239v1_random,length=46965>
contig=<ID=chrA1_NW_019365240v1_random,length=58068>
contig=<ID=chrA1_NW_019365241v1_random,length=50743>
contig=<ID=chrA1_NW_019365242v1_random,length=22574>
contig=<ID=chrA1_NW_019365243v1_random,length=50951>
contig=<ID=chrA1_NW_019365244v1_random,length=50765>
contig=<ID=chrA1_NW_019365245v1_random,length=14920>
contig=<ID=chrA1_NW_019365246v1_random,length=45003>
contig=<ID=chrA1_NW_019365247v1_random,length=40320>
contig=<ID=chrA1_NW_019365248v1_random,length=25974>
contig=<ID=chrA2,length=171471747>
. . .
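A quick way to see where the two headers agree is to parse the contig entries and compare names and lengths. The sketch below uses only the excerpts posted above, written in standard VCF header form (##contig=<ID=...,length=...>), which is an assumption about how the lines look in the actual files; `parse_contigs` is a hypothetical helper.

```python
import re

CONTIG_LINE = re.compile(r"contig=<ID=([^,>]+),length=(\d+)")

def parse_contigs(header_text):
    """Map contig ID -> length for every contig entry found in the text."""
    return {name: int(length) for name, length in CONTIG_LINE.findall(header_text)}

first_two = parse_contigs("""
##contig=<ID=chrA1,length=242100913>
##contig=<ID=chrA1_random,length=415283>
##contig=<ID=chrA2,length=171471747>
##contig=<ID=chrA2_random,length=1187422>
""")

other_two = parse_contigs("""
##contig=<ID=chrA1,length=242100913>
##contig=<ID=chrA1_NW_019365239v1_random,length=46965>
##contig=<ID=chrA1_NW_019365240v1_random,length=58068>
##contig=<ID=chrA1_NW_019365241v1_random,length=50743>
##contig=<ID=chrA1_NW_019365242v1_random,length=22574>
##contig=<ID=chrA1_NW_019365243v1_random,length=50951>
##contig=<ID=chrA1_NW_019365244v1_random,length=50765>
##contig=<ID=chrA1_NW_019365245v1_random,length=14920>
##contig=<ID=chrA1_NW_019365246v1_random,length=45003>
##contig=<ID=chrA1_NW_019365247v1_random,length=40320>
##contig=<ID=chrA1_NW_019365248v1_random,length=25974>
##contig=<ID=chrA2,length=171471747>
""")

# Contigs present in both excerpts, and whether their lengths agree.
shared = sorted(set(first_two) & set(other_two))
same_length = all(first_two[name] == other_two[name] for name in shared)

# Contigs that exist under only one naming scheme.
only_first = sorted(set(first_two) - set(other_two))
only_other = sorted(set(other_two) - set(first_two))

# Total length of the chrA1 scaffolds listed in the second excerpt.
scaffold_total = sum(
    length for name, length in other_two.items() if name.startswith("chrA1_NW_")
)

print(shared, same_length)
print(only_first)
print(len(only_other), scaffold_total, first_two["chrA1_random"] - scaffold_total)
```

In these excerpts the placed chromosomes (chrA1, chrA2) match in both name and length, so only the unplaced "random" sequences differ. The ten listed chrA1 scaffolds sum to 406,283 bp, which is 9,000 bp less than chrA1_random (415,283 bp). That would be consistent with chrA1_random being those scaffolds joined with padding between them, but the thread does not confirm it. If that is the case, positions on the "random" contigs would not line up between the two naming schemes, and renaming alone would not be enough for them.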

ADD REPLY • link 4.8 years ago by nuketbilgen ▴ 40

I know it's a long shot, but would you suggest merging the per-contig files by chromosome, like this?

I=PasaHardFiltered.chrA1_NW_019365239v1_random.vcf I=PasaHardFiltered.chrA1_NW_019365240v1_random.vcf I=PasaHardFiltered.chrA1_NW_019365241v1_random.vcf I=PasaHardFiltered.chrA1_NW_019365243v1_random.vcf I=PasaHardFiltered.chrA1_NW_019365244v1_random.vcf I=PasaHardFiltered.chrA1_NW_019365246v1_random.vcf I=PasaHardFiltered.chrA1_NW_019365247v1_random.vcf I=PasaHardFiltered.chrA1_NW_019365248v1_random.vcf O=PasaHardFilteredchrA1random.vcf

ADD REPLY • link updated 4.8 years ago by GenoMax 141k • written 4.8 years ago by nuketbilgen ▴ 40