fill-from-fasta doesnt fill missing values in REF
1
0
Entering edit mode
3 months ago
Serij´s • 0

Hi,

I'm encountering an issue with multiple gVCF files, all of which contain "N"-nucleotides in the REF column. In an attempt to resolve this, I used the fill-from-fasta plugin-function from bcftools with the following command:

bcftools +fill-from-fasta input_vcf.g.vcf.gz -- -c REF -f path/to/fasta_file/Homo_sapiens_assembly38.fasta

GATK was used to generate the gvcf-files. Thats why I downloaded and used the fasta-file from the GATK-github-repository.

I generated the gVCF files using GATK, and I downloaded the fasta file from the GATK GitHub repository.

However, the problem persists, and the "N"-nucleotides remain in the REF column.

bcftools query -f '[%CHROM\t%POS\t%REF\t%ALT\n]' inputvcf.g.vcf.gz|grep -v -E '[ACTG]' | head -n 10

enter image description here

Is there something wrong with my code, the fasta file or something else? Do you have any othe suggestions on how to solve that problem? Any insights or suggestions on how to address this issue would be greatly appreciated.

Thank you in advance.

bcftools variant-calling vcf • 589 views
ADD COMMENT
0
Entering edit mode
ADD REPLY
0
Entering edit mode

Thanks to Pierre for finding the cross-post

Please keep in mind that posting the same question to multiple sites can be perceived as bad etiquette, because efforts may be made to address a problem that has already been solved elsewhere in the meantime.

The helpful thing to do if you do decide to post on multiple forums is to add a link to the other forum posts on each post so people will look at the other posts before investing their effort.

ADD REPLY
0
Entering edit mode
3 months ago
DBScan ▴ 300

Nothing is wrong, the hg38 reference contains Ns at these positions.

Instead of grep -v -E '[ACTG]', take a look at the -i and -e arguments from BCFtools.

ADD COMMENT
0
Entering edit mode

yep

$ samtools faidx hg38.fasta "chr1:121907777-121907777"
>chr1:121907777-121907777
N
ADD REPLY
0
Entering edit mode

Thank you.

Does that mean, there is nothing I can do? Should I just remove the rows with the N-values?

ADD REPLY
0
Entering edit mode

it depends of your downstream analysis....

ADD REPLY
0
Entering edit mode

Also, originally I used bcftools query -f '[%CHROM\t%POS\t%REF\t%ALT\n]' inputvcf.g.vcf.gz|grep -v -E '[ACTG]' | head -n 10 for detecting missing values in my REF-Colums.

I am not sure if all missing values in REF are labeled as N. Are N the missing values, or are there additional missing values that my code cannot detect?

Is there an existing function for detecting missing values in REF and ALT?

ADD REPLY
0
Entering edit mode
bcftools norm --check-ref w --fasta-ref in.fa
ADD REPLY
0
Entering edit mode

thank you that works, but I was wondering how to save the result of the analysis and save it to a txt-file

I use the following code to extract the result:

grep "Lines total/split/realigned/skipped"

Is there a better way?

I also can´t save the result I grep in a file.

I Tried the following:

>> missing_data.txt

I also tried tree, > and 2>&1.

Nothing worked.

ADD REPLY

Login before adding your answer.

Traffic: 3798 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6