How to read VCF (v4.1) file? (student project) (samtools version 0.1.19)
0
0
Entering edit mode
3.1 years ago
c.clarido ▴ 80

Hello community,

Below is a fragment of my VCF file:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  aln.sorted.bam
11561   6   .   C   X   0   .   DP=1;I16=1,0,0,0,38,1444,0,0,6,36,0,0,0,0,0,0;QS=1.000000,0.000000,0.000000,0.000000    PL  0,3,6
11561   7   .   C   X   0   .   DP=1;I16=1,0,0,0,38,1444,0,0,6,36,0,0,1,1,0,0;QS=1.000000,0.000000,0.000000,0.000000    PL  0,3,6
11561   8   .   A   X   0   .   DP=1;I16=1,0,0,0,38,1444,0,0,6,36,0,0,2,4,0,0;QS=1.000000,0.000000,0.000000,0.000000    PL  0,3,6
11561   19  .   G   A,X 0   .   DP=2;I16=1,0,1,0,38,1444,38,1444,6,36,6,36,7,49,13,169;QS=0.500000,0.500000,0.000000,0.000000;RPB=9.668049e-01  PL  1,0,1,4,4,4
11561   337 .   ca  cAa 0   .   INDEL;IS=1,0.200000;DP=5;I16=1,1,1,0,68,2312,40,1600,2,2,15,225,35,697,6,36;QS=0.347826,0.652174,0.000000,0.000000;VDB=6.365454e-02 PL  6,0,0
11561   375 .   taaaaaaaa   taaaaaaa    0   .   INDEL;IS=1,0.200000;DP=5;I16=0,0,1,0,0,0,18,324,0,0,6,36,0,0,25,625;QS=0.000000,1.000000,0.000000,0.000000  PL  6,3,0
11561   390 .   A   T,G,X   0   .   DP=4;I16=0,0,2,1,0,0,114,4332,0,0,8,38,0,0,75,1875;QS=0.000000,0.571429,0.428571,0.000000;VDB=5.227153e-02  PL  5,6,0,5,2,2,5,6,5,5
11561   740 .   AC  ACC 0   .   INDEL;IS=4,0.102564;DP=39;I16=20,7,2,1,741,22083,49,819,457,12337,5,13,524,11426,48,866;QS=0.970732,0.029268,0.000000,0.000000;VDB=5.891593e-02 PL  0,54,149


I need to find the SNPs and INDELs, however I don't quiet understand the X under the ALT. Does it mean X-chomosome? Does the following mean that there is a SNP from G to A on chromosome X? How do I know that significance of this result? Is it the QS=0.500000 or the QUAL=0? Is there anything I need to know when collecting the SNPs and INDELs?

11561   19  .   G   A,X 0   .   DP=2;I16=1,0,1,0,38,1444,38,1444,6,36,6,36,7,49,13,169;QS=0.500000,0.500000,0.000000,0.000000;RPB=9.668049e-01  PL  1,0,1,4,4,4


VCF BCF samtools variantcalling • 1.8k views
1
Entering edit mode

Hello,

the fragment you are showing is not a vcf. It is the output of samtools/bcftools mpileup. You need to do a bcftools callon it. See Devon's answer here.

fin swimmer

0
Entering edit mode

Hello thank you for the link, however I did the following:

samtools mpileup -f reference.fa aln.sorted.bam > aln.mpileup
samtools mpileup -uf reference.fa aln.sorted.bam > calls.bcf
bcftools view calls.bcf > call.vcf


I am using samtools version 0.1.19. I thought mpileup should look like this:

11561   49  a   8   ........    GGGGGG;G
11561   50  t   8   ........    GGGGFGGG
11561   51  g   8   ........    GGGGGGGG
11561   52  a   8   ........    GGGEGGGG
11561   53  g   8   ........    GGGGGGGG
11561   54  g   8   ........    GGGGGGGG
11561   55  a   9   ........^'. GGGGGGGGA
11561   56  a   9   .........   GGGGGGGGB
11561   57  a   9   .........   GGGGGGGGE
11561   58  c   9   .........   GGGEGGGGG
11561   59  c   9   ....G....   GGGEGGGGG
11561   60  g   10  .........^'.    GGGGGGGGG?
11561   61  g   10  ..........  GGGGGGGGGA

0
Entering edit mode

Hello cedrickclarido ,

samtools 0.1.19 is very, very old. Current version is 1.9. I strongly recommend to update. (The easiest way is to use conda.) In the current version samtools mpileup is deprecated and replaced by bcftools mpileup.

Beside this: In your code you never use aln.mpileup. Instead you do a second mpileup and forced the output to be a uncompressed bcf format. In the newer versions this is the standard (I believe the old output format your are showing isn't used anymore.) Your bcftools view command is then just converting the bcf version of the mpileup result to vcf. You are missing the bcftools call step.

fin swimmer

0
Entering edit mode

Hello, indeed the version is really old. However, as student I am limited to this version as I am working in a school's server where the tool is pre-installed. I can't seems to find any good manuals or documentations to convert pileup to bcf and bcf to vcf. As shown from my previous comment:

samtools mpileup -f reference.fa aln.sorted.bam > aln.mpileup
samtools mpileup -uf reference.fa aln.sorted.bam > calls.bcf
bcftools view calls.bcf > call.vcf


I used the first code to generate a PILEUP-file. I found online that without using the flag -u or -g, the command would generate a PILEUP file, otherwise it would create a BCF. According to the manual of version 0.1.19, the bcf file can be converted by using view option, where the input is on default BCF and output is VCF. Did I perhaps misused the command? The project also assigned us to convert the PILEUP to bcf using bcftools, however I can't find any commands that fits the criteria other than reusing the samtools mpileup again but without the -u/-g flags.

Thank you.

0
Entering edit mode

You can always request for the latest software to be installed on the school server. In all probability, it's already installed and has a mechanism to be loaded (such as module load) or the admin will install it in a couple of minutes. Working with outdated software puts you at a significant disadvantage.

0
Entering edit mode

I think your problem is understanding what is mpileup for. I will try to explain it:

Using samtools/bcftools for variant discovering is a 2-step process.

In the first step - called mpileup - samtools will generate information about each single position of the reference sequence. This contains how many reads cover this site, if the base on the position matches the reference and what's the quality for each base.

Based on this information bcftools calldecide in a second step on which position it will have a closer look if there is a variant. To decide if there is a variant more critera than only the number of reads that support the variant and the base quality are needed. For example the mapping quality, the position within the read, compare those metrics between read that support the reference and those supporting the variant, ...

So in a very basic pipeline your command should look like this:

$samtools mpileup -fu reference.fa aln.sorted.bam > aln.mpileup$ bcftools view -vcg  aln.mpileup >  call.vcf


In the current version of samtools/bcftools like this:

$bcftools mpileup -Ou -f reference.fa aln.sorted.bam | bcftools call -mv > call.vcf  fin swimmer ADD REPLY 1 Entering edit mode I'm not sure that's the right samtools command line. Don't you need -u? I'm pretty sure the regular eyeball-friendly pileup format is not what bcftools has ever wanted as input. ADD REPLY 1 Entering edit mode Yes, you're right. Fixed it. ADD REPLY 0 Entering edit mode It worked! thank you community! ADD REPLY 0 Entering edit mode Thank you the explanation sir, I tried the following and seems to get errors all the time: s1104230@Bioserver02:~/test2$ time samtools mpileup -f reference.fa aln.sorted.bam > aln2.mpileup
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000

real    19m30.006s
user    19m21.028s
sys 0m7.384s
s1104230@Bioserver02:~/test2$time bcftools -vcg aln2.mpileup > call2.vcf [main] Unrecognized command. real 0m0.001s user 0m0.000s sys 0m0.000s s1104230@Bioserver02:~/test2$ time bcftools view -vcg  aln2.mpileup >  call2.vcf
[bcf_sync] incorrect number of fields (0 != 5) at 0:0
[afs] 0:0.000

real    0m0.014s
user    0m0.000s
sys 0m0.000s

1
Entering edit mode

Hello cedrickclarido ,

again: Update the version!

I edited my commands as I missed the view in the bcftoolscommand. And as swbarnes2 says the output of mpileup must be in a bcfformat.

fin swimmer

0
Entering edit mode

I don't think it is easy to help when we don't have access to the extensive bcftools documentation for the version you're using. There might be changes in arguments between the old bcftools view caller and the current bcftools call caller that cannot be easily mapped. Please upgrade to the latest version. I don't think you'll need any special permissions to compile it from source.