Question: Determine sex from vcf file (or sequencing data)
2
gravatar for kulvait
2.9 years ago by
kulvait230
Czech Republic
kulvait230 wrote:

Hi, I have some mouse samples. I would like to determine sex to test for sample switch or corruption from sequencing data. I have vanila freebayes called vcf files. IĀ figured out that there is plugin to determine sex from vcf files

bcftools +vcf2sex

However it seems to require ploidy information for given genome(see for example this problem https://github.com/samtools/bcftools/issues/175). This information should be in the format

space/tab-delimited list of CHROM,FROM,TO,SEX,PLOIDY

I am using GRCm38.82 (http://ftp.ensembl.org/pub/release-82/gtf/mus_musculus/README) for read mapping and I cannot find that ploidy information for given genome release.

I would like to ask:

  1. Which is most reliable way to determine sex of the sample either from bam or vcf file?
  2. What does exactly ploidy information mean? Are those coordinates of pseudoautosomal regions? And why this information is required for vcf2sex to work properly?
  3. Where can I find that ploidy information for given genome release?
genotyping variant vcf • 3.7k views
ADD COMMENTlink modified 2.2 years ago by Biostar ♦♦ 20 • written 2.9 years ago by kulvait230
3
gravatar for kulvait
2.9 years ago by
kulvait230
Czech Republic
kulvait230 wrote:

I figured out that for the assemblies there are informations about pseudoautosomal regions available through Genome reference consortium. In case of my genome I had to read this http://ftp.ensembl.org/pub/release-82/fasta/mus_musculus/dna/README to understand that DNA sequence is GCA_000001635.6 which is GRCm38.p4 and then from http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/mouse/ I can figure out which are the coordinates of pseudoautosomal regions. From that I can assemble ploidy file that might be

X       1       169969759       M       1
X       170931300       171031299       M       1
Y       1       91744698        M       1
Y       1       91744698        F       0

However I figgured out the tool works better when using ploidy file of the form

X       1       169969759       M       1
Y       1       91744698        M       1
Y       1       91744698        F       0

When facing this you can also try using this repository https://github.com/pd3/bcftools and plugin called +guess-ploidy

ADD COMMENTlink written 2.9 years ago by kulvait230
5
gravatar for kulvait
2.9 years ago by
kulvait230
Czech Republic
kulvait230 wrote:

During my analysis I figured out that indicator of sex is also coverage of sex chromosomes. IĀ found out that I can guess sex from two tests. First test is weather the average coverage of chrX is greater than minimum of average coverages of autosomal chromosomes then the sample is probably female. Second test is weather the average coverage of chrY is 10 times smaller than average coverage of autosomal chromosomes then the sample is probably female. Third test is to use +vcf2sex or similar plugin. I am also posting my code to guess male/female samples (19 autosomal chromosomes) written in bash I expect that chromosomes are named (1, X, Y) not (chr1, chrX, chrY).

coveragestats - Script to print filename and coverage statistics from samtools indexed files

echo $*
samtools idxstats $* 2>/dev/null | awk '{ tmp=($3)/($2) ; printf"%s %0.5f\n", $1, tmp }' 2>/dev/null | head -21

autosomalmin - Script to print minimum average coverage of autosomal chromosomes

coveragestats $* | cut -d' ' -f2 | head -20 | tail -19 | awk 'NR == 1 || $1 < min {min = $1}END{printf "%f\n", min}'

xcoverage - Script to print average coverage of chrX

coveragestats $* | grep ^X | cut -d" " -f2

ycoverage - Script to print average coverage of chrY

coveragestats $* | grep ^Y | cut -d" " -f2

maleorfemale - Script to perform first two described sex tests

min=$(autosomalmin $*)
xcov=$(xcoverage $*)
ycov=$(ycoverage $*)
test=$(echo "$xcov"'>'"$min" | bc -l)
if [ $test -eq 1 ]
then
    echo "$*" is female
else
    echo "$*" is male
fi
test=$(echo "$ycov"'<('"$min"/10\) | bc -l)
if [ $test -eq 1 ]
then
        echo "$*" is female
else
        echo "$*" is male
fi

This worked for me and I have pretty low number of false detections.

ADD COMMENTlink written 2.9 years ago by kulvait230
2
gravatar for WouterDeCoster
2.9 years ago by
Belgium
WouterDeCoster38k wrote:

You want to take the pseudo autosomal regions into account on the X and Y chromosome (https://en.wikipedia.org/wiki/Pseudoautosomal_region). Best would be to check coverage outside of those regions.

You also want to discard reads with low mapping quality.

I once wrote a simple pysam funtion for this to compare the Y chromosome with chr22. You will have to adapt, I can help you if required. It becomes very clear which samples are male and which female. This example is for human data, the coordinates are for hg19. It just works on the (sorted and indexed) bam file.

    yreadcount = 0
    for read in workfile.fetch(region='chrY:2781479-56887902'):
            if not read.mapping_quality < 5 and not read.reference_end == None and not read.reference_start == None:
                    yreadcount += 1
    refreadcount = 0
    for read in workfile.fetch('chr22'):
            if not read.mapping_quality < 5 and not read.reference_end == None and not read.reference_start == None:
                    refreadcount += 1
    print("{} {} ({} vs. {})".format((yreadcount/float(refreadcount)), bam, yreadcount, refreadcount))
ADD COMMENTlink modified 2.9 years ago • written 2.9 years ago by WouterDeCoster38k
1
gravatar for Brice Sarver
2.9 years ago by
Brice Sarver2.6k
United States
Brice Sarver2.6k wrote:
  1. People generally look at coverage on sex chromosomes or sex-linked contigs. In eutherians, females won't have any coverage on the Y, and males will have approximately half of the coverage on the X relative to the autosomes (assuming WGS).
  2. It appears that vcf2sex requires you to specify the ploidy of each region and then uses heterozygous calls in these regions to determine sex.
  3. Ploidy is organism-specific. See #2. I would assume that you would specify the regions' ploidy based on the genetic structure (i.e., 2 for autosomes, etc.). I am not aware that this is released formally for genome builds, but it ought to be intuitive.

I wouldn't go through using a program to do this, though. It's somewhat unclear what vcf2sex really does. If you already have the BAM/VCF, I'd try to infer sex using differential coverage of sex chromosomes vs. autosomes.

ADD COMMENTlink written 2.9 years ago by Brice Sarver2.6k
3

As a validation step, I would first test the solutions on already sex-known samples to build an expectation to what the solutions would lead and their accuracy.

This step can also clarify what vcf2sex exactly does.

ADD REPLYlink modified 2.9 years ago • written 2.9 years ago by H.Hasani680

Thank you for your help but I don't see it very intuitive. While I need to find coordinates of autosomal regions on X chromosome in some refference. So how exactly would you find them for my build GRCm38.82 (http://ftp.ensembl.org/pub/release-82/gtf/mus_musculus/README)?

Your advice about coverage is probably the best I can do. I have exome seq data but I guess everything apply.

ADD REPLYlink written 2.9 years ago by kulvait230
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1319 users visited in the last hour