Hello community,
I think I run in something of a non-concordant result and I wanted to check out here for some advice. Starting at the SNP calling with ANGSD, I used the following script (notice I changed file names):
#!/bin/bash
#SBATCH --account my_account
#SBATCH -c 40
#SBATCH --mem 64g
#SBATCH --time 60:00:00
#SBATCH --output output.out
#SBATCH --error error.err
# Locations
REF=/home/my_user/my_account/10-19_Data/13_Refs/13.03_Geno/reference.fa
BAMLIST=/home/my_user/my_account/10-19_Data/16_Lists/bam-final-list.txt
OUTDIR=/home/my_user/my_account/10-19_Data/14_Out-data/14.03_SNP-calls
N_IND=`cat /home/my_user/my_account/10-19_Data/16_Lists/bam-final-list.txt | wc -l`
angsd \
-bam $BAMLIST \
-ref $REF \
-out $OUTDIR/minMapQ20minQ20_minInd145.25_setMinDepthInd2_setMinDepth150setMaxDepth600 \
-uniqueOnly 1 -remove_bads 1 -only_proper_pairs 1 -trim 0 -C 50 -baq 1 \
-minMapQ 20 -minQ 20 -minInd $((N_IND*1/4)) -setMinDepthInd 2 -setMinDepth 150 -setMaxDepth 600 \
-doCounts 1 -dumpCounts 2 \
-GL 1 -doGlf 2 \
-doMajorMinor 1 -doMaf 1 -SNP_pval 1e-6 -minMaf 0.05 -rmTriallelic 0.05 -doPost 1 -doGeno 8 \
-doIBS 1 -doCov 1 -makeMatrix 1 \
-nThreads 40
Forgive the long names, I should stop doing that and just create an index or README files.
After that I have my outputs, including a beagle file (with the maximum likelihood genotypes) and a pos file (the coordinates). These two files I use for estimating linkage disequilibrium (after removing headers). For that goal I used this script:
#!/bin/bash
#SBATCH --account my_account
#SBATCH -c 40
#SBATCH --mem 128g
#SBATCH --time 72:00:00
#SBATCH --output ngsLD_script.out
#SBATCH --error ngsLD_script.err
GENO=/home/my_user/my_account/10-19_Data/14_Out-data/14.04_Linkage-disequilibrium/minMapQ20minQ20_minInd145.25_setMinDepthInd2_setMinDepth150setMaxDepth600_geno.beagle.gz
POS=/home/my_user/my_account/10-19_Data/14_Out-data/14.04_Linkage-disequilibrium/minMapQ20minQ20_minInd145.25_setMinDepthInd2_setMinDepth150setMaxDepth600.pos.gz
OUTDIR=/home/my_user/my_account/10-19_Data/14_Out-data/14.04_Linkage-disequilibrium
ngsLD \
--geno $GENO \
--pos $POS \
--probs \
--n_ind 162 \
--n_sites 2106812 \
--max_kb_dist 100 \
--min_maf 0.05 \
--n_threads 40 \
--out $OUTDIR/ANGSD_SNPcalls.ld
And here is where I start to doubt the process. The output is enormous, perhaps there is one way to split this by chromosomes. But the one think that bothers me is that the number of SNPs is about 2 million, but when I check how many SNPs there are in one part of chromosome 1 (with this script, which is part of the code in https://github.com/nt246/lcwgs-guide-tutorial/tree/main/tutorial3_ld_popstructure):
#!/bin/bash
CHR=Chrom1
START_POS=4500
END_POS=5000000
TMP_FILE=`mktemp`
awk -v chr=$CHR -v min=$START_POS -v max=$END_POS 'BEGIN{print "snp1\tsnp2\tdist\tr2p\tD\tDp\tr2"} {split($1,pos1,":"); split($2,pos2,":")} pos1[1]==chr && pos1[2]>=min && pos1[2]<=max && pos2[1]==chr && pos2[2]>=min && pos2[2]<=max' | cut -f 1-7 > $TMP_FILE
N=$((`cat $TMP_FILE | wc -l`-1))
echo $N
By the way, I run this script like this in the terminal:
cat ANGSD_SNPcalls.ld | bash ../../../20-29_Scripts/25_ngsLD/SNP_number.sh
N was about 15 million, which exceeds the number of sites retained. I have little to no skill with awk (sorry for this), so perhaps is something in the code there I have not noticed.
Apologies for a rather lengthy post. I hope you have some input into this.
Thank you very much :)