I used jellyfish to count kmers and the obtained histograms were analyzed using Genome scope (http://qb.cshl.edu/genomescope/) in order to estimate effective genome size. However the resulting bp estimation of 61,351 bps is unexpectedly small for human data. The paired-end data was obtained from a Chip-seq experiment on Geo (GSE72141). Additionally, the model only converged on the input fastq files not the actual chip files. Any thoughts and/or suggestions are welcome.
My paired end processing workflow of raw data from GEO consisted of trimming with bbduk and filtering for minlen=36 (readlength) mapping to hg38 with bowtie2 PE removing duplicates with samtools rmdup filtering out unmapped and MAPQ<10 with samtools -F 4 -q 10 name sorting with samtools sort -n and converting back to fastq with samtools fastq bam>-1.fq.gz -2.fq.gz
I used this code for counting kmers
jellyfish count -t 30 \ -C -m 31 \ -s 5G \ -o "$line"_out \ --min-qual-char=? <(zcat "$Folder"/"$line"_1.fastq.gz) <(zcat "$Folder"/"$line"_2.fastq.gz) done<"$INPUT"
and this for histo generation
for file in /align/jelly/*_out; do jellyfish histo -t 10 "$file" >"$file".histo; done
I uploaded the obtained plots into genome scope to estimate genome size with kmer size=31 read length =36 max kmer coverage =1000
Results of which are:
GenomeScope version 1.0 k = 31 property min max Heterozygosity 2.14024% 2.22375% Genome Haploid Length 59,018 bp 61,351 bp Genome Repeat Length 29,755 bp 30,931 bp Genome Unique Length 29,263 bp 30,420 bp Model Fit 90.075% 94.9174% Read Error Rate 11.7095% 11.7095%