Hello everyone,
I need some help/ideas in calculating average depth per SNP across samples. I have already generated depth files for each sample (around 18000 samples in total) using samtools depth input.bam > output.txt and I have merged all the output.txt's into 1 txt.gz. The input file is sorted by position and it looks like this.
sample1   17   123456   1000
sample2   17   123456   500
sample3   17   123456   10
sampleN   17   123456   50
sample1   17   123457   1100
sampleN   17   123457   50
so first column contains sample names second column is chromosome number third column is the snp position and the last column is depth. I am just wondering if there is quick way to get average depth per SNP across samples as my input file is huge (around 38gb)? or if there is an alternate way?
Looking forward to your suggestions and ideas.
Regards,
38gb is really huge to read by any language (unless you employ some database techniques). The best option I see is using bedtools groupby function. Get the script ready with sample data and run it on your original file and take some rest.
yes I have already sorted the file based on chromosome/position. do you recommend splitting per chromosome? even if I do it I think it will still be quite large as I have 18k samples in there.
if its already sorted groupBy will help, except it might take long time. Try it.