I received 8 gzipped Illumina sequencing files (18G each when gzipped) and was asked to report variants. The person who sequenced them said they were all from the same person, but I didn't get much more information than that. Judging by the way the files are named and the information contained in them, they are paired-end FASTQ files.
My question is this: is there a good way to compare the files to determine whether they are from the same person? If they are, should I merge them into a single BAM once I have aligned them? Finally, is there a good way to determine which regions are covered by each FASTQ, assuming they all cover different areas?
I am relatively new to bioinformatics and would greatly appreciate any suggestions!
While you can align the data, call variants, and then compare the results, there would always be some niggling doubt left.
You really should go back and ask the person providing the data for more concrete information. Having undefined provenance does not help with analysis or the analyst.
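If nobody can tell you more, the read headers themselves carry a little provenance. Here is a minimal sketch, assuming the files use the Casava 1.8+ Illumina header layout (`@instrument:run:flowcell:lane:tile:x:y read:filtered:control:index`); older pipelines and SRA-reformatted files look different. The function names are my own, not from any library. It only tells you which instrument, flowcell, and lane a file came from and whether it is read 1 or read 2. It says nothing about who the sample is.

```python
import gzip


def parse_illumina_header(header):
    """Split a Casava 1.8+ style FASTQ header into named fields.

    Assumes the layout '@instrument:run:flowcell:lane:tile:x:y read:...';
    raises ValueError for anything else.
    """
    if not header.startswith("@"):
        raise ValueError("not a FASTQ header line")
    left, _, right = header[1:].strip().partition(" ")
    fields = left.split(":")
    if len(fields) != 7:
        raise ValueError("unexpected header layout: " + header.strip())
    instrument, run, flowcell, lane = fields[:4]
    return {
        "instrument": instrument,
        "run": run,
        "flowcell": flowcell,
        "lane": lane,
        # The part after the space starts with the read number (1 or 2).
        "read": right.split(":")[0] if right else None,
    }


def first_header(path):
    """Parse the header of the first read in a gzipped FASTQ file."""
    with gzip.open(path, "rt") as handle:
        return parse_illumina_header(handle.readline())
```

Calling `first_header()` on each of the 8 files and grouping by flowcell and lane should show which files are R1/R2 mates and whether they came off the same run.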
Thanks for your response! I asked the person, but the sequencing was completed before their time and this data was pulled from an archive in their lab. Perhaps getting all the way to GVCFs is the best way to tell. Thanks again.
Wow, I had never heard of gVCF until today. Broad sure loves to cause trouble :)
Anyway - it's true that once you call variants, you can tell with high confidence whether two datasets came from the same person. But it's not trivial: to be accurate, it requires calibration by testing with various individuals with different degrees of relatedness. Still, if you happen to end up with 99% identical variants between the different VCFs, once you exclude variants common to that ethnicity, you could guess that they're the same person and you'd probably be right. Not something I would use as the basis for further research without calibration, though.
Edit - it would not surprise me if someone has written a tool to compare VCFs and give a probability that they came from the same individual, so you might want to look around for something like that... particularly, if you restrict your VCF to the subset that is used for things like forensics and paternity testing, I imagine that most of your work would already be done.
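To make the "compare the VCFs" idea concrete, here is a rough sketch of a genotype concordance check. It assumes single-sample, uncompressed-or-gzipped VCFs with a `GT` field, and the function names are hypothetical, not taken from an existing tool. It only looks at sites called in both files, does not exclude common variants, and gives a raw fraction rather than a probability, so the calibration caveat above still applies in full.

```python
import gzip


def read_genotypes(lines):
    """Map (chrom, pos, ref, alt) -> sorted allele tuple from single-sample VCF lines."""
    calls = {}
    for line in lines:
        if line.startswith("#"):
            continue  # skip meta and header lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 10:
            continue  # no sample column
        fmt = cols[8].split(":")
        if "GT" not in fmt:
            continue
        gt = cols[9].split(":")[fmt.index("GT")]
        # Treat phased (|) and unphased (/) genotypes alike.
        alleles = gt.replace("|", "/").split("/")
        if "." in alleles:
            continue  # missing call
        calls[(cols[0], cols[1], cols[3], cols[4])] = tuple(sorted(alleles))
    return calls


def concordance(calls_a, calls_b):
    """Return (number of shared sites, fraction with identical genotype)."""
    shared = calls_a.keys() & calls_b.keys()
    if not shared:
        return 0, None
    same = sum(1 for site in shared if calls_a[site] == calls_b[site])
    return len(shared), same / len(shared)


def open_vcf(path):
    """Open a plain or gzipped VCF as text."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)
```

Usage would be something like `concordance(read_genotypes(open_vcf(a)), read_genotypes(open_vcf(b)))` for each pair of files. Keep in mind that unrelated people also share a lot of common variants, so a high number on its own does not prove anything.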