Compare Fastq files
7.3 years ago

I received 8 gzipped Illumina sequencing files (18G each when gzipped) and was asked to report variants. The person who sequenced them said that they were all from the same person, but I didn't get much more information than that. Judging by the file names and the information contained in the files, they are paired-end FASTQ files.
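One quick sanity check before any alignment is to tabulate the read headers in each file. The sketch below assumes the files use the Casava 1.8+ style Illumina header (`@instrument:run:flowcell:lane:tile:x:y read:filtered:control:index`); older files use a different layout and would need a different parser. The function names and the `max_reads` parameter are made up for illustration. Matching instrument, run, flowcell, and lane values show which files belong to the same run and which are R1/R2 mates, but they say nothing about whether the reads come from the same person.

```python
import gzip
from collections import Counter


def parse_illumina_header(line):
    """Split a Casava 1.8+ style read header into a few of its fields."""
    name, _, comment = line.rstrip("\n").lstrip("@").partition(" ")
    fields = name.split(":")
    if len(fields) < 7:
        raise ValueError("not a Casava 1.8+ style header: " + line.strip())
    instrument, run, flowcell, lane = fields[:4]
    # The comment part starts with the read number (1 or 2) for paired-end data.
    read = comment.split(":")[0] if comment else None
    return {
        "instrument": instrument,
        "run": run,
        "flowcell": flowcell,
        "lane": lane,
        "read": read,
    }


def summarize_fastq(path, max_reads=10000):
    """Count (instrument, run, flowcell, lane, read) combinations in the first reads."""
    counts = Counter()
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i // 4 >= max_reads:
                break
            if i % 4 == 0:  # header line of each 4-line FASTQ record
                h = parse_illumina_header(line)
                key = (h["instrument"], h["run"], h["flowcell"], h["lane"], h["read"])
                counts[key] += 1
    return counts
```

Running `summarize_fastq` on each of the 8 files and comparing the keys would show, for example, whether the files are 4 lanes of R1/R2 pairs from a single flowcell or come from several separate runs.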

My question is this: is there a good way to compare the files to determine whether they are from the same person? If they are, should I merge them into a single BAM once I have aligned them? Finally, is there a good way to determine what regions are covered by each FASTQ file, assuming they all cover different areas?
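For the coverage part of the question, one possible approach after alignment is to collapse per-position depth into covered intervals. The sketch below assumes tab-separated input of the form chromosome, 1-based position, depth, which is what `samtools depth` prints for a single sorted BAM. The function name and the `min_depth` threshold are illustrative choices, not anything prescribed by the tool.

```python
def covered_intervals(depth_lines, min_depth=1):
    """Merge consecutive positions with depth >= min_depth into intervals.

    Returns a list of (chrom, start, end) tuples, 1-based and inclusive.
    Input is assumed to be sorted by chromosome and position.
    """
    intervals = []
    current = None  # [chrom, start, end] of the interval being extended
    for line in depth_lines:
        if not line.strip():
            continue
        chrom, pos, depth = line.split("\t")[:3]
        pos, depth = int(pos), int(depth)
        if depth < min_depth:
            continue  # the gap closes the interval at the next covered position
        if current and current[0] == chrom and pos == current[2] + 1:
            current[2] = pos  # extend the open interval by one base
        else:
            if current:
                intervals.append(tuple(current))
            current = [chrom, pos, pos]
    if current:
        intervals.append(tuple(current))
    return intervals
```

Comparing the interval lists from each alignment would show whether the files target the same regions or different ones. For whole-genome data the per-position output is very large, so a dedicated interval tool is likely a better fit there.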

I am relatively new to bioinformatics and would greatly appreciate any suggestions!

sequence alignment NGS Bam • 2.2k views

While you can align the data, call variants, and then compare the results, there would always be some niggling doubt left.

You really should go back and ask the person providing the data for more concrete information. Having undefined provenance does not help with analysis or the analyst.


Thanks for your response! I asked the person, but the sequencing was completed before their time and this data was pulled from an archive in their lab. Perhaps getting all the way to GVCFs is the best way to tell. Thanks again.


Wow, I had never heard of gVCF until today. Broad sure loves to cause trouble :)

Anyway - it's true that once you call variants, you can tell with high confidence whether two datasets came from the same person. But it's not trivial, and to be accurate it requires calibration by testing with various individuals with different degrees of relatedness. Still, if you happen to end up with 99% identical variants between the different VCFs, once you exclude variants common to that ethnicity, you could guess that they're the same person and you'd probably be right. Not something I would use as the basis for further research without calibration, though.
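A rough version of that comparison can be sketched as a genotype concordance count between two single-sample VCFs. This is only an illustration: the function names are invented, only the first sample column is read, only sites present in both files with the same REF and ALT are compared, and nothing here filters out common variants or does the calibration described above. The resulting fraction is therefore a crude indicator, not a probability of identity.

```python
def read_genotypes(vcf_lines):
    """Map (chrom, pos, ref, alt) to the unphased genotype of the first sample."""
    genotypes = {}
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 10:
            continue  # no FORMAT/sample columns on this line
        fmt = cols[8].split(":")
        if "GT" not in fmt:
            continue
        gt = cols[9].split(":")[fmt.index("GT")].replace("|", "/")
        if "." in gt:
            continue  # skip missing calls
        # Sort alleles so that 0/1 and 1/0 count as the same genotype.
        genotypes[(cols[0], cols[1], cols[3], cols[4])] = tuple(sorted(gt.split("/")))
    return genotypes


def concordance(genotypes_a, genotypes_b):
    """Return (fraction of matching genotypes, number of shared sites)."""
    shared = genotypes_a.keys() & genotypes_b.keys()
    if not shared:
        return None, 0
    matches = sum(genotypes_a[site] == genotypes_b[site] for site in shared)
    return matches / len(shared), len(shared)
```

A site missing from one VCF could be homozygous reference or simply uncovered in that dataset, which is one reason the fraction needs careful interpretation when the files cover different regions.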

Edit - it would not surprise me if someone has written a tool to compare VCFs and give a probability that they came from the same individual, so you might want to look around for something like that... particularly, if you restrict your VCF to the subset that is used for things like forensics and paternity testing, I imagine that most of your work would already be done.
