Find ~1 Mb regions of the genome that are shared by two or more WGS samples?
12 weeks ago

Hello,

I have 12 WGS samples that are part of a big pedigree, and I need to pinpoint genomic regions of roughly 1 Mbp that are identical between some, many, or all of them. I've never done analyses like these before, and it's difficult to Google because I don't know the words people use for these things. I believe "linkage analysis" is close, but that seems to be more about judging whether two SNVs are in disequilibrium or not. But isn't that what phasing is?

Ahh. Please point a finger in the right direction! I usually manage to figure things out once I get started, but I'm not even sure where to start here. :-]

pedigree linkage genetics

I think synteny is probably the term you are looking for... it refers to the conserved order of aligned genomic blocks between species.


If, e.g., the 12 WGS samples are the same species, synteny would probably overcomplicate things... it may make sense to look for regions of "low nucleotide diversity": https://www.york.ac.uk/res/dasmahapatra/teaching/MBiol_sequence_analysis/workshop4_2019.html#nucleotide_diversity


The tool pixy might also be a good way to analyze this, as an alternative to that tutorial: https://pixy.readthedocs.io/en/latest/plotting.html


Hey, that's perhaps exactly what's needed, hmm... so I would identify the identical 1 Mbp regions by clear drops in the pi statistic?
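For intuition, the drop-in-pi idea can be sketched in plain Python. This is a toy stand-in, not the pixy method: the data, window size, and cutoff are made up, and "pi" here is just the average number of pairwise differences per window (real per-site pi would also normalise by window length and handle missing data, which is exactly what pixy does for you).

```python
# Toy sketch: per-window average pairwise differences across samples, with
# low-scoring windows flagged as candidate "shared" regions. Assumption:
# a position absent from a sample's dict matches the reference.
from itertools import combinations

WINDOW = 1_000_000  # 1 Mbp tiles

# Made-up calls for one chromosome: {sample: {position: alt allele}}
calls = {
    "s1": {120: "A", 2_500_000: "T"},
    "s2": {120: "A", 2_500_000: "T"},
    "s3": {120: "G", 1_800_000: "C", 2_500_000: "A"},
}

def pi_per_window(calls, chrom_len):
    """Average pairwise differences per 1 Mbp window, over all sample pairs."""
    n_windows = chrom_len // WINDOW + 1
    diffs = [0] * n_windows
    for a, b in combinations(calls, 2):
        for pos in set(calls[a]) | set(calls[b]):
            if calls[a].get(pos, "ref") != calls[b].get(pos, "ref"):
                diffs[pos // WINDOW] += 1
    n_pairs = len(calls) * (len(calls) - 1) // 2
    return [d / n_pairs for d in diffs]

pi = pi_per_window(calls, chrom_len=3_000_000)
low = [i for i, p in enumerate(pi) if p < 0.5]  # arbitrary demo cutoff
print(pi, low)
```

With this toy data the last window has no differences at all, so it is the one flagged as a candidate identical region.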


Assuming your data is assembled, use the programs mentioned in this thread: What is the best tool to do a dot-plot alignment of two chromosomes?


Thanks for your reply, Genomax! My data is not assembled, I'm afraid; it's currently in fastq, bam, and vcf formats. I've heard of a term, "IBD" (Identity By Descent); do you think this is what I'm looking for? It seems promising.


I assume your reads aren't 1 Mb in size, so they would need to be assembled/aligned first.


Absolutely, but don't you think there's another way? I doubt assembling every WGS sample is the most efficient approach!


There are a number of software packages for IBD listed in the Wikipedia entry: https://en.wikipedia.org/wiki/Identity_by_descent You can see if any will take unassembled reads.

How related are these samples (and are they human)? I ask because:

The expected number of IBD segments decreases with the number of generations since the common ancestor at this locus.


They are all human and related within three or four generations. Thank you very much for that list, I will pick out a few and try!


Assemble one genome and map the reads of the other samples against it with, e.g., Bowtie2. Then use bedtools multiinter to extract the shared regions that meet your threshold.
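The core of the `bedtools multiinter` step above is "find the sub-regions covered by at least N samples". Here is a hedged Python stand-in for that logic, just to make the idea concrete: the sample names and coordinates are invented, and in practice you would feed real BED files to bedtools rather than roll your own.

```python
# Sweep-line sketch of the multiinter idea: given one list of (start, end)
# intervals per sample on a single chromosome, report the sub-regions
# covered by at least `min_samples` of them.
def shared_regions(intervals_by_sample, min_samples):
    events = []
    for ivs in intervals_by_sample.values():
        for start, end in ivs:
            events.append((start, 1))   # coverage depth rises at a start
            events.append((end, -1))    # and falls at an end
    events.sort()
    shared, depth, region_start = [], 0, None
    for pos, delta in events:
        prev = depth
        depth += delta
        if prev < min_samples <= depth:
            region_start = pos                   # depth reaches threshold
        elif prev >= min_samples > depth:
            shared.append((region_start, pos))   # depth drops below it
    return shared

# Made-up covered intervals for three samples:
covered = {
    "s1": [(0, 3_000_000)],
    "s2": [(1_000_000, 4_000_000)],
    "s3": [(1_500_000, 2_500_000)],
}
regions = shared_regions(covered, min_samples=2)
print(regions)
```

With these toy intervals, the region from 1 Mb to 3 Mb is covered by at least two samples, which is what the sweep reports.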


Hi Heikki, that's an interesting approach, I will return to it if the IBD programs don't work out. Thank you!


Hello again 5heikki, I'm beginning to suspect that the IBD software packages I've found are meant for much larger sample sizes than 12, so that all the statistics can be calculated properly. That means I should try your approach instead. I will do so. Have you done it before? I imagine you have. Did it work well?

Also, all samples are human. Is assembly still necessary? And how would I choose the sample for assembly? The one with the best coverage? Hmm...

And uh, if I choose one sample to assemble, I will essentially pretend that sample was haploid, no? There must be another way... hmm. Perhaps I can just use multiinter on all my bam files?


Hi, I thought you were working with bacterial genomes. Sorry, I can't really help here.


No worries m8, thanks for trying!


Hey Genomax, I have fastq and bam; I could create assemblies of all samples and dotplot them manually... but surely that's insane? Also, while assembling I would collapse the data to a haploid representation(?). Hmm...


What kind of reads do you have? If it's short reads... welp, good luck with the phasing.

Why don't you use GATK to call variants and then group them up in R or Python based on megabase windows manually?


I am calling variants in all samples with GATK as we speak, but what do you mean by "group up on megabase windows"?


I would do this in R myself with GRanges. Essentially, convert the coordinates to something like chr3:3842934_G>A, then run a sliding window (width = 1 Mb) across the genome (if that's rough computationally, you could try tiling instead). Build a table of the variants in each window and find the individuals that share, say, 95% of the SNPs in that window. You might have to think about which filters and variant allele frequency cutoffs you want to use, and whether 95% is too strict or too lenient.
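The steps above can be sketched in Python (the thread mentions either language works). This is a rough translation, not the GRanges version: it uses simple 1 Mb tiles rather than sliding windows, the variant strings and sample names are invented, and the overlap is measured as a Jaccard index of the two per-window variant sets, with 95% as the illustrative cutoff. Individuals with no calls in a window are skipped here; real data would need genotype-aware handling of reference-matching sites.

```python
# Tile variants of the form "chr3:3842934_G>A" into 1 Mb windows, then find
# pairs of individuals whose per-window SNP sets overlap >= 95% (Jaccard).
from collections import defaultdict
from itertools import combinations

WINDOW = 1_000_000

# Made-up variant calls per individual:
variants = {
    "ind1": {"chr3:120_G>A", "chr3:500000_C>T", "chr3:1200000_A>G"},
    "ind2": {"chr3:120_G>A", "chr3:500000_C>T", "chr3:2200000_T>C"},
    "ind3": {"chr3:999_C>G"},
}

def window_of(variant):
    """Map 'chrom:pos_ref>alt' to its (chrom, tile index)."""
    chrom, rest = variant.split(":")
    pos = int(rest.split("_")[0])
    return chrom, pos // WINDOW

def shared_windows(variants, cutoff=0.95):
    """Per window, list the individual pairs whose variant sets overlap
    at least `cutoff` (Jaccard index)."""
    per_window = defaultdict(lambda: defaultdict(set))
    for ind, vs in variants.items():
        for v in vs:
            per_window[window_of(v)][ind].add(v)
    hits = {}
    for win, by_ind in per_window.items():
        pairs = []
        for a, b in combinations(sorted(by_ind), 2):
            jac = len(by_ind[a] & by_ind[b]) / len(by_ind[a] | by_ind[b])
            if jac >= cutoff:
                pairs.append((a, b))
        if pairs:
            hits[win] = pairs
    return hits

hits = shared_windows(variants)
print(hits)
```

On this toy input, only the first 1 Mb tile of chr3 has a pair (ind1 and ind2) sharing all their SNPs, so it is the sole candidate window reported.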


This sounds to me like a straightforward comparison of SNPs - I could do it in e.g. Python too, right? Or is GRanges doing some magic behind the scenes? Also, I suspect this analysis is included in Pixy: https://pixy.readthedocs.io/en/latest/plotting.html

One of the calculated statistics is variation within a population - surely it is based on comparing SNPs, like you suggest?

Thank you for your time and help!


Yeah, you could do it with any language. Using Bioconductor packages, it would probably take me 2-3 hours in R with GRanges/VariantAnnotation to do the whole thing; if you're not familiar with R, though, I would not suggest that route.


Understood. Thank you Ben! :)