I am working with genomic data and I am trying to do GWAS analysis with a dataset where I compare two populations of a bird species.
I have a good reference genome (ordered by contig size) and resequence data from several individuals of 2 different populations.
I 1) trimmed the reads, 2) indexed the genome, 3) mapped the reads to the reference genome, 4) removed duplicates, 5) indexed the BAM files, 6) did the variant calling and filtering, 7) masked the repeats and 8) decomposed to SNPs etc. All of this is done with bash scripts, using vcftools and bioinfo-tools.
Now I am plotting nucleotide diversity, Tajima's D and FST, but as the genome is ordered by contig, the plots are quite meaningless. A colleague gave me a .csv file that assigns each contig to a chromosome.
In which part of my workflow can I use this .csv file? Can I do it in my plotting scripts in R, or does it have to be in an earlier step of my analysis? Is there standard code for this?
I am not a bioinformatician and so I try to understand every step of the process but sometimes I miss the point of what I am actually doing...
Really appreciate any help!
I am expecting to plot Tajima's D, FST and nucleotide diversity by chromosome instead of by contig.
As you suggested, you could order and orient your contigs to match the chromosomes, but you would be missing all the information at contig junctions, where you'd see a dip in your population genomic metrics.
Is the newer chromosome assembly an update of the existing one? If it is, then there isn't much of an issue. If it isn't, and it's from a differentiated population, then it may misrepresent your data. Ultimately, it would be best to rerun the analysis with the chromosome-level assembly.
Hi! Thank you for your reply!
I don't have a chromosome-level genome assembly (my reference genome is ordered by contig). What I have is a .csv file that states which chromosome each contig belongs to, but this .csv was built by someone else (I'm not sure of the methods; I guess by homology to a closely related species).
My feeling is that you need to know more about how the contigs were ordered and oriented to judge whether the file is trustworthy before using it.
If they used a different species to order the genome then you could be including lots of spurious genomic structural variation into your data.
If they have generated a pseudochromosome-level assembly (which is what it sounds like), and it's from the same strain or cultivar, then you could request it and use it as a reference genome.
The contig-level genome assembly and the chromosome allocation .csv file were both made by the same person in my unit, so I trust they correlate well. But we don't have a genome assembly at the chromosome level, because those two files were created years apart. My issue is that I don't know how to apply the chromosome allocation in the .csv file to my resequencing data to produce meaningful Manhattan plots. The plots I get are ordered by contig, which is quite useless...
You've just described a pseudo-chromosome assembly. If you order and orient contigs based on something like synteny to a chromosome-level assembly of a different population or similar species, that is your output. Tools like Chromosemble do exactly that. And it doesn't matter that they were made years apart.
I don't understand the problem then. Make a contiguous sequence by merging one contig end with the start of its neighbour according to the spreadsheet. Then you can generate a Manhattan plot as normal. Again, you'd be losing mapping at contig junctions, but if you don't have that many contigs it's unlikely to be that big an issue.
You still haven't explained how they got the chromosome orientation, though. If it was done using a different species, you may be incorporating structural variation among species into your data.
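If the goal is only plotting, the merging described above can also be done on coordinates rather than on sequence: each contig gets an offset within its chromosome, and every window or SNP position is shifted by that offset. Below is a minimal sketch in Python, under some loud assumptions: the column names `contig`, `chromosome`, `order` and `orientation` are hypothetical (your colleague's .csv may be laid out differently, and if it has no order or orientation columns the placement within a chromosome is arbitrary), and contig lengths are assumed to be available, for example from the genome's .fai index.

```python
import csv
import io


def build_offsets(allocation_rows, contig_lengths):
    """Return {contig: (chromosome, offset, length, orientation)}.

    allocation_rows: iterable of dicts with the HYPOTHETICAL keys
        'contig', 'chromosome', 'order', 'orientation' ('+' or '-').
    contig_lengths: {contig: length in bp}, e.g. read from the .fai index.
    """
    by_chrom = {}
    for row in allocation_rows:
        by_chrom.setdefault(row["chromosome"], []).append(row)

    placement = {}
    for chrom, rows in by_chrom.items():
        offset = 0
        # Lay contigs end to end in the order given by the spreadsheet.
        for row in sorted(rows, key=lambda r: int(r["order"])):
            length = contig_lengths[row["contig"]]
            placement[row["contig"]] = (chrom, offset, length, row["orientation"])
            offset += length
    return placement


def to_chrom_coords(contig, pos, placement):
    """Translate a 1-based contig position to (chromosome, 1-based position)."""
    chrom, offset, length, orientation = placement[contig]
    if orientation == "-":
        # Reverse-oriented contig: count from the other end.
        pos = length - pos + 1
    return chrom, offset + pos


# Toy data standing in for the real allocation file and contig lengths.
toy_csv = """contig,chromosome,order,orientation
ctg1,chr1,2,+
ctg2,chr1,1,-
ctg3,chr2,1,+
"""
toy_lengths = {"ctg1": 1000, "ctg2": 500, "ctg3": 800}

placement = build_offsets(csv.DictReader(io.StringIO(toy_csv)), toy_lengths)
print(to_chrom_coords("ctg1", 100, placement))  # ('chr1', 600)
```

The same translation could be applied to the contig and window-start columns of the per-window statistics tables before plotting, so nothing upstream has to be rerun. Two caveats in this sketch: contigs missing from the .csv raise a `KeyError` and would need to be dropped or plotted separately, and for reverse-oriented contigs a window's start becomes its end after translation.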