Hi,
I am interested in downloading aligned nucleotide sequences for humans and Neanderthals. I want two human sequences (e.g. French and San) and one Neanderthal sequence. I want to perform a test similar to one performed in "A Draft Sequence of the Neandertal Genome" by Green et al. for gene flow between humans and Neanderthals.
On page 130 of the supplementary material, they outline a test for gene flow where they compare states of SNPs for two human populations (French and San), Neanderthal and chimpanzee. On page 138 they have the frequencies of all possible allele patterns of matches and mismatches across the four populations. For example, the frequency of AAAA is the number of sites where all four populations are the same (e.g. the number of sites that are all A's, all G's, all C's or all T's). Likewise, ABBA (reading French, San, Neanderthal, chimpanzee from left to right) is the number of sites where French and chimpanzee are the same and San and Neanderthal are the same (e.g. AGGA, ACCA, ATTA, GAAG, ...).
These frequencies are almost what I want, but I want the individual base frequencies instead (e.g. AAAA, GGGG, CCCC, TTTT, AGGA, ACCA, ...). I would like frequencies of all of the 256 possible base combinations. If someone has done this already then that would be great. If not, then I will need to get the aligned sequences and do it myself.
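The tally described above can be sketched in Python as follows. This is only a minimal illustration, assuming the four sequences are already available as equal-length aligned strings; the function names and the toy sequences are hypothetical, and labelling the chimpanzee base as A when collapsing to the pattern classes is an assumption about the convention, not something taken from the paper.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"


def count_site_patterns(french, san, neandertal, chimp):
    """Tally each of the 4**4 = 256 base combinations across aligned sites."""
    # Start every combination at zero so unobserved patterns are still listed.
    counts = Counter({"".join(p): 0 for p in product(BASES, repeat=4)})
    for column in zip(french, san, neandertal, chimp):
        pattern = "".join(column).upper()
        if pattern in counts:  # skip sites with gaps, N, or other missing data
            counts[pattern] += 1
    return counts


def collapse(pattern):
    """Map a base pattern to its class, e.g. 'AGGA' -> 'ABBA'.

    Assumption: the last (chimpanzee) base is labelled 'A'; other bases
    get 'B', 'C', 'D' in order of first appearance from the left.
    """
    labels = {pattern[-1]: "A"}
    for base in pattern:
        if base not in labels:
            labels[base] = "ABCD"[len(labels)]
    return "".join(labels[b] for b in pattern)


if __name__ == "__main__":
    # Toy aligned sequences (made up); 'N' and '-' mark missing data.
    counts = count_site_patterns("AGTCN", "AGGC-", "AGGCA", "AGTCA")
    for pattern, n in counts.items():
        if n:
            print(pattern, collapse(pattern), n)
```

Summing the counts of all base patterns that collapse to the same class should recover the kind of pooled frequencies (AAAA, ABBA, ...) described above.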
If I have to download the data myself then I will need some help. I have spent hours attempting to find the relevant data, but it is all gobbledygook to me. My background is in mathematics and statistics and not bioinformatics. I have tried reading the manuals, but with little success. I need an ELI5 to obtain aligned nucleotide sequences for French, San and Neanderthals.
I suspect the data I am interested in is here: https://genome.ucsc.edu/Neandertal/. However, I don't know how to interpret it.
Other places where it might be are:
http://neandertal.ensemblgenomes.org/index.html
https://www.ebi.ac.uk/ena/data/view/PRJEB2065
https://www.eva.mpg.de/genetics/genome-projects/neandertal/index.html
Any help would be greatly appreciated. Thanks.
Edit: I've downloaded the BAM files from here:
ftp://ftp.ebi.ac.uk/pub/databases/ensembl/neandertal/BAM_files/
I also downloaded samtools and have worked out how to view the BAM files. How do I merge multiple BAM files so that they are aligned and then extract the nucleotide sequences?
Are you looking for sequencing reads or for an assembly? Since you mention BAM files below, I suspect reads, but this is not entirely clear.
BAM files are usually already aligned (except when they aren't)... but the answer is probably
samtools merge
Do you want to extract individual reads, or do you want a consensus sequence from these BAMs?
Thanks for the reply.
I don't think it really matters. I would like the entire genomes, but really I just need sites in the sequences that are approximately independent of each other. If I have the entire genomes then I can randomly select sites far enough apart on chromosomes. If I only have reads then I should still be able to do this.
By a consensus sequence, do you mean an "averaging" of the sequences? I don't want that. Basically, I want three aligned nucleotide sequences: one for each of the two human populations and one for Neanderthals, e.g.
AGTCTACTA...
AGGCTACTA...
AGGCAACTA...
Then I can select a subset of the sites. Missing data is fine. I don't need anything else (except maybe chromosome so I can choose which sites are roughly independent).
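Choosing sites that are far enough apart on each chromosome could look something like the sketch below. It assumes the candidate sites are available as (chromosome, position) pairs; the function name, the toy coordinates and the gap value are all made up for illustration. Note that this is a simple greedy left-to-right pass, not a random draw.

```python
def thin_sites(sites, min_gap):
    """Keep (chrom, pos) sites at least min_gap bases apart per chromosome.

    Greedy pass: walk the sites in sorted order and keep a site only if it
    is at least min_gap away from the last site kept on that chromosome.
    """
    kept = []
    last_kept = {}  # chromosome -> position of the most recently kept site
    for chrom, pos in sorted(sites):
        if chrom not in last_kept or pos - last_kept[chrom] >= min_gap:
            kept.append((chrom, pos))
            last_kept[chrom] = pos
    return kept


if __name__ == "__main__":
    # Hypothetical coordinates; a realistic min_gap would be much larger.
    candidates = [("chr1", 100), ("chr1", 150), ("chr1", 1200), ("chr2", 120)]
    print(thin_sites(candidates, min_gap=1000))
```

Sites on different chromosomes are never compared with each other, so only the chromosome name and position are needed alongside the bases.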