Question: How do I download nucleotide sequence data for Neanderthals and humans?
9 days ago, jonathanmitchell88 wrote:

Hi,

I am interested in downloading aligned nucleotide sequences for humans and Neanderthals. I want two human sequences (e.g. French and San) and one Neanderthal sequence. I want to perform a test similar to one performed in "A Draft Sequence of the Neandertal Genome" by Green et al. for gene flow between humans and Neanderthals.

On page 130 of the supplementary material, they outline a test for gene flow where they compare states of SNPs for two human populations (French and San), Neanderthal and chimpanzee. On page 138 they have the frequencies of all possible allele patterns of matches and mismatches across the four populations. For example, the frequency of AAAA is the number of sites where all four populations are the same (e.g. the number of sites that are all A's, all G's, all C's or all T's). Likewise, ABBA (reading French, San, Neanderthal, chimpanzee from left to right) is the number of sites where French and chimpanzee are the same and San and Neanderthal are the same (e.g. AGGA, ACCA, ATTA, GAAG, ...).

These frequencies are almost what I want, but I want the individual base frequencies instead (e.g. AAAA, GGGG, CCCC, TTTT, AGGA, ACCA, ...). I would like frequencies of all of the 256 possible base combinations. If someone has done this already then that would be great. If not, then I will need to get the aligned sequences and do it myself.
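Once four aligned sequences are in hand, the counting itself is short. The following is a minimal Python sketch, not the method used by Green et al.: it assumes the four sequences are uppercase strings that share the same coordinates, it skips any column containing a non-ACGT symbol, and the function names `count_site_patterns` and `collapse` are hypothetical. The `collapse` helper labels the chimpanzee base as A, which is an assumption chosen so that AGGA and GAAG both map to ABBA as described above.

```python
from collections import Counter

BASES = "ACGT"

def count_site_patterns(french, san, neanderthal, chimp):
    """Count each 4-base column, read as French, San, Neanderthal, chimp."""
    counts = Counter()
    for column in zip(french, san, neanderthal, chimp):
        # Skip columns with gaps, N, or any other non-ACGT symbol.
        if all(base in BASES for base in column):
            counts["".join(column)] += 1
    return counts

def collapse(pattern):
    """Relabel bases starting from the chimp (last) base, e.g. AGGA -> ABBA."""
    labels = {}
    for base in reversed(pattern):
        if base not in labels:
            labels[base] = "ABCD"[len(labels)]
    return "".join(labels[base] for base in pattern)

# Toy alignment (made-up data, six columns, one with missing data).
french      = "AAGCNT"
san         = "AGACAC"
neanderthal = "AGGCAC"
chimp       = "AAACAT"

base_counts = count_site_patterns(french, san, neanderthal, chimp)

# Collapse the up-to-256 base patterns into match/mismatch classes.
class_counts = Counter()
for pattern, n in base_counts.items():
    class_counts[collapse(pattern)] += n
```

`base_counts` holds the individual base patterns (at most 256 keys), while `class_counts` reproduces the coarser AAAA/ABBA/BABA-style classes for comparison with the published table.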

If I have to download the data myself then I will need some help. I have spent hours attempting to find the relevant data, but it is all gobbledygook to me. My background is in mathematics and statistics and not bioinformatics. I have tried reading the manuals, but with little success. I need an ELI5 to obtain aligned nucleotide sequences for French, San and Neanderthals.

I suspect the data I am interested in is here: https://genome.ucsc.edu/Neandertal/. However, I don't know how to interpret it.

Other places where it might be are:

http://neandertal.ensemblgenomes.org/index.html

https://www.ebi.ac.uk/ena/data/view/PRJEB2065

https://www.eva.mpg.de/genetics/genome-projects/neandertal/index.html

Any help would be greatly appreciated. Thanks.

Edit: I've downloaded the BAM files from here:

ftp://ftp.ebi.ac.uk/pub/databases/ensembl/neandertal/BAM_files/

I also downloaded samtools and have worked out how to view the BAM files. How do I merge multiple BAM files so that they are aligned and then extract the nucleotide sequences?


> I am interested in downloading aligned nucleotide sequences for humans and Neanderthals. I want two human sequences (e.g. French and San) and one Neanderthal sequence.

Are you looking for sequencing reads or for an assembly? Since you mention BAM files further down, I suspect reads, but this is not entirely clear.

> How do I merge multiple BAM files so that they are aligned

BAM files are usually aligned (except when they aren't), but the answer is probably samtools merge.
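For reference, a merge can be scripted from Python by calling the `samtools merge` and `samtools index` subcommands. This is only a sketch under assumptions: it presumes samtools is on the PATH, that every input BAM is coordinate-sorted and was aligned to the same reference, and the file names and the helper names `samtools_merge_command` and `merge_bams` are hypothetical.

```python
import subprocess

def samtools_merge_command(output_bam, input_bams):
    """Build the argument list for: samtools merge OUT.bam IN1.bam IN2.bam ..."""
    if len(input_bams) < 2:
        raise ValueError("need at least two BAM files to merge")
    return ["samtools", "merge", output_bam, *input_bams]

def merge_bams(output_bam, input_bams):
    """Merge the inputs into one BAM, then index the merged file."""
    subprocess.run(samtools_merge_command(output_bam, input_bams), check=True)
    subprocess.run(["samtools", "index", output_bam], check=True)

# Hypothetical file names, for illustration only.
command = samtools_merge_command("merged.bam", ["part1.bam", "part2.bam"])
```

Merging only pools reads that already share a coordinate system; it does not align one sample against another.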

> and then extract the nucleotide sequences?

Do you want to extract individual reads, or do you want a consensus sequence from these BAM files?

Reply written 9 days ago by WouterDeCoster

Thanks for the reply.

> Are you looking for sequencing reads or for an assembly? Since you mention BAM files further down, I suspect reads, but this is not entirely clear.

I don't think it really matters. I would like the entire genomes, but really I just need sites in the sequences that are approximately independent of each other. If I have the entire genomes then I can randomly select sites far enough apart on chromosomes. If I only have reads then I should still be able to do this.

> Do you want to extract individual reads, or do you want a consensus sequence from these BAM files?

By a consensus sequence, do you mean an "averaging" of the sequences? I don't want that. Basically, I want three aligned nucleotide sequences for the two human populations and Neanderthals. e.g.

AGTCTACTA...

AGGCTACTA...

AGGCAACTA...

Then I can select a subset of the sites. Missing data is fine. I don't need anything else (except maybe chromosome so I can choose which sites are roughly independent).
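The site selection described here could be sketched as follows. Note that this is a deterministic greedy thinning rather than a random draw, the 1000 bp spacing is an arbitrary assumption and not a recommended value, and the function name `thin_sites` and the example coordinates are hypothetical.

```python
def thin_sites(sites, min_distance):
    """Keep a site only if it lies at least min_distance bp after the
    previously kept site on the same chromosome."""
    kept = []
    last_kept = {}  # chromosome -> position of the most recently kept site
    for chrom, pos in sorted(sites):
        previous = last_kept.get(chrom)
        if previous is None or pos - previous >= min_distance:
            kept.append((chrom, pos))
            last_kept[chrom] = pos
    return kept

# Made-up (chromosome, position) pairs.
sites = [("chr1", 100), ("chr1", 150), ("chr1", 1200),
         ("chr2", 120), ("chr1", 2300)]
thinned = thin_sites(sites, min_distance=1000)
```

Sites on different chromosomes are never compared with each other, so each chromosome is thinned independently.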

Reply modified 9 days ago, written 9 days ago by jonathanmitchell88