Question: Looking for Neanderthal genomes to download
severalorks wrote, 3.3 years ago:

EDIT: Since the data I'm looking for isn't available, my new question is: is it possible to concatenate the sequence pieces from a fasta file that lists pieces of the sequence? How do I interpret each part of the query template name in the fasta file? I assume one of the numbers at the end refers to the chromosome and the others refer to start/end positions relative to the entire genome. If I know the start/end positions, I can put the pieces in order, noting the gaps in between. For instance, for individual Sid1253, this is a query template name and the sequence associated with it:

>M_SOLEXA-GA02_JK_PE_SL49:2:91:375:1301
TGCTCAGGTGGAGTGAGGGGAAAATGTTTTCAGGTTGTATTAGTCAAAACAAAATA

OLD POST: I'm looking to download several (3 to 6) Neanderthal genomes which have been mapped to a human reference genome. The file format should be fasta. I've checked the Neanderthal Genome Project and found several bam files, which I converted to fasta. These are the links to them: ftp://ftp.ebi.ac.uk/pub/databases/ensembl/neandertal/BAM_files/ http://cdna.eva.mpg.de/neandertal/altai/AltaiNeandertal/bam/

However, the fasta files list each individual's genome as snippets (from my interpretation; I've only begun to work with fasta formats). I think that those pieces can be concatenated together to give entire chromosome sequences, but I'm not sure how to do that. So I'm looking for the entire, long genome. More specifically, I'm looking for the chromosome-level sequence for each Neanderthal individual, where regions that haven’t been sequenced are masked as N's.

My questions are: 1. Where can I find this data? 2. If this data isn't available in the desired format, is it possible to concatenate together the sequence pieces from those links? How do I interpret what each part of the query template name means in the fasta file?

Thanks.


How did you convert them to fasta (by generating a consensus from the BAM)? Someone else who knows more will comment but I doubt you are going to find chromosome-size fasta files for Neanderthal genomes.

Reply written 3.3 years ago by genomax

I used samtools then seqtk

Reply written 3.3 years ago by severalorks

This previous Biostars post will take you to some files available on ENA (Study: ERP000119) with the draft sequence of the Neandertal genome (over 3 billion nucleotides) from three individuals. Would it be of any use for what you are after?

Reply written 3.3 years ago by Denise - Open Targets

Thanks, I'll take a look at it

Reply written 3.3 years ago by severalorks

Since the data I'm looking for isn't available, my new question is: is it possible to concatenate the sequence pieces from a fasta file that lists pieces of the sequence? How do I interpret each part of the query template name in the fasta file? I assume one of the numbers at the end refers to the chromosome and the others refer to start/end positions relative to the entire genome.

The simple answer to the new question you posted after the edit is no.

I am reasonably certain that the example header you posted is a fasta version of a standard Illumina fastq read header. The original fastq headers have a specific meaning: they record where on the flowcell the sequence originated (the lane, the tile, and the x,y location of the cluster within that tile). None of the numbers is a chromosome or a genomic coordinate.
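To make the header layout concrete, here is a minimal Python sketch that splits a read name of this older Illumina style into its fields. It assumes the five-field layout instrument:lane:tile:x:y; the `ReadName` type and `parse_read_name` function are made-up names for illustration, not part of any library.

```python
from typing import NamedTuple


class ReadName(NamedTuple):
    instrument: str  # machine/run identifier
    lane: int        # flowcell lane
    tile: int        # tile within the lane
    x: int           # cluster x coordinate within the tile
    y: int           # cluster y coordinate within the tile


def parse_read_name(header: str) -> ReadName:
    """Split an old-style Illumina read name (assumed layout) into five fields."""
    # Drop the leading '>' (fasta) or '@' (fastq) and any trailing description.
    name = header.lstrip(">@").split()[0]
    # Drop an optional index/pair suffix such as '#0/1'.
    name = name.split("#")[0].split("/")[0]
    # Split from the right so the instrument field is kept whole.
    instrument, lane, tile, x, y = name.rsplit(":", 4)
    return ReadName(instrument, int(lane), int(tile), int(x), int(y))


print(parse_read_name(">M_SOLEXA-GA02_JK_PE_SL49:2:91:375:1301"))
```

For the header in the question this gives lane 2, tile 91, x 375, y 1301, which are positions on the flowcell and say nothing about where the read maps on the genome.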

You may be able to get a consensus sequence from the BAM files that you have found (see: Generate consensus from BAM file), though that may not be the correct thing to do (otherwise the people who generated the sequence would have done that).
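As a rough illustration of what a consensus with N-masked gaps means, here is a toy Python sketch. It is not how samtools or any real consensus caller works: it assumes gapless alignments given as (0-based start, sequence) pairs, and it ignores indels, base qualities, mapping qualities and ancient DNA damage. The `consensus` function and its `min_depth` parameter are made up for this example.

```python
from collections import Counter


def consensus(ref_len, alignments, min_depth=1):
    """Majority-vote consensus over gapless alignments.

    alignments: iterable of (start, seq), start is 0-based on the reference.
    Positions covered by fewer than min_depth bases are reported as 'N'.
    """
    counts = [Counter() for _ in range(ref_len)]
    for start, seq in alignments:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < ref_len and base != "N":
                counts[pos][base] += 1
    out = []
    for c in counts:
        if not c or sum(c.values()) < min_depth:
            out.append("N")  # uncovered or too shallow: mask
        else:
            out.append(c.most_common(1)[0][0])  # most frequent base
    return "".join(out)


# Three toy reads on a 12 bp "chromosome"
reads = [(2, "ACGT"), (4, "GTTA"), (4, "GATA")]
print(consensus(12, reads))               # NNACGTTANNNN
print(consensus(12, reads, min_depth=2))  # NNNNGTTANNNN
```

With low-coverage ancient DNA, most positions would end up as N or be called from a single, possibly damaged, read, which is one reason a naive consensus like this may not be the right thing to do.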

Reply written 3.3 years ago by genomax (modified 3.3 years ago)
Gabriel R. (Center for Geogenetik, Københavns Universitet) wrote, 3.3 years ago:

Hey severalorks,

I think something is perhaps not clear. When you have modern DNA, you can get long contiguous strands and you can assemble contigs, which are then further assembled into scaffolds using mate-pairs, for instance. The reason you can do this is that the DNA is still contiguous.

With ancient DNA, the original contiguous strands have broken down into millions of very short DNA fragments. If you are lucky, you might have an average fragment length of 60-70 bp; if you are unlucky, you get 30-40 bp. The best thing you can do is simply map them to a close enough reference genome and infer SNPs from those.

When we say we sequenced the Neanderthal genome, we didn't in the classical sense: we just retrieved DNA fragments from fossils and mapped them to the human reference. I guess you could always infer the sequence using the SNPs, but then how do you handle:

  • linked SNPs
  • any genome rearrangements, even smaller ones

I think you should go back to the biological question at hand. My guess is that the SNPs could provide you sufficient answers.

Hope this helps :-)


You mention that short fragments make assembly impossible, but didn't people do genome assembly using early Illumina sequencing when the reads were 35 bp?

Reply written 3.3 years ago by igor

You could get contigs, but to make these contigs into proper scaffolds, you need mate-pair reads with various insert sizes (e.g. 2 kb, 5 kb, etc.).

Reply written 3.3 years ago by Gabriel R.
severalorks wrote, 3.3 years ago:

Thanks for the responses, everyone; they've helped my understanding of the topic. Instead of concatenating the contigs together, I'm now looking to just find which locations (start and end positions) on the human reference genome hg19 each contig is aligned to. Has this been done before, and where can I find this data? I haven't been able to extract this information from the files I've looked at so far.

EDIT: Genome Browser shows the start/end positions when loading from a bam file, which meant the data I sought was in the bam file all along. It turns out I should use bedtools to convert the bam to .bed to obtain the start/end positions. The .bed file does not contain the sequence, but it does contain the Illumina fastq header for each contig. This means I can look up the contig sequence in the .fasta file, and then use the fastq header as a bridge to find the sequence's corresponding start/end position.
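The "header as a bridge" lookup can be sketched in Python roughly as follows. It assumes a tab-separated BED file with the read name in column 4 (as written by `bedtools bamtobed`) and a FASTA file whose headers carry the same names; `read_fasta` and `join_bed_fasta` are made-up helper names, and the chr1 coordinates in the demo are invented purely for illustration. Two caveats to keep in mind: BED start positions are 0-based, and for paired-end data bamtobed appends /1 or /2 to the name, so the names may need trimming before they match.

```python
import io


def read_fasta(handle):
    """Return {read_name: sequence} from a FASTA file handle."""
    seqs, name, parts = {}, None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                seqs[name] = "".join(parts)
            name, parts = line[1:].split()[0], []
        elif line and name is not None:
            parts.append(line)
    if name is not None:
        seqs[name] = "".join(parts)
    return seqs


def join_bed_fasta(bed_handle, seqs):
    """Yield (chrom, start, end, strand, name, sequence) for each BED line."""
    for line in bed_handle:
        if not line.strip():
            continue
        f = line.rstrip("\n").split("\t")
        chrom, start, end, name = f[0], int(f[1]), int(f[2]), f[3]
        strand = f[5] if len(f) > 5 else "."
        # sequence is None when the read name is not found in the FASTA
        yield chrom, start, end, strand, name, seqs.get(name)


# Demo with in-memory files; the BED coordinates below are invented.
name = "M_SOLEXA-GA02_JK_PE_SL49:2:91:375:1301"
fasta = io.StringIO(
    ">" + name + "\nTGCTCAGGTGGAGTGAGGGGAAAATGTTTTCAGGTTGTATTAGTCAAAACAAAATA\n"
)
bed = io.StringIO("chr1\t1000\t1056\t" + name + "\t37\t+\n")
for row in join_bed_fasta(bed, read_fasta(fasta)):
    print(row)
```

For real files, replace the `io.StringIO` objects with `open("reads.fa")` and `open("reads.bed")` (file names assumed). For reads on the minus strand, the FASTA sequence may be the reverse complement of the reference strand, depending on how the FASTA was produced.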
