4.8 years ago
severalorks ▴ 110

EDIT: Since the data I'm looking for isn't available, my new question is whether it's possible to concatenate the sequence pieces listed in a fasta file into one sequence. How do I interpret each part of the query template name in the fasta file? I assume one of the numbers at the end refers to the chromosome and the others refer to start/end positions relative to the entire genome. If I know the start/end positions, I can put the pieces in order, noting the gaps in between. For instance, for individual Sid1253, this is one query template name and the sequence associated with it:

>M_SOLEXA-GA02_JK_PE_SL49:2:91:375:1301
TGCTCAGGTGGAGTGAGGGGAAAATGTTTTCAGGTTGTATTAGTCAAAACAAAATA


OLD POST: I'm looking to download several (3 to 6) Neanderthal genomes which have been mapped to a human reference genome. The file format should be fasta. I've checked the Neanderthal Genome Project and found several bam files, which I converted to fasta. These are the links to them: ftp://ftp.ebi.ac.uk/pub/databases/ensembl/neandertal/BAM_files/ http://cdna.eva.mpg.de/neandertal/altai/AltaiNeandertal/bam/

However, the fasta files list each individual's genome as snippets (from my interpretation; I've only begun to work with fasta formats). I think that those pieces can be concatenated together to give entire chromosome sequences, but I'm not sure how to do that. So I'm looking for the entire, long genome. More specifically, I'm looking for the chromosome-level sequence for each Neanderthal individual, where regions that haven’t been sequenced are masked as N's.

My questions are:
1. Where can I find this data?
2. If this data isn't available in the desired format, is it possible to concatenate together the sequence pieces from those links? How do I interpret what each part of the query template name means in the fasta file?

Thanks.

neanderthal neanderthal genome fasta • 2.8k views

How did you convert them to fasta (by generating a consensus from the BAM)? Someone else who knows more will comment but I doubt you are going to find chromosome-size fasta files for Neanderthal genomes.


I used samtools then seqtk


This previous Biostars post will take you to some files available on ENA (Study: ERP000119) with the draft sequence of the Neandertal genome (over 3 billion nucleotides) from three individuals. Would it be of any use for what you are after?


Thanks, I'll take a look at it


Since the data I'm looking for isn't available, my new question is if it's possible to concatenate together the sequence pieces from a fasta file that lists pieces of the sequence? How do I interpret what each part of the query template name means in the fasta file? I assume one of the number at the end refers to chromosome and other refer to start/end positions relative to the entire genome.

The simple answer to the new question you posted after the edit is no.

I am reasonably certain that the example you posted is a fasta-format version of a standard Illumina fastq read, header included. The fields in the original fastq header have a specific meaning: they give the position of the cluster on the flowcell (the lane, the tile, and the x,y location within it) where the sequence originated. None of them refers to a chromosome or a genome coordinate.
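To make the point about the header concrete, here is a minimal Python sketch that splits a read name like the one in the question into its parts. It assumes the older Illumina naming scheme (`instrument:lane:tile:x:y`, optionally followed by `#index` and `/1` or `/2`); the function and class names are made up for illustration, and newer Illumina headers have more fields and would need different handling.

```python
from typing import NamedTuple


class IlluminaReadName(NamedTuple):
    instrument: str  # sequencer / run identifier
    lane: int        # flowcell lane
    tile: int        # tile within the lane
    x: int           # x coordinate of the cluster on the tile
    y: int           # y coordinate of the cluster on the tile


def parse_old_illumina_name(name: str) -> IlluminaReadName:
    """Split an old-style Illumina read name into its five fields.

    Assumes the layout instrument:lane:tile:x:y (hypothetical helper,
    not part of any library).
    """
    # Drop the fasta/fastq marker and anything after the first whitespace.
    name = name.lstrip(">@").split()[0]
    # Drop an optional "#index" suffix (this also removes a trailing /1 or /2).
    name = name.split("#")[0]
    # Drop an optional pair suffix when there was no "#index" part.
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    # The last four colon-separated fields are numeric; the rest is the instrument.
    instrument, lane, tile, x, y = name.rsplit(":", 4)
    return IlluminaReadName(instrument, int(lane), int(tile), int(x), int(y))


parsed = parse_old_illumina_name(">M_SOLEXA-GA02_JK_PE_SL49:2:91:375:1301")
print(parsed)
```

Under that assumption the example read comes from lane 2, tile 91, at cluster position (375, 1301), which says where the read sat on the flowcell and nothing about where it maps in the genome.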

You may be able to generate a consensus sequence from the BAM files you have (see: Generate consensus from BAM file), though that may not be the correct thing to do (otherwise the people who generated the data would presumably have done it already).
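For intuition only, this is roughly what a naive consensus with `N` for uncovered positions looks like, written as a small pure-Python sketch on toy data. It assumes each read has already been placed at a 0-based start position on the reference without gaps; the function name and the `min_depth` parameter are invented for the example. It ignores indels, base and mapping qualities, and the damage patterns typical of ancient DNA, so it is not a substitute for a proper consensus or genotype caller.

```python
from collections import Counter


def consensus_with_gaps(length, placements, min_depth=1):
    """Majority-vote consensus over ungapped read placements.

    length     -- length of the reference sequence (e.g. one chromosome)
    placements -- iterable of (start, seq), start being 0-based
    min_depth  -- positions covered by fewer reads than this become 'N'
    """
    # One counter of observed bases per reference position.
    columns = [Counter() for _ in range(length)]
    for start, seq in placements:
        for offset, base in enumerate(seq.upper()):
            pos = start + offset
            if 0 <= pos < length and base in "ACGT":
                columns[pos][base] += 1

    consensus = []
    for counts in columns:
        depth = sum(counts.values())
        if depth == 0 or depth < min_depth:
            consensus.append("N")  # not sequenced (or too shallow)
        else:
            consensus.append(counts.most_common(1)[0][0])
    return "".join(consensus)


# Toy example: three short reads on a 12 bp "chromosome".
toy_reads = [(2, "ACGT"), (4, "GTTA"), (4, "CTTA")]
toy_consensus = consensus_with_gaps(12, toy_reads)
print(toy_consensus)  # NNACGTTANNNN
```

Raising `min_depth` masks thinly covered positions as well, which matters for low-coverage ancient samples where a single read is weak evidence.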

4.8 years ago
Gabriel R. ★ 2.8k

Hey severalorks,

I think something is perhaps not clear. With modern DNA, you can get long contiguous strands, assemble the reads into contigs, and then assemble the contigs into scaffolds using mate-pairs, for instance. You can do this because the DNA is still contiguous.

With ancient DNA, the original contiguous strands have broken down into millions of very short DNA fragments. If you are lucky, you might have an average fragment length of 60-70bp; if you are unlucky, you get 30-40bp. The best thing you can do is simply map them to a close enough reference genome and infer SNPs from those.

When we say we sequenced the Neanderthal genome, we didn't in the classical sense: we just retrieved DNA fragments from fossils and mapped them to the human reference. I guess you could always infer the sequence using the SNPs, but then how do you handle:

• any genome rearrangements, even smaller ones

I think you should go back to the biological question at hand. My guess is that the SNPs could provide you sufficient answers.

Hope this helps :-)


You mention that short fragments make assembly impossible, but didn't people do genome assembly with early Illumina sequencing, when the reads were 35bp?


You could get contigs, but to turn these contigs into proper scaffolds, you need mate-pair reads with various insert sizes (e.g. 2kb, 5kb, etc.).

4.8 years ago
severalorks ▴ 110

Thanks for the responses, everyone, it's helped my understanding of the topic. Instead of concatenating the contigs together, I'm now looking to just find which locations (start and end positions) on the human reference genome hg19 each contig is aligned to. Was this done before, and where can I find this data? I haven't been able to extract this information from the files I've looked at so far.

EDIT: Genome Browser shows the start/end positions when loading from a bam file, which means the data I sought was in the bam all along. It turns out I can use Bedtools to convert the bam to .bed to obtain the start/end positions. The .bed file does not contain the sequence, but it does contain the Illumina fastq header for each contig. This means I can look up the contig sequence in the .fasta file, and use the fastq header as a bridge to find that sequence's corresponding start/end position.
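In case it helps anyone doing the same thing, here is a minimal Python sketch of that bridge: it reads the sequences from fasta lines, reads chromosome/start/end/name from BED lines, and joins the two on the read name. It assumes a tab-separated BED with the read name in the fourth column and names that match the fasta headers exactly; the function names and the coordinates in the toy BED line are made up for illustration.

```python
def read_fasta(lines):
    """Return {read_name: sequence} from an iterable of fasta lines."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # name is the text up to the first space
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def read_bed(lines):
    """Yield (name, chrom, start, end) from tab-separated BED lines."""
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.rstrip("\n").split("\t")
        # BED starts are 0-based and ends are exclusive.
        yield fields[3], fields[0], int(fields[1]), int(fields[2])


def locate_sequences(fasta_lines, bed_lines):
    """Join BED intervals to fasta sequences on the read name."""
    sequences = read_fasta(fasta_lines)
    located = []
    for name, chrom, start, end in read_bed(bed_lines):
        if name in sequences:
            located.append((chrom, start, end, name, sequences[name]))
    # Sort by chromosome and position so the gaps between pieces are easy to see.
    return sorted(located)


# Toy input: the read from the question plus a HYPOTHETICAL alignment position.
fasta_lines = [
    ">M_SOLEXA-GA02_JK_PE_SL49:2:91:375:1301",
    "TGCTCAGGTGGAGTGAGGGGAAAATGTTTTCAGGTTGTATTAGTCAAAACAAAATA",
]
bed_lines = ["chr1\t1000\t1056\tM_SOLEXA-GA02_JK_PE_SL49:2:91:375:1301\t37\t+\n"]

located = locate_sequences(fasta_lines, bed_lines)
for chrom, start, end, name, seq in located:
    print(chrom, start, end, name, seq)
```

Two caveats I am aware of: paired reads can share a name (or carry `/1` and `/2` suffixes in the BED but not in the fasta), and reads on the reverse strand may be stored in a different orientation in the two files, so the join should be checked against a few known reads before trusting it.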