Question

Applications/Methods For Whole Genome Shotgun Assembly?

1

10.8 years ago

isa29 ▴ 10

(I have almost no experience with bioinformatics or biology in general prior to this summer; please excuse any gross abuses of terminology or general misunderstandings regarding the field)

I'm working with some FASTQ files for a project, about 40 gigabytes of them (17-18 Gb, 70-150 bp per sequence), and I suspect they're the result of shotgun sequencing, because there's no way the genome the files are supposed to represent is that large. If my understanding of shotgun sequencing is correct, this means that there's significant overlap between individual sequences which would allow for the sequences to be reconstructed into larger, contiguous sequences, dramatically reducing the size of the data and making it far easier to work with.
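One way to sanity-check the "shotgun" suspicion is to estimate average coverage, which is total sequenced bases divided by genome size. The sketch below is only illustrative: it assumes "17-18 Gb" means gigabases, and the 500 Mb genome size is a made-up placeholder, since the post does not name the organism.

```python
def estimated_coverage(total_bases: float, genome_size: float) -> float:
    """Average depth: how many times each genome position is sequenced on average."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


# Assumed figures, for illustration only.
total_bases = 17.5e9   # midpoint of "17-18 Gb", read as gigabases (assumption)
genome_size = 500e6    # hypothetical 500 Mb genome, not taken from the post

print(f"~{estimated_coverage(total_bases, genome_size):.0f}x coverage")
```

A result well above 1x means each position is covered by many overlapping reads, which is what an assembler relies on.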

So far, the only promising lead I've found is an application by the name of ARACHNE, which appears to be exactly what I'm looking for, except that I don't have a sufficiently powerful Linux machine at hand with the correct software installed (although it might be possible to rectify this if no other options present themselves).

Short version: How can I go about turning this giant pile of tiny sequences into a smaller pile of larger sequences?

fastq fasta assembly • 2.1k views

updated 4.2 years ago by Biostar 20 • written 10.8 years ago by isa29 ▴ 10

score 1 · Answer 1 · 2013-07-17

1

10.8 years ago

stolarek.ir ▴ 700

What technology was used for the sequencing? Are these single-end or paired-end reads? Does a reference genome exist that you could use for alignment and reference-guided assembly?

These are questions you need to answer for yourself first. Without some understanding of what you have, it's pretty much pointless to try to do anything.
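Some of these questions can be partly answered from the files themselves. Below is a minimal sketch, not a definitive tool, that tallies read lengths and looks for mate hints in the read names. It assumes plain four-line FASTQ records and the two common Illumina naming styles (names ending in `/1` or `/2`, or a ` 1:N:...` / ` 2:N:...` field after a space). `summarize_fastq` and `open_fastq` are hypothetical helper names.

```python
import gzip
import re
from collections import Counter
from itertools import islice

# Old-style Illumina names end in /1 or /2; Casava 1.8+ names carry
# " 1:N:" or " 2:N:" (read number, filter flag) after a space.
MATE_PATTERN = re.compile(r"(?:/([12])$)|(?:\s([12]):[YN]:)")


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


def summarize_fastq(lines, max_reads=100_000):
    """Return (read-length counts, mate-number counts) for the first max_reads records.

    Assumes each record is exactly four lines: header, sequence, '+', qualities.
    """
    lengths = Counter()
    mates = Counter()
    it = iter(lines)
    for _ in range(max_reads):
        record = list(islice(it, 4))
        if len(record) < 4:
            break  # end of input (or a truncated final record)
        header = record[0].rstrip("\n")
        seq = record[1].strip()
        if not header.startswith("@"):
            raise ValueError(f"unexpected FASTQ header: {header!r}")
        lengths[len(seq)] += 1
        m = MATE_PATTERN.search(header)
        mates[(m.group(1) or m.group(2)) if m else "unknown"] += 1
    return lengths, mates
```

Usage would look like `with open_fastq("reads.fastq.gz") as fh: print(summarize_fastq(fh))`, where the file name is a placeholder. Seeing both mate numbers `1` and `2` (in one file, or across a pair of files) suggests paired-end data; seeing only `1` does not settle the question, because single-end reads are often labelled that way too.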

Read up on sequence assembly (having overlapping reads does not by itself make it easy). And yes, you need some computational power to do the job.
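To make the overlap idea concrete, here is a toy greedy overlap-merge in Python. It is a teaching sketch only, not how real short-read assemblers work at scale: it compares every pair of reads, ignores sequencing errors and reverse complements, and will happily mis-join reads that overlap inside a repeat, which is one reason assembly is harder than it first looks. The function names are hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest exact overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            return contigs  # nothing left that overlaps by at least min_len
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]


reads = ["ATGGCGT", "GCGTACA", "ACATTGA"]
print(greedy_assemble(reads))  # ['ATGGCGTACATTGA']
```

The all-pairs comparison is quadratic in the number of reads, so this approach is hopeless for hundreds of millions of Illumina reads; it is only meant to show what "merging overlaps into contigs" means.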

0

Thanks for the reply. I believe they were sequenced with the Illumina HiSeq platform. I'm not sure what single-end and paired-end reads are, but I'll look into that. Same for the reference genome (I suspect there isn't one, though).

I'll continue reading up on sequence assembly, and see if I can convince IT to install the necessary software on one of the more powerful computers we've got.

Thanks!

10.8 years ago by isa29 ▴ 10

1

http://elements.eaglegenomics.com/

Here you have some tools used in bioinformatics, presented in an accessible way. Look at the assemblers (long-read or short-read; after some reading you will know which is best suited to your type of reads).

10.8 years ago by stolarek.ir ▴ 700