Question

Assemble bacterial .fastq files and find differences in insertions, deletions, and duplications

4.7 years ago

LRStar ▴ 200

I was recently given 4 paired-end .fastq files, each from a different strain of bacteria. I was told each .fastq file should have about 50 contigs and that each contig should have a length <200KB with read depths of about 40. My assignment is to 1) assemble each strain into its full circle and 2) perform a pairwise comparison between the four strains to discover insertions, deletions, and duplications between them. I have not worked with bacterial genomes before, and there are several points I hope to clear up before proceeding. I apologize in advance if they are basic.

Below is an example of the first four lines of a .fastq file:

@M06340:7:000000000-G3778:1:1101:16358:1365 1:N:0:NCAGTG
AATCCAGCTTTCAGTCTTTCCTATTACTTTTCAAATGATTGATAGAATT (this line continues for a total of ~150 characters)
+
CCCCCFFFFFFFGGGGGGGGGGHHHHHHHHHHHHHHHHHHHHHHH (this line continues for a total of ~150 characters)

1) Am I correct that I cannot verify the stated number of contigs (~50), the average contig length (<200KB), and the read depth (~40) directly from the .fastq file? The .fastq files have about 1 million lines (so about 250,000 reads), and each read has ~150 characters. If I cannot verify these parameters directly from the .fastq file, what is the simplest method to do so?

2) Is it possible for me to determine what kind of sequencer was used to generate this data based on the .fastq file? (This is mostly to assist me in determining appropriate software to use for the next steps if necessary).

3) I was asked to “assemble each strain into its full circle”. I do not have access to any proprietary/paid software. However, I do have basic Linux skills and intermediate/advanced R skills. What software would be suggested given my file format above (.fastq)? I am hoping to find software that has helpful tutorials for those without experience working with bacterial genomes (but this may not be available). Is it generally necessary to assemble into a “full circle”, or is it sufficient to simply assemble linearly (particularly given that I wish to determine insertions, deletions, and duplications in the next step)?

4) What is a very basic tool to determine insertions, deletions, and duplications between the assembled strains? Again, I do not have access to proprietary/paid software but have skills in Linux and R.

Assembly insertions deletions bacteria • 1.2k views

ADD COMMENT • link updated 4.7 years ago by swbarnes2 14k • written 4.7 years ago by LRStar ▴ 200

score 3 · Accepted Answer · 2019-08-26

4.7 years ago

swbarnes2 14k

What you have been given to do is not trivial with short reads. Assembling into contigs can be done, but assembling the contigs together is not trivial, because short reads make it hard to work out repetitive regions.

1) How many contigs a fastq can assemble to depends on the parameters given to the algorithm you choose to use. It's not some simple feature that you can determine by inspecting the fastq file.

2) It doesn't matter much, but you can tell from the read name that this data is from an Illumina MiSeq (the instrument ID at the start of the name, M06340, begins with M).

3) I'd try SPAdes or Velvet to make contigs. I'm not sure what the preferred software for joining those contigs into a complete genome is today.
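
To make points 1 and 2 concrete, here is a minimal Python sketch of what can be read straight from a FASTQ file: the fields of the read name, the number of reads, and the total number of bases. Read depth can only be estimated from this by dividing by a genome size, and that size is an assumption you have to supply (the `2_000_000` in the usage note is a hypothetical placeholder, not a property of this data). The number of contigs cannot be obtained this way at all; it only exists after an assembler has run. The function names are made up for illustration.

```python
import io


def parse_illumina_header(header):
    """Split an Illumina-style read name into its colon-separated fields."""
    name, _, comment = header.strip().lstrip("@").partition(" ")
    keys = ["instrument", "run", "flowcell", "lane", "tile", "x", "y"]
    return dict(zip(keys, name.split(":"))), comment


def fastq_summary(handle):
    """Count reads and bases in a FASTQ stream (assumes 4 lines per record)."""
    n_reads = 0
    n_bases = 0
    for i, line in enumerate(handle):
        if i % 4 == 1:  # second line of each record is the sequence
            n_reads += 1
            n_bases += len(line.strip())
    return n_reads, n_bases


def estimated_depth(n_bases, genome_size):
    """Average depth = total sequenced bases / assumed genome size."""
    return n_bases / genome_size


# Small in-memory demonstration using the read name from the question.
demo = "@M06340:7:000000000-G3778:1:1101:16358:1365 1:N:0:NCAGTG\nAATCCAGC\n+\nCCCCCFFF\n"
fields, comment = parse_illumina_header(demo.splitlines()[0])
reads, bases = fastq_summary(io.StringIO(demo))
print(fields["instrument"], reads, bases)
```

For a real file you would pass `open("reads.fastq")` instead of the `io.StringIO` object, sum the bases over both files of a pair, and then call something like `estimated_depth(total_bases, 2_000_000)` with whatever genome size you expect for your organism.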

ADD COMMENT • link 4.7 years ago by swbarnes2 14k

4) MUMmer does whole genome alignments and should be able to answer this question. Haven't tried haploclique, but that could work as well.
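
As a rough sketch of how the pairwise MUMmer comparison could be organized, the Python below builds one command per pair of strains (six pairs for four strains). It assumes MUMmer is installed and uses `dnadiff`, a wrapper script shipped with MUMmer that runs `nucmer` and writes a summary report for each comparison; that choice, the strain names, and the file paths are all assumptions for illustration, not something taken from this thread.

```python
import itertools
import shutil
import subprocess

# Hypothetical assembly paths, one FASTA of contigs per strain.
ASSEMBLIES = {
    "strainA": "strainA/contigs.fasta",
    "strainB": "strainB/contigs.fasta",
    "strainC": "strainC/contigs.fasta",
    "strainD": "strainD/contigs.fasta",
}


def pairwise_commands(assemblies):
    """Return one dnadiff command (as an argument list) per pair of strains."""
    commands = []
    for (ref, ref_fa), (qry, qry_fa) in itertools.combinations(
        sorted(assemblies.items()), 2
    ):
        prefix = f"{ref}_vs_{qry}"  # output files are named after this prefix
        commands.append(["dnadiff", "-p", prefix, ref_fa, qry_fa])
    return commands


def run_all(commands):
    """Run the commands, but only if dnadiff is actually on the PATH."""
    if shutil.which("dnadiff") is None:
        raise RuntimeError("dnadiff not found; is MUMmer installed?")
    for cmd in commands:
        subprocess.run(cmd, check=True)


for cmd in pairwise_commands(ASSEMBLIES):
    print(" ".join(cmd))
```

Calling `run_all(pairwise_commands(ASSEMBLIES))` would execute the comparisons. Keep in mind that with draft assemblies made of many contigs, some reported differences may reflect contig breaks rather than real insertions or deletions.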

swbarnes2 already gave you good suggestions for other questions. Links to SPAdes (which I recommend if you are not pressed for time and memory) and Velvet.
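
For the SPAdes step, a minimal sketch of how one strain might be assembled and then checked against the numbers quoted in the question (contig count, contig lengths). It assumes each strain has separate forward and reverse files; the file names, output directory, and helper names are hypothetical. `spades.py` with `-1`, `-2`, `-o`, and `-t` is the basic paired-end invocation, and SPAdes writes its contigs to `contigs.fasta` in the output directory.

```python
import shutil
import subprocess


def spades_command(r1, r2, outdir, threads=4):
    """Build a basic paired-end SPAdes command as an argument list."""
    return ["spades.py", "-1", r1, "-2", r2, "-o", outdir, "-t", str(threads)]


def contig_lengths(fasta_lines):
    """Return the length of every sequence in a FASTA given as lines."""
    lengths = []
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            lengths.append(0)  # start a new contig
        elif line and lengths:
            lengths[-1] += len(line)
    return lengths


def n50(lengths):
    """Length L such that contigs of length >= L hold at least half the bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def assemble(r1, r2, outdir):
    """Run SPAdes if it is installed, then summarize the resulting contigs."""
    if shutil.which("spades.py") is None:
        raise RuntimeError("spades.py not found on PATH")
    subprocess.run(spades_command(r1, r2, outdir), check=True)
    with open(f"{outdir}/contigs.fasta") as handle:
        lengths = contig_lengths(handle)
    return len(lengths), max(lengths), n50(lengths)


print(" ".join(spades_command("strainA_R1.fastq", "strainA_R2.fastq", "strainA")))
```

Calling `assemble("strainA_R1.fastq", "strainA_R2.fastq", "strainA")` would return the number of contigs, the longest contig, and the N50, which is the simplest way to compare an assembly against the "about 50 contigs, each <200KB" description.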

ADD REPLY • link 4.7 years ago by Mensur Dlakic ★ 27k