I was recently given 4 paired-end .fastq files, each from a different strain of bacteria. I was told each .fastq file should have about 50 contigs and that each contig should have a length <200KB with read depths of about 40. My assignment is to 1) assemble each strain into its full circle and 2) perform a pairwise comparison between the four strains to discover insertions, deletions, and duplications between them. I have not worked with bacterial genomes before and have several questions that I hope to clear up before proceeding. I apologize in advance if they are basic.
Below is an example of the first four lines of a .fastq file:
@M06340:7:000000000-G3778:1:1101:16358:1365 1:N:0:NCAGTG
AATCCAGCTTTCAGTCTTTCCTATTACTTTTCAAATGATTGATAGAATT (this line continues for a total of ~150 characters)
+
CCCCCFFFFFFFGGGGGGGGGGHHHHHHHHHHHHHHHHHHHHHHH (this line continues for a total of ~150 characters)
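For reference, here is a minimal sketch of how the header line of the record above splits into fields. It assumes the header follows the Illumina CASAVA 1.8+ layout (`@instrument:run:flowcell:lane:tile:x:y read:filtered:control:index`); the function name `parse_illumina_header` is my own and not from any library. My understanding, which I have not verified, is that an instrument ID beginning with "M" often indicates a MiSeq.

```python
def parse_illumina_header(header):
    """Split a FASTQ header into named fields.

    Assumes the Illumina CASAVA 1.8+ layout:
    @instrument:run:flowcell:lane:tile:x:y read:filtered:control:index
    """
    # The read name and the comment are separated by the first space.
    name, _, comment = header.lstrip("@").partition(" ")
    name_keys = ["instrument", "run", "flowcell", "lane", "tile", "x", "y"]
    fields = dict(zip(name_keys, name.split(":")))
    if comment:
        comment_keys = ["read", "filtered", "control", "index"]
        fields.update(zip(comment_keys, comment.split(":")))
    return fields


header = "@M06340:7:000000000-G3778:1:1101:16358:1365 1:N:0:NCAGTG"
fields = parse_illumina_header(header)
print(fields["instrument"], fields["flowcell"], fields["read"])
```

If the header does not follow this layout, the field names will not line up, so the result should be treated as a hint rather than a definitive identification.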
1) Am I correct that I cannot verify the stated number of contigs (~50), the contig length (<200KB), and the read depth (~40) directly from the .fastq file? Each .fastq file has about 1 million lines (so about 250,000 reads), and each read has ~150 characters. If I cannot verify these parameters directly from the .fastq file, what is the simplest method to do so?
2) Is it possible to determine from the .fastq file what kind of sequencer was used to generate this data? (This is mostly to help me choose appropriate software for the next steps, if necessary.)
3) I was asked to “assemble each strain into its full circle”. I do not have access to any proprietary/paid software. However, I do have basic Linux skills and intermediate/advanced R skills. What software would be suggested given my file format above (.fastq)? I am hoping to find software that has helpful tutorials for people with no experience working with bacterial genomes (but this may not be available). Is it generally necessary to assemble into a “full circle”, or is it sufficient to simply assemble linearly (particularly given that I wish to determine insertions, deletions, and duplications in the next step)?
4) What is a very basic tool to determine insertions, deletions, and duplications between the assembled strains? Again, I do not have access to proprietary/paid software but have skills in Linux and R.
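Regarding question 1, this is a minimal sketch of the only check I know how to do on the raw reads: counting reads and bases, then dividing by a genome size to get a rough depth. It assumes plain four-line FASTQ records (no wrapped sequence lines), and the genome size is a value I would have to be given or guess, so `genome_size` is a hypothetical parameter here. As far as I understand, contig counts and lengths only exist after assembly, so this does not address those.

```python
def fastq_stats(lines):
    """Return (number of reads, total bases) from an iterable of FASTQ lines.

    Assumes each record is exactly four lines, with the sequence on
    the second line of every record.
    """
    n_reads = 0
    total_bases = 0
    for i, line in enumerate(lines):
        if i % 4 == 1:  # second line of each record holds the bases
            n_reads += 1
            total_bases += len(line.strip())
    return n_reads, total_bases


def estimated_depth(total_bases, genome_size):
    """Rough average depth: sequenced bases divided by assumed genome size."""
    return total_bases / genome_size


# Tiny in-memory example with two made-up 10-base reads.
demo = [
    "@read1", "AATCCAGCTT", "+", "CCCCCFFFFF",
    "@read2", "TCAGTCTTTC", "+", "FFFFFGGGGG",
]
n_reads, total_bases = fastq_stats(demo)
depth = estimated_depth(total_bases, genome_size=10)  # genome size is made up
print(n_reads, total_bases, depth)
```

For a real file, the same function could be called as `fastq_stats(open(path))`, where `path` points to one of the .fastq files; for paired-end data the bases from both mates would need to be summed before estimating depth.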