4.0 years ago
We are currently working on a project involving SNP analysis of the Mycobacterium tuberculosis genome, using a GATK/SAMtools pipeline.
We have downloaded SRA data (fastq files). We plan to run FastQC, but before that we wanted to know whether there is a way to gauge the quality of a file beforehand. For example, it was suggested to us that a larger file size indicates better-quality reads.
Are there any other indicators that we can see at face value, prior to running FastQC, that reflect the quality of the SRA data?
There is no correlation between quality and data size. A larger file means a larger number of reads, but that is about it.
While you have not asked, do you have to use GATK for bacteria? It is going to make your analysis more difficult.
Thanks for that! Are there any alternatives to GATK that are more suitable for bacteria?
Did you try snippy?
Agreed. To add to this, larger files can also simply mean longer reads, without the quality being any different. You will always have to download the data and perform QC yourself to get an idea of the quality.
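To illustrate why file size says little about quality, here is a minimal Python sketch that reports the three things that differ between files: read count, mean read length, and mean base quality. It is only a rough first look and not a substitute for FastQC. The function name `fastq_summary` is made up for this example, and the code assumes standard 4-line FASTQ records with Phred+33 quality encoding.

```python
import gzip


def fastq_summary(path):
    """Return read count, mean read length and mean Phred score of a FASTQ file.

    Assumptions: every record spans exactly 4 lines and qualities are Phred+33.
    """
    # Transparently handle gzip-compressed files based on the extension.
    opener = gzip.open if str(path).endswith(".gz") else open
    n_reads = 0
    n_bases = 0
    qual_sum = 0
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 3:  # fourth line of each record is the quality string
                qual = line.rstrip("\r\n")
                n_reads += 1
                n_bases += len(qual)
                # Phred+33: subtract 33 from the ASCII code of each character.
                qual_sum += sum(ord(c) - 33 for c in qual)
    return {
        "reads": n_reads,
        "mean_length": n_bases / n_reads if n_reads else 0.0,
        "mean_quality": qual_sum / n_bases if n_bases else 0.0,
    }
```

Calling `fastq_summary("sample_1.fastq.gz")` (a placeholder file name) on two downloads shows that a bigger file can have more or longer reads and still a lower mean quality. A single mean hides per-position quality drops and adapter content, which is why the full QC step is still needed.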