Hello, I am having headaches with this sample.
I am performing WES to call SNPs and CNVs. All was good until I wanted to show the coverage by gene or fragment. Comparing my results with others, I found no coverage in the region I wanted, unlike the results I take as the "gold standard" (they are from a bioinformatics company).
I asked them what was happening, and they told me that the mistake may come from the step before the alignment with BWA-MEM.
The reads are 100 bp, and the adapters are CTGTCTCTTATACACATCT+AGATGTGTATAAGAGACAG. Judging by the fastq files, the instrument is a NovaSeq. However, only this sequence appears in the fastq as the adapter: AGATGTGTATAAGAGACAG
Furthermore, my FastQC results report no adapter sequences (and the other sections are OK).
However, the per-base sequence content looks like this:
Another question: is there a samqualfilter tool, or is it proprietary to the company? I googled it and can't find anything.
Please, if anyone knows how to proceed with the trimming step on WES (NovaSeq) to "smooth" this graph, and the optimal parameters, I would be very grateful.
Thanks
That feature you see at the beginning of the reads likely comes from the tagmentation step used in creating the libraries. You should not need to smoothen that out, since that sequence should be valid and will align fine. If that is the case, trying to attribute it to the choice of aligner is a bit of a stretch. If you have no coverage in that region then there is not much you can do. You may also want to see if your reads are multi-mapping (is that region present multiple times in the genome, or does it have repeats?).
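A quick way to look for multi-mapping is to check the mapping quality (MAPQ) of the alignments in the region of interest, since BWA-MEM typically gives MAPQ 0 to reads that align equally well in more than one place. The sketch below is only an illustration: it assumes you have extracted the region as SAM text (for example with `samtools view` on an indexed BAM), and the function name `mapq_summary` and the `low_mapq` threshold are made up for this example.

```python
def mapq_summary(sam_lines, low_mapq=0):
    """Return (mapped_alignments, alignments_with_MAPQ <= low_mapq).

    sam_lines: iterable of SAM text lines (header lines are skipped).
    Assumption: a MAPQ of 0 is used here as a proxy for multi-mapping.
    """
    total = 0
    low = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])   # column 2: bitwise FLAG
        mapq = int(fields[4])   # column 5: MAPQ
        if flag & 0x4:
            continue  # read is unmapped
        total += 1
        if mapq <= low_mapq:
            low += 1
    return total, low


# Hypothetical usage with three made-up alignments:
example = [
    "@SQ\tSN:chr15\tLN:101991189",
    "r1\t99\tchr15\t100\t60\t100M\t=\t300\t300\t*\t*",
    "r2\t99\tchr15\t120\t0\t100M\t=\t320\t300\t*\t*",
    "r3\t77\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]
total, low = mapq_summary(example)
print(f"{low} of {total} mapped alignments have MAPQ 0")
```

If most alignments in the region have MAPQ 0, a coverage tool that filters on mapping quality would report little or no coverage there even though reads are present.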
Thank you for your early response.
So I was looking into multi-mapping (is that when, for example, a gene is mapped both on a canonical chromosome and on a non-canonical one?). Qualimap does not report this (or does it? I haven't found this information).
Moreover, is multi-mapping related to duplicated reads?
I followed several Galaxy tutorials (they do not include GATK, so I didn't use that tool).
I found a workaround of sorts: if I compute the coverage using only the canonical chromosomes, then I get coverage in the region of interest.
Hence the follow-up question: should I align only to the canonical chromosomes?
What is the research question you are trying to address? The answer to the question above depends on that. You can decide what you want to do. I assume by canonical you mean the primary assembly (without haplotypes etc.)? People generally use that for most analyses.
Thank you for your reply.
What I want to do is search for indels and CNVs for a germline disorder such as Marfan syndrome or congenital adrenal hyperplasia. I only have one sample; I do not have the trio. Thus, I am referring only to the primary assembly in my analysis. However, in the alignment I do have reads aligned to haplotype chromosomes, etc.
So my question is: do I need to align the fastq files to the whole genome or just to the primary assembly? Or should I just filter by quality after the alignment?
As for your last sentence, I assume that for that purpose I just need to align to the primary assembly.
You can start with the primary assembly. I think it should be fine for your purpose but someone more experienced may offer a specific answer.
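To see which sequences in a reference are canonical chromosomes and which are haplotypes or other extra contigs, one option is to classify the contig names listed in the FASTA index (`.fai`). The following is a minimal sketch, assuming the contigs are named 1-22, X, Y and M/MT with or without a `chr` prefix; the pattern and the function names are illustrative and need adjusting if your reference uses a different naming scheme.

```python
import re

# Assumed naming scheme: 1-22, X, Y, M or MT, optionally prefixed with "chr".
# Anything else (e.g. names containing "_alt", "_random", "chrUn_") is
# treated as a non-canonical contig.
CANONICAL = re.compile(r"^(chr)?([1-9]|1[0-9]|2[0-2]|X|Y|M|MT)$")


def is_canonical(contig):
    """True if the contig name looks like a canonical chromosome."""
    return CANONICAL.match(contig) is not None


def split_contigs(fai_lines):
    """Split contig names from .fai lines into (canonical, other)."""
    canonical, other = [], []
    for line in fai_lines:
        if not line.strip():
            continue
        name = line.split("\t")[0].strip()  # first .fai column is the name
        if is_canonical(name):
            canonical.append(name)
        else:
            other.append(name)
    return canonical, other


# Hypothetical usage with a few hg38-style names:
fai = [
    "chr1\t248956422\t6\t60\t61",
    "chr15\t101991189\t0\t60\t61",
    "chr6_GL000250v2_alt\t4672374\t0\t60\t61",
    "chrUn_KI270302v1\t2274\t0\t60\t61",
]
canonical, other = split_contigs(fai)
print("canonical:", canonical)
print("other:", other)
```

The canonical list can then be used either to build a reduced reference or to restrict downstream steps to those contigs.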
Dear GenoMax, just one thing: since WES sequences exomes, should I align only to the exons of the primary assembly?
Align to the genome and then use the lists of captured regions that came with your kit/design to check on efficiency/coverage.
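As a rough illustration of checking coverage against the capture list, the sketch below computes the mean depth over all target bases and the fraction of target bases at or above a threshold. It assumes a BED file of targets (0-based, half-open, non-overlapping intervals) and per-base depth lines in the three-column format produced by `samtools depth` (chromosome, 1-based position, depth) already restricted to those targets, for example via its `-b` option. The function name `exome_coverage` and the `min_depth` default are invented for this example.

```python
def exome_coverage(bed_lines, depth_lines, min_depth=20):
    """Return (mean_depth, fraction_of_target_bases_with_depth >= min_depth).

    Assumptions: BED intervals do not overlap, and depth_lines contain only
    positions inside the targets. Target positions missing from depth_lines
    are counted as depth 0.
    """
    total_bases = 0
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        _chrom, start, end = line.split()[:3]
        total_bases += int(end) - int(start)  # BED is half-open: end - start

    if total_bases == 0:
        raise ValueError("no target bases found in BED input")

    depth_sum = 0
    bases_at_min = 0
    for line in depth_lines:
        if not line.strip():
            continue
        _chrom, _pos, depth = line.split()[:3]
        depth = int(depth)
        depth_sum += depth
        if depth >= min_depth:
            bases_at_min += 1

    # Mean depth is total sequenced depth divided by the number of target bases.
    return depth_sum / total_bases, bases_at_min / total_bases


# Hypothetical usage with two tiny made-up targets (6 bases in total):
bed = ["chr1\t0\t4", "chr2\t10\t12"]
depth = ["chr1\t1\t30", "chr1\t2\t30", "chr1\t3\t10", "chr1\t4\t0", "chr2\t11\t20"]
mean_depth, frac = exome_coverage(bed, depth)
print(f"mean depth {mean_depth:.1f}, {frac:.0%} of target bases at >= 20x")
```

Note that an exome-wide figure is an average over the target bases (sum of depths divided by the number of bases), not a per-base value multiplied by the number of bases.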
Thanks for your reply. I have a question. plotCoverage from deepTools gives per-base coverage. How could I obtain it for the whole exome? Multiplying by the total number of bases gives a ridiculous result.