One of the main analyses in bioinformatics today is RNA-seq data processing, and one of the first steps is to align or map reads (I will talk about alignment here) against a reference genome or transcriptome.
I work on mouse, but note that my question applies to other well-known species too. I retrieve my genome from the GRC.
In this file, I have listed all the entries, which I classified as follows (in mouse) with the help of this documentation:
- "Conventional chromosomes" (chr1-19, chrX, chrY)
- Primary assembly (chr1-19, chrX, chrY + unlocalized sequences (JH584293.1), unplaced sequences (GL456394.1))
- Genome Patches (Fixed patch (KV575232.1), Novel patch (KK082441.1))
- "Unknown from NCBI" (WSB_EIJ_MMCHR11_CTG1)
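To make the classification concrete, here is a minimal sketch of how I separate the names, assuming the conventional chromosomes are named `chr1`-`chr19`, `chrX`, `chrY` (with or without the `chr` prefix). The `classify` function and the `EXAMPLE_CATEGORIES` table are hypothetical helpers that only contain the example entries from the list above; a complete table would have to come from the assembly metadata.

```python
import re

# Mouse conventional chromosomes: 1-19, X, Y, with an optional "chr" prefix.
# The naming scheme is an assumption; adapt it to the FASTA headers you have.
CONVENTIONAL = re.compile(r"^(chr)?([1-9]|1[0-9]|X|Y)$")

# Only the example entries quoted in the list above (not a complete table).
EXAMPLE_CATEGORIES = {
    "JH584293.1": "unlocalized sequence",
    "GL456394.1": "unplaced sequence",
    "KV575232.1": "fixed patch",
    "KK082441.1": "novel patch",
    "WSB_EIJ_MMCHR11_CTG1": "unknown from NCBI",
}


def classify(name):
    """Return the category of a sequence name, or 'not classified'."""
    if CONVENTIONAL.match(name):
        return "conventional chromosome"
    return EXAMPLE_CATEGORIES.get(name, "not classified")


for name in ["chr1", "chrX", "JH584293.1", "KK082441.1"]:
    print(name, "->", classify(name))
```

Note that the name alone does not tell an unlocalized sequence from an unplaced one, or a fixed patch from a novel one, which is why the sketch uses a lookup table for everything that is not a conventional chromosome.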
From your experience, just before the alignment, in which cases do you filter out patched chromosomes, unlocalized sequences, unplaced sequences or unknown sequences (let's call all of these "not conventional chromosomes"), and in which cases do you not?
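By "filter out" I mean something like the following sketch, which drops every FASTA record that is a "not conventional chromosome" before building the aligner index. It assumes the sequence name is the first token of each header line and that conventional chromosomes are named `chr1`-`chr19`, `chrX`, `chrY` (with or without the `chr` prefix); `filter_fasta` is a hypothetical helper and the records are toy data, not real sequence.

```python
import re

# Assumed naming of the mouse conventional chromosomes (optional "chr" prefix).
CONVENTIONAL = re.compile(r"^(chr)?([1-9]|1[0-9]|X|Y)$")


def filter_fasta(lines, keep=CONVENTIONAL):
    """Yield only the FASTA lines of records whose name matches `keep`."""
    keeping = False
    for line in lines:
        if line.startswith(">"):
            # Sequence name = first whitespace-delimited token after ">".
            fields = line[1:].split()
            name = fields[0] if fields else ""
            keeping = bool(keep.match(name))
        if keeping:
            yield line


# Toy records for illustration only.
example = [
    ">chr1 toy description\n", "ACGT\n",
    ">JH584293.1\n", "GGGG\n",
    ">chrX\n", "TTTT\n",
]
filtered = list(filter_fasta(example))
print("".join(filtered), end="")
```

With a real genome, the same generator could be fed an open file handle and its output written to a new FASTA file.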
To conclude, for my RNA-seq data processing, I want to keep as many reads as possible on the "Conventional chromosomes" to create a Circos plot.