Forum:Filtering out chromosomes from reference genome
0
3
3.0 years ago

Hello everyone,

One of the main analyses in bioinformatics today is RNA-seq data processing, and one of the first steps is to align or map (I will talk about alignment here) reads against a reference genome or transcriptome.

I work on mouse, but my question applies to other well-known species too. I retrieve my genome from GRC.

In this file, I have listed all the entries, which I classified as follows (in mouse) with the help of this documentation:

• "Conventional chromosomes" (chr1-19, chrX, chrY)
• Primary assembly (chr1-19, chrX, chrY + unlocalized sequences (JH584293.1), unplaced sequences (GL456394.1))
• Genome Patches (Fixed patch (KV575232.1), Novel patch (KK082441.1))
• "Unknown from NCBI" (WSB_EIJ_MMCHR11_CTG1)

In your experience, just before the alignment, in which cases do you filter out patched chromosomes, unlocalized sequences, unplaced sequences or unknown sequences (let's call all of these "non-conventional chromosomes"), and in which cases do you not?

At the end of my RNA-seq data processing, I want to keep as many reads as possible on the "Conventional chromosomes" to create a Circos plot.

Related posts:

Looking for a thorough annotation for non-primary assembly units in GRCm38

Remove patches from gtf file?

RNA-Seq Alignment Forum • 3.0k views
1

See also this blog post by Heng Li: Which human reference genome to use?

0

Until now I have used Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz for mapping my mRNA-Seq data for alternative splicing analysis. About this file, the blog says:

1. Inclusion of multi-placed sequences. In both GRCh37 and GRCh38, the pseudo-autosomal regions (PARs) of chrX are also placed on to chrY. If you use a reference genome that contains both copies, you will not be able to call any variants in PARs with a standard pipeline. In GRCh38, some alpha satellites are placed multiple times, too. The right solution is to hard mask PARs on chrY and those extra copies of alpha repeats

Would switching the reference genome in my case lead to higher mapping fidelity? Would these differences affect my analysis?

0

Thanks for this link. It helps me see that filtering out chromosomes is dangerous for variant calling because of false positives, but it does not help me much with RNA-seq data. Does everyone have their own way of doing it?

1

I want to keep as many reads as possible on the "Conventional chromosomes" to create a Circos plot.

Unless you are interested in haplotype-specific expression and/or those other regions, you could just keep the main chromosomes.

Have you checked to see what fraction of reads map to other categories of sequence? How do you handle multi-mappers in your current protocol?
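One way to check that fraction is sketched below. It assumes UCSC-style sequence names (chr1, chrX, chrM, with everything else such as chrUn_JH584304 counted as "other") and plain-text SAM input; the function names `classify` and `count_categories` are made up for illustration and do not come from any library.

```python
import re
from collections import Counter

# Assumption: UCSC-style names; numbered chromosomes plus X, Y and M count as conventional.
CONVENTIONAL = re.compile(r"^chr(\d+|X|Y|M)$")


def classify(rname):
    """Return a coarse category for a SAM reference sequence name."""
    if rname == "*":
        return "unmapped"
    if CONVENTIONAL.match(rname):
        return "conventional"
    return "other"  # unlocalized, unplaced, patches, unknown


def count_categories(sam_lines):
    """Count primary alignment records per category from SAM text lines."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # not a complete SAM record
            continue
        flag = int(fields[1])
        if flag & 0x900:  # skip secondary (0x100) and supplementary (0x800) records
            continue
        if flag & 0x4:  # read flagged as unmapped
            counts["unmapped"] += 1
            continue
        counts[classify(fields[2])] += 1
    return counts
```

Passing an open SAM file handle to `count_categories` gives the per-category counts, and `counts["other"] / sum(counts.values())` is then the fraction of reads landing outside the main chromosomes.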

0

No, I'm not interested in haplotype-specific expression.

The fraction of reads mapped to these chromosomes is 70-100 reads per 1 000 000. I know it is not a lot and I could process my data without them, but I want something very clean.

Multi-mapped reads are allowed in the alignment but will be discarded downstream.

For example, I got this read, which is also multi-mapped on the same chromosome. Either way, I'll lose it later in my analysis because I filter on "conventional" chromosomes. My question was: is it correct to remove these "non-conventional" chromosomes to give this read a chance to map to a "conventional" one?

At the very least, if it doesn't align, it will be discarded by the aligner.

M02945:167:000000000-AVYLG:1:2101:16288:6272    16  chrUn_JH584304  73160   11  230M21S *   0   0 ATTTTCAATTTTCTTTCCATGTTCCACGTCCTACAGTGGACATTTCTAAATATTCCAACTTTTTCAGTTTTCCTCGCCGTATTTCATGTCCTAAAGTGTGTATTTCCCATTTTCCGTGATTTTCAGTTTTCTCGCCATATTCCATGTACTACAGTGTGCATTTCTCATTTTTCAAGTTTTTCAGTGATTTCGTCATTTTTCAAGTCGTCAAGTGGATATTTCTCATTTTCTCAAGATTTTCTGACTTAGCA /BFFFB/FFFFFFBGFFFFF;C9EC;;/FBF;9:;0;;0;;;000HHHG=;<;0;0GC.FC:0<DGGHHHGGGCFCGHHHHHHHHHHHHHHHHHHGHHHHHHHHGFFHHHGGGGGHHHHHHHHHHHHHGGGGGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGHGHHGHHHHHHGGGGHGHHHHHHHHGGGGGHHHHHHHHHHHHHHHHFHHHHGHGGGGGGGGGGFFFFFFFBBBBB   AS:i:294    XS:i:287    XN:i:0  XM:i:19 XO:i:0  XG:i:0  NM:i:19 MD:Z:7G7G4A5T24T5C1A26C8A10T1T17C17G2C1C24C7T20T13G12    YT:Z:UU
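The optional fields in a record like this one (AS, XS, YT) look like Bowtie 2 output, where AS:i is the alignment score of the reported hit and XS:i is the score of the second-best hit, if one was found. As a rough sketch under that assumption, the tags can be pulled out to see how close the second hit is; `parse_record` and `score_gap` are hypothetical helper names, and splitting on any whitespace is a shortcut for records pasted from a forum rather than a general SAM parser.

```python
def parse_record(sam_line):
    """Return (reference name, MAPQ, dict of optional tags) for one SAM record."""
    # SAM is tab-delimited; split() on any whitespace also copes with pasted
    # text where tabs became spaces (assumes no tag value contains a space).
    fields = sam_line.split()
    tags = {}
    for item in fields[11:]:
        name, value_type, value = item.split(":", 2)
        tags[name] = int(value) if value_type == "i" else value
    return fields[2], int(fields[4]), tags


def score_gap(tags):
    """Return AS - XS, or None when no second-best alignment was reported."""
    if "AS" not in tags or "XS" not in tags:
        return None
    return tags["AS"] - tags["XS"]
```

For the read above, AS:i:294 and XS:i:287 give a gap of 7, which fits the low MAPQ of 11: the aligner found a second location that scores almost as well.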


My main question was a general one: in what cases is it legitimate to filter out some chromosomes before doing an alignment?

2

My question was: is it correct to remove these "non-conventional" chromosomes to give this read a chance to map to a "conventional" one?

That is an interesting philosophical question. In the grand scheme of things, having ~100 reads (out of a million) map to locations where they should not have mapped is not going to make or break the experiment. There are many assumptions that go into these experiments. With mice, even the strain you think you are working with has been subject to some ambiguity (interesting results with recent NeoGen genotyping chips). Thus even the reference you are using may not be the most appropriate in the first place, but let us not go there.

In what cases is it legitimate to filter out some chromosomes before doing an alignment?

As long as you are not discarding entire autosomes/sex chromosomes/MT it should be ok to filter out other unassigned DNA to simplify your life. I believe that is what iGenomes does with the bundles they provide.
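A minimal sketch of that kind of filtering, assuming the FASTA headers start with either Ensembl-style names (1, X, MT) or UCSC-style names (chr1, chrX, chrM); the pattern `KEEP` and the function `filter_fasta` are illustrative names only, and the pattern should be adapted to the headers in your own file.

```python
import re

# Assumption: numbered chromosomes plus X, Y and the mitochondrial genome,
# with or without a "chr" prefix. Everything else is dropped.
KEEP = re.compile(r"^(chr)?(\d+|X|Y|M|MT)$")


def filter_fasta(lines, keep=KEEP):
    """Yield the lines of FASTA records whose sequence name matches `keep`."""
    writing = False
    for line in lines:
        if line.startswith(">"):
            # The sequence name is the first word after ">".
            parts = line[1:].split()
            name = parts[0] if parts else ""
            writing = bool(keep.match(name))
        if writing:
            yield line
```

With two open file handles, `dst.writelines(filter_fasta(src))` writes the reduced genome. The aligner index has to be rebuilt from the filtered file, and annotation entries on the removed sequences no longer have a matching reference.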

0

I got the main idea, thank you all!