Question: Should scaffolds and other unplaced sequences be removed before mapping?
3.7 years ago, yongxpeng (China/Beijing/PKU) wrote:

Hi,
I downloaded reference genomes from Ensembl (FASTA format), but they contain many sequences labeled "dna:scaffold": https://github.com/CTLife/TEMP/tree/master/RefGenomes

For example, in Mouse_GRCm38 (mm10), should everything other than chromosomes 1-19, Mt, X and Y be removed before mapping? https://github.com/CTLife/TEMP/blob/master/RefGenomes/Mouse_GRCm38.p4.txt

Similarly, Human_GRCh38.p5 (hg38) contains 516 sequences: https://github.com/CTLife/TEMP/blob/master/RefGenomes/Human_GRCh38.p5.txt. Should everything other than chromosomes 1-22, Mt, X and Y (for example CHR_HG2241_PATCH and KI270728.1) be removed before mapping?
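
To see how many sequences of each kind a FASTA file contains, the headers can be tallied by their type label. The sketch below is only an illustration: it assumes the type ("dna:chromosome", "dna:scaffold", ...) is the second whitespace-separated field of each header, and the function name and toy headers are made up for the example.

```python
from collections import Counter


def count_sequence_types(fasta_lines):
    """Count FASTA records by the second header field (e.g. 'dna:scaffold').

    Assumption: headers look like '>NAME TYPE ...'. Headers with a single
    field are counted under 'unknown'.
    """
    counts = Counter()
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            seq_type = fields[1] if len(fields) > 1 else "unknown"
            counts[seq_type] += 1
    return counts


# Toy input, not real genome data.
example = [
    ">1 dna:chromosome\n",
    "ACGT\n",
    ">KI270728.1 dna:scaffold\n",
    "ACGT\n",
    ">X dna:chromosome\n",
    "ACGT\n",
]
type_counts = count_sequence_types(example)
for seq_type, n in sorted(type_counts.items()):
    print(seq_type, n)
```

For a real genome, pass an open file handle instead of the toy list, e.g. `count_sequence_types(open("genome.fa"))` (the file name is a placeholder).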

3.7 years ago, abascalfederico (Spain) wrote:

The latest release of the human genome (I don't know about mouse) contains alternative contigs. To map against them you will need an alternative-contig-aware aligner such as BWA: https://github.com/lh3/bwa/blob/master/README-alt.md

If you are not using an aligner of this kind, it is better to remove the alternative contigs. Otherwise a read may map to multiple alternative contigs and be (incorrectly) treated as a non-uniquely mapped read.
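
A minimal sketch of removing the extra sequences before mapping: keep only the FASTA records whose name (the first token of the header) is in a list of wanted chromosomes. The function name, the toy records and the name set are illustrative assumptions; in particular, check how the mitochondrial sequence is named in your file ("MT" is assumed here) and adjust the set.

```python
def filter_fasta(lines, keep_names):
    """Yield only the lines of FASTA records whose name is in keep_names.

    The record name is taken to be the first whitespace-separated token
    after '>' in the header line.
    """
    keep = False
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            keep = bool(fields) and fields[0] in keep_names
        if keep:
            yield line


# Assumed names for the human primary chromosomes: 1-22, X, Y and MT.
PRIMARY = {str(i) for i in range(1, 23)} | {"X", "Y", "MT"}

# Toy input, not real genome data.
toy = [
    ">1 dna:chromosome\n",
    "ACGT\n",
    ">CHR_HG2241_PATCH dna:chromosome\n",
    "GGCC\n",
    ">KI270728.1 dna:scaffold\n",
    "TTAA\n",
]
filtered = list(filter_fasta(toy, PRIMARY))
print("".join(filtered), end="")
```

On a real file the same generator can be fed an open input handle and its output written to a new FASTA file, which is then indexed with the aligner as usual.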

HTH


OK, thank you. I am using BWA, Bowtie2 and Subread to map ChIP-seq reads. For RNA-seq reads, do the alternative contigs have to be removed?
What do you think of https://sequencing.qcfail.com/articles/genomic-sequence-not-in-the-genome-assembly-creates-mapping-artefacts/ ? It gives a nice explanation of why we might not want to remove those extra sequences until after mapping.

(reply written 3.7 years ago by yongxpeng)

If I understood correctly, that link is about repetitive sequences, not alternative contigs.

For RNA-seq, it depends. For example, if you want to analyse HLA genes, which are highly diverse, you would need the alternative contigs. I guess most people just ignore alternative contigs because of the added complexity.

(reply written 3.7 years ago by abascalfederico)