Question

Using Specific Regions of Genome as Reference for Alignment Using STAR

0

18 months ago

hkarakurt ▴ 180

Hello everyone, I have a lot of transcriptome data (RNA-Seq and full-transcript single-cell RNA-Seq) that I need to align to the genome using STAR, but I have some storage problems.

I will focus on a group of genes in downstream analyses. Is it possible to align the reads to specific parts of the genome (a gene group) with STAR, or to create a custom reference from the genome FASTA that includes only specific regions and use that as the reference?

Thank you in advance

RNA-Seq Alignment STAR • 1.4k views

ADD COMMENT • link updated 18 months ago by Buffo ★ 2.4k • written 18 months ago by hkarakurt ▴ 180

1

18 months ago

Pierre Lindenbaum 161k

Please don't. See the post "Exome Sequencing: Masking The Non-Genic Sequences ?".

0

So restricting the reference at the alignment step would, in practice, produce misaligned reads. I will align all reads to the whole genome and then extract the regions of interest using a BED file. I believe this will not create false positives or misaligned reads (at least not as many as in the previous scenario).

Thank you for your answers.

ADD REPLY • link 18 months ago by hkarakurt ▴ 180

0

18 months ago

Buffo ★ 2.4k

If the problem is storage capacity, you can filter the BAM file down to the regions of interest; see the post "Extract Reads From A Bam File That Fall Within A Given Region".
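A minimal sketch of that filtering step, assuming samtools is installed and the regions of interest are listed in a BED file. The file names are hypothetical, and by default the helper only returns the command line instead of running it. `samtools view -L` keeps the alignments that overlap the BED intervals.

```python
import shlex
import subprocess


def region_filter_command(in_bam, bed, out_bam):
    """Build a samtools command that keeps only reads overlapping the BED regions."""
    # -b: write BAM, -L: restrict output to intervals in a BED file, -o: output path
    return ["samtools", "view", "-b", "-L", bed, "-o", out_bam, in_bam]


def filter_bam(in_bam, bed, out_bam, dry_run=True):
    """Return the command line; run it only when dry_run is False."""
    cmd = region_filter_command(in_bam, bed, out_bam)
    if not dry_run:
        subprocess.run(cmd, check=True)  # requires samtools on PATH
    return shlex.join(cmd)


# Hypothetical file names, for illustration only.
print(filter_bam("sample1.bam", "genes_of_interest.bed", "sample1.subset.bam"))
```

Once the subset BAM has been checked, the full BAM can be deleted to free space, with the caveat that anything outside the BED regions is then gone for good.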

0

I would not do that. With only a few genes you cannot do a meaningful analysis: normalization and DE, for example, need a fair number of genes to be robust, and this matters even more at the single-cell level for QC purposes. If storage is limited, then do it file by file: get the BAM, then the count matrix for a single sample, delete the BAM, and move on to the next one. Eventually, concatenate the matrices into a single one. Or use salmon for everything, which produces counts directly.
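The file-by-file idea could be sketched roughly like this. `align` and `count` are hypothetical callables standing in for STAR and whatever counting tool is used; the demo at the bottom replaces them with toy functions so the loop can be run without either tool.

```python
import os
import tempfile
from collections import defaultdict


def process_samples(samples, align, count):
    """Align and count one sample at a time, deleting each BAM once it is counted.

    align(sample) must return the path of the BAM it wrote, and
    count(bam_path) must return a {gene: count} dict (both hypothetical).
    """
    matrix = defaultdict(dict)  # gene -> {sample: count}
    for sample in samples:
        bam_path = align(sample)
        for gene, n in count(bam_path).items():
            matrix[gene][sample] = n
        os.remove(bam_path)  # free the disk space before the next sample
    # One row per gene, one column per sample, zeros where a gene was not seen.
    return {gene: [by_sample.get(s, 0) for s in samples]
            for gene, by_sample in matrix.items()}


# Demo with stand-in functions and made-up counts (no real alignment happens).
workdir = tempfile.mkdtemp()
toy_counts = {"s1": {"geneA": 5}, "s2": {"geneA": 2, "geneB": 7}}


def fake_align(sample):
    path = os.path.join(workdir, sample + ".bam")
    open(path, "w").close()  # empty placeholder file
    return path


def fake_count(bam_path):
    sample = os.path.basename(bam_path)[: -len(".bam")]
    return toy_counts[sample]


counts = process_samples(["s1", "s2"], fake_align, fake_count)
print(counts)
```

At most one BAM is on disk at any time, so peak storage is roughly one sample's alignment plus the small count matrices.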

ADD REPLY • link 18 months ago by ATpoint 82k

0

I wouldn't do that either:

Is it possible to align the reads to specific parts of the genome (a gene group) with STAR

I only proposed an alternative to the problem assuming that further analysis doesn't need information about other regions:

But I have some storage problems.

That might be a better solution (you should post it as an answer to the question, not to my answer):

If storage is limited then do it file by file, get bam, then the count matrix for a single sample. Delete bam, next one.

ADD REPLY • link 18 months ago by Buffo ★ 2.4k

score 3 · Accepted Answer · 2022-11-09

3

18 months ago

GenoMax 142k

That is possible, but the question is whether it is appropriate to do so.

If your data come from the entire genome/transcriptome, then using a reduced-representation reference always carries the risk that STAR will align reads to a location they did not originate from.
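A toy illustration of that risk, under loud assumptions: the sequences are made up, and the matcher is a naive ungapped mismatch counter, not STAR's algorithm. A read from a paralog (geneB) finds its true origin in the full reference, but is placed on geneA with one mismatch when geneB is left out.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def best_hit(read, reference, max_mismatches=2):
    """Return (name, offset, mismatches) of the best ungapped placement, or None."""
    best = None
    for name, seq in reference.items():
        for i in range(len(seq) - len(read) + 1):
            mm = hamming(read, seq[i:i + len(read)])
            if mm <= max_mismatches and (best is None or mm < best[2]):
                best = (name, i, mm)
    return best


# Made-up sequences: geneB is a paralog of geneA that differs at one base.
full_reference = {
    "geneA": "ACGTACGTTAGCCGATAGCT",
    "geneB": "ACGTACGTTAGCGGATAGCT",
}
reduced_reference = {"geneA": full_reference["geneA"]}

read = "GTTAGCGGATAG"  # taken from geneB, offset 6

print(best_hit(read, full_reference))     # ('geneB', 6, 0)
print(best_hit(read, reduced_reference))  # ('geneA', 6, 1)
```

With the reduced reference the read still aligns, just to the wrong gene, and nothing in the output flags it as suspicious.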
