Question: Proper Way To Map RNA-Seq Data Against A Single (Or Small) Number Of Genes
3
5.1 years ago by
Jason60
United States
Jason60 wrote:

I have a large Illumina RNA-Seq dataset, and I have already mapped it to the reference genome using STAR and done quantification. But now I want to look at expression of GFP which is not native to the species (as this is a transgenic mouse).

I imagine the 'proper' way to do this is to create a new reference genome with the GFP gene added as an extra chromosome. But this would then require a lot of duplicated work, space, and time.

What I tried was to create a new reference index containing only the GFP gene and then align against that, but STAR creates a 1.5 GB index for this single gene. And what if I want to do this with more genes? This seems to be using STAR outside the type of work it was originally designed for. Or is this in fact the correct approach?

EDIT:

Am I missing anything obvious here, like using BLAST or BLAT (I don't have any experience with these older tools)? Thanks.
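
On the 1.5 GB index: this is most likely a consequence of STAR's default `--genomeSAindexNbases 14`, which is sized for large genomes. The STAR manual suggests scaling it down for small references to min(14, log2(GenomeLength)/2 - 1). Below is a minimal sketch of that calculation. The 720 bp GFP length and the names `gfp.fa` and `gfp_index` are assumptions for illustration only.

```python
import math


def star_sa_index_nbases(genome_length: int) -> int:
    """Suggested --genomeSAindexNbases: min(14, log2(GenomeLength)/2 - 1)."""
    if genome_length < 1:
        raise ValueError("genome_length must be a positive number of bases")
    suggested = int(math.log2(genome_length) / 2 - 1)
    # Clamp to at least 1 so that very short references still get a valid value.
    return max(1, min(14, suggested))


gfp_length = 720  # assumed length of the GFP sequence, in bases
n_bases = star_sa_index_nbases(gfp_length)

# Command line for building the small index (file names are hypothetical).
star_cmd = [
    "STAR",
    "--runMode", "genomeGenerate",
    "--genomeDir", "gfp_index",
    "--genomeFastaFiles", "gfp.fa",
    "--genomeSAindexNbases", str(n_bases),
]
print(" ".join(star_cmd))
```

The list can be passed to `subprocess.run` once STAR is installed; it is only printed here so the sketch runs without it.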

rnaseq gene alignment mapping • 4.9k views
modified 3.5 years ago by cpad011211k • written 5.1 years ago by Jason60

Is GFP fused to something or is it being expressed by itself? You might just try bowtie2 or bwa, which should have smaller indexes and be fast enough for your purposes.

BTW, do you have the unmapped reads (this is an option for STAR)?
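
For reference, the STAR option for keeping unmapped reads is `--outReadsUnmapped Fastx`, which writes them to `Unmapped.out.mate1` (plus `mate2` for paired-end data) under the output prefix. A rough sketch of how the mapping command could be assembled follows. The index directory, read file and prefix names are hypothetical.

```python
def star_align_keep_unmapped(genome_dir, reads_fastq, out_prefix, threads=4):
    """Build a STAR mapping command that also writes the unmapped reads."""
    return [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--readFilesIn", reads_fastq,
        "--outFileNamePrefix", out_prefix,
        "--outReadsUnmapped", "Fastx",
    ]


# Hypothetical names for a single-end sample.
cmd = star_align_keep_unmapped("mouse_index", "sample.fastq", "sample_")

# STAR appends this fixed file name to the output prefix.
unmapped_reads = "sample_" + "Unmapped.out.mate1"
print(" ".join(cmd))
print("Unmapped reads would be written to:", unmapped_reads)
```

Only the unmapped reads would then need to be aligned against GFP, which is far quicker than realigning the whole dataset.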

written 5.1 years ago by Devon Ryan89k

Expressed by itself. Does that make a difference? And no, I didn't save the unmapped reads from the original mapping.

written 5.1 years ago by Jason60

Only in that if it were fused to something else then you might get somewhat better results by putting the fusion protein in. Otherwise, no, that doesn't matter too much. Too bad you didn't save the unmapped reads, that would have made life simple :)

written 5.1 years ago by Devon Ryan89k

Wouldn't that affect the alignment rate, so the counts from the native genes wouldn't be comparable to the GFP counts?

written 3.7 years ago by igor7.6k

Why All The Capitals haha ;-) ?

modified 3.5 years ago • written 3.5 years ago by Irsan6.8k
1
5.1 years ago by
seidel6.8k
United States
seidel6.8k wrote:

I've done this with bowtie to count GFP or the ERCC spike-in controls. A bowtie index of GFP and a few other genes came out to 4 MB.
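
A sketch of what this could look like with bowtie2 (an assumption: the post above says bowtie, and the exact commands used are not given). `bowtie2-build` creates the index, and `bowtie2` aligns single-end reads against it, with `--no-unal` dropping unaligned reads from the output. All file names are hypothetical.

```python
import shlex


def bowtie2_commands(fasta, index_prefix, reads_fastq, out_sam, threads=4):
    """Return (build, align) command lists for a small reference."""
    build = ["bowtie2-build", fasta, index_prefix]
    align = [
        "bowtie2",
        "-p", str(threads),
        "--no-unal",          # do not write reads that failed to align
        "-x", index_prefix,   # index built by the first command
        "-U", reads_fastq,    # single-end reads
        "-S", out_sam,        # SAM output
    ]
    return build, align


build_cmd, align_cmd = bowtie2_commands(
    "gfp_and_spikeins.fa", "gfp_index", "sample.fastq", "sample_gfp.sam"
)
print(shlex.join(build_cmd))
print(shlex.join(align_cmd))
```

The number of alignments in the resulting SAM file per reference sequence then gives the raw counts.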

written 5.1 years ago by seidel6.8k

I didn't think the indices would be so much smaller, but I guess the Burrows-Wheeler transform of a small sequence is itself small (unlike the seed hash tables of STAR). Thanks!

written 5.1 years ago by Jason60
0
3.5 years ago by
France
mathieu.bahin40 wrote:

Hi,

I have a similar question. I have a TE FASTA file (that I got from bedtools) that looks like this:

>Chr1:11896-11976(+)
CCCTTTCTTAGCAAATTGATCATCATCGCCATCATCACCATCATCATTATCATCATCATGATCAGTCGATAAATTTAGTC
>Chr1:16882-17009(-)
TTACACCCCATACCTTCCTAGTTTTATCTATGTACGTAGCAGCTTTTTAAAACGACCAAATTCTTAGCATTTCTCTATGGCTATAGGACAGTACGTTGTATAGAAAAGTTTAAATTGAAAAACAAAA
>Chr1:17023-18924(+)
TTAGGAAATACATTTTAAATAT...

How can I index this 'genome' with STAR?

I would like to map reads to it. The TEs are part of the original complete FASTA file, so maybe identifying them after mapping to the whole genome is a better way?

Cheers,

Mathieu

modified 3.5 years ago • written 3.5 years ago by mathieu.bahin40
1

Please post things like this as new questions.

I would recommend that you do the following:

  1. Delete the TE fasta file, you don't need it.
  2. Align against the whole genome.
  3. Use the BED file that you used with BEDtools to subset the alignments according to whether they overlap one of your TEs.

Doing it that way will produce fewer false positives and a higher overall alignment rate.
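
To illustrate step 3, here is a minimal pure-Python sketch of the overlap test. In practice a tool such as `bedtools intersect` or `samtools view -L` would do this directly on the BAM file. The two BED intervals are taken from the FASTA headers in the question, the read positions are made up, and all coordinates are treated as 0-based, half-open (the BED convention).

```python
from collections import defaultdict


def load_bed(lines):
    """Parse BED lines into {chrom: [(start, end), ...]} (0-based, half-open)."""
    intervals = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        intervals[chrom].append((int(start), int(end)))
    return intervals


def overlaps_any(intervals, chrom, start, end):
    """True if [start, end) shares at least one base with an interval on chrom."""
    return any(s < end and start < e for s, e in intervals.get(chrom, ()))


te_bed = [
    "Chr1\t11896\t11976\tTE1\t0\t+",
    "Chr1\t16882\t17009\tTE2\t0\t-",
]
tes = load_bed(te_bed)

# Hypothetical alignments: (read name, chromosome, start, end).
alignments = [
    ("read1", "Chr1", 11950, 12050),
    ("read2", "Chr1", 12000, 12100),
    ("read3", "Chr1", 17000, 17100),
]
kept = [name for name, chrom, start, end in alignments
        if overlaps_any(tes, chrom, start, end)]
print(kept)
```

A linear scan per read is fine for a sketch; for a real dataset the dedicated tools above use indexed lookups and are much faster.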

written 3.5 years ago by Devon Ryan89k
0
3.5 years ago by
cpad011211k
India
cpad011211k wrote:

My understanding is that STAR (RNA-STAR) can index multiple FASTA files into one genome directory. You can probably keep the host genome and GFP as separate FASTA files (with corresponding GTF files) and index them together. Then check whether STAR aligns reads to the GFP reference.
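
A rough sketch of this idea, under some assumptions: `--genomeFastaFiles` accepts several FASTA files, while `--sjdbGTFfile` takes a single annotation file, so the GFP record would need to be appended to a copy of the host GTF. The helper below writes a single-exon GTF line covering the whole GFP sequence. The sequence name `GFP`, the 720 bp length, the `custom` source label and all file names are hypothetical.

```python
def transgene_gtf_line(seq_name, length, gene_id):
    """One single-exon GTF record spanning the whole transgene (1-based, inclusive)."""
    attributes = f'gene_id "{gene_id}"; transcript_id "{gene_id}";'
    fields = [seq_name, "custom", "exon", "1", str(length), ".", "+", ".", attributes]
    return "\t".join(fields)


def star_generate_command(genome_dir, fasta_files, gtf_file):
    """Build a STAR genomeGenerate command for several FASTA files and one GTF."""
    return [
        "STAR",
        "--runMode", "genomeGenerate",
        "--genomeDir", genome_dir,
        "--genomeFastaFiles", *fasta_files,
        "--sjdbGTFfile", gtf_file,
    ]


# Line to append to a copy of the host annotation (names and length are assumed).
gfp_record = transgene_gtf_line("GFP", 720, "GFP")
print(gfp_record)

cmd = star_generate_command(
    "mouse_plus_gfp_index", ["mouse_genome.fa", "gfp.fa"], "mouse_plus_gfp.gtf"
)
print(" ".join(cmd))
```

With GFP in the same index and annotation as the host genes, its counts come out of the same alignment run, which keeps them directly comparable.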

written 3.5 years ago by cpad011211k