Question: RNASeq alignment and quantification for a transgenic mouse
viktorfeketa wrote (15 months ago):

I have RNASeq data coming from a transgenic mouse (where a single gene's coding sequence is replaced by the gene sequence from another organism). I need to quantify the expression of this transgene (get the number of aligned reads). It seems to me that the most comprehensive and accurate approach would be the following:

1) generate a custom reference genome (.fasta file), where the respective sequence is replaced with the new transgenic sequence
2) modify the entries for this gene in the gene annotation (.gtf)
3) use the modified reference genome and gene annotation to do the alignment and gene quantification.

Does that sound like the correct approach, or are there some issues that I don't see?

Also, I am having trouble implementing this plan. For #1, I couldn't find a good tool for replacing one sequence with another in a .fasta file - could you recommend something? I am not proficient in Python, so I would prefer a ready-made tool.

Another concern is that, because the original and new sequences have different lengths, every annotation coordinate downstream of the replacement (at least on that chromosome) will be shifted relative to the modified reference genome. How can this be resolved?
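To illustrate the coordinate problem described above, here is a minimal Python sketch of how GTF coordinates could be shifted after a replacement. It assumes a standard tab-separated, 1-based GTF; the function name `shift_gtf_line` and its parameters are made up for this example and are not part of any existing tool. Features that overlap the replaced interval are not adjusted and would need to be re-annotated by hand.

```python
def shift_gtf_line(line, chrom, old_start, old_end, new_len):
    """Adjust one GTF line after chrom:old_start-old_end (1-based, inclusive)
    has been replaced by a sequence of length new_len.

    Returns the (possibly shifted) line without a trailing newline, or None
    if the feature overlaps the replaced interval and needs manual curation.
    """
    line = line.rstrip("\n")
    if line.startswith("#") or not line:
        return line  # header/comment lines are passed through unchanged
    fields = line.split("\t")
    if fields[0] != chrom:
        return line  # other chromosomes are unaffected
    start, end = int(fields[3]), int(fields[4])
    delta = new_len - (old_end - old_start + 1)  # net change in length
    if end < old_start:
        return line  # feature lies entirely upstream of the replacement
    if start > old_end:
        # feature lies entirely downstream: shift both coordinates
        fields[3], fields[4] = str(start + delta), str(end + delta)
        return "\t".join(fields)
    return None  # overlaps the replaced region


# Hypothetical example: a 500 bp region replaced by an 800 bp sequence.
example = 'chr1\tsrc\texon\t2000\t2100\t.\t+\t.\tgene_id "g1";'
shifted = shift_gtf_line(example, "chr1", 1000, 1499, 800)
```

Only the chromosome carrying the replacement needs adjusting; all other chromosomes keep their original coordinates.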

Could anyone suggest other general approaches? Would it be more straightforward to use just a single gene sequence as a reference, and align the whole dataset to it? If yes, then what tool would you recommend?

modified 15 months ago by swbarnes2 • written 15 months ago by viktorfeketa

I need to quantify the expression of this transgene (get the number of aligned reads).

Are you not interested in what happens to the rest of the mouse transcriptome (probably not, but good to confirm)? Was the gene you replaced a single copy, with no other genes of similar sequence elsewhere in the genome? Was the replacement confirmed to be a single copy (or is that something you need to check)?

written 15 months ago by genomax

1) We are definitely interested in the whole transcriptome, but for this it seems to me that the usual RNAseq pipeline with the original wild-type genome is sufficient: only the targeted gene will be affected, and reads for all the other genes should be aligned/quantified normally. Isn't that right?
2) The replacement was not confirmed as a single copy - this would be useful to check. How does that affect the approach?

written 15 months ago by viktorfeketa
swbarnes2 (United States) wrote (15 months ago):

What might be simpler, and work just about as well, is to leave the original genome intact, and just append the transgene as a new chromosome to the genome. Then remake the index, and add the new entry to the gff.
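As a rough illustration of the "append as a new chromosome" step, here is a minimal Python sketch. The function name `append_contig`, the contig name, and the file names are hypothetical; it assumes an uncompressed FASTA and that the chosen contig name does not already exist in the genome.

```python
import shutil


def append_contig(genome_fa, transgene_seq, contig_name, out_fa, width=60):
    """Copy genome_fa to out_fa and append transgene_seq as one extra record."""
    shutil.copyfile(genome_fa, out_fa)

    # Check whether the copied file already ends with a newline, so the new
    # header is guaranteed to start on its own line.
    needs_newline = False
    with open(out_fa, "rb") as fh:
        fh.seek(0, 2)            # jump to the end of the file
        if fh.tell() > 0:
            fh.seek(-1, 2)       # step back one byte
            needs_newline = fh.read(1) != b"\n"

    seq = "".join(transgene_seq.split())  # drop any whitespace/newlines
    with open(out_fa, "a") as out:
        if needs_newline:
            out.write("\n")
        out.write(f">{contig_name}\n")
        for i in range(0, len(seq), width):
            out.write(seq[i:i + width] + "\n")
```

A call might look like `append_contig("mouse.fa", seq, "transgene", "mouse_plus_transgene.fa")`. The aligner index then has to be rebuilt from the new FASTA, and the annotation needs a matching entry whose first column is the same contig name.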

written 15 months ago by swbarnes2

The concern that I have with this approach is that because the sequences are conserved to a large degree (the gene is the ortholog from a different organism), couldn't that result in some reads aligning (with imperfect alignment) to the original gene, and all reads being split between the original gene and the transgene? Wouldn't that result in unreliable quantification?

written 15 months ago by viktorfeketa

where a single gene's coding sequence is replaced by the gene sequence from another organism

If that is the case (as stated in the original question, and assuming there are no other copies of the gene elsewhere), why should there be any worry about the following?

couldn't that result in some reads aligning (with imperfect alignment) to the original gene, and all reads being split between the original gene and the transgene?

written 15 months ago by genomax

In this case, the read counts for the transgene will include only a fraction of the reads that actually came from the transgene (the rest aligning to the original gene), so its expression will be underestimated. But now that I think about it, I could just add up the read counts for the transgene and the original gene, knowing that in reality the original gene sequence is not present in the genome of the transgenic animal, so all these reads should theoretically come from the transgene. Do you think adding the counts would be accurate?
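The count-adding idea can be sketched in a few lines of Python. This assumes the quantifier's output has already been loaded into a dictionary of gene ID to read count; the gene IDs and numbers are invented for illustration, and the merge is only meaningful if the endogenous sequence is truly absent from the animal.

```python
def merge_counts(counts, source_id, target_id):
    """Return a copy of counts with source_id's reads added to target_id."""
    if source_id == target_id:
        return dict(counts)
    merged = dict(counts)                 # leave the input untouched
    moved = merged.pop(source_id, 0)      # reads assigned to the original gene
    merged[target_id] = merged.get(target_id, 0) + moved
    return merged


# Hypothetical counts: reads split between the mouse gene and its ortholog.
counts = {"GeneX_mouse": 40, "GeneX_transgene": 160, "Actb": 1000}
merged = merge_counts(counts, "GeneX_mouse", "GeneX_transgene")
```

Keeping the two counts separate until this last step also shows how many reads were drawn to the original locus, which is a useful sanity check in itself.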

written 15 months ago by viktorfeketa

knowing that in reality the original gene sequence is not present in the genome of transgenic animal, and all these reads should theoretically come from the transgene.

If the replacement happened cleanly (was it done in egg/sperm? I am not sure whether your data is from single cells or many), then your counts should come only from the transgene.

If the replacement happened in only some of the copies/cells, then, as @swbarnes2 suggested above, you will need to create a hybrid genome. It would be tricky to assign counts, especially if the transgene is largely identical to the original gene.

modified 15 months ago • written 15 months ago by genomax

I have a similar issue to the one presented above, with the exception that no gene was replaced. Instead, a human gene with no mouse ortholog was inserted together with GFP under the same promoter.

How do I append a new chromosome to the genome to quantify expression?

Thanks!

written 6 months ago by Agustin Gonzalez-Vicente

Add the sequence of the gene at the end of the mouse genome as a new entry. Reindex the modified genome and then align to the new indexes.
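For the annotation side, a counting tool generally also needs entries for the new sequence. Below is a minimal Python sketch that builds GTF lines for a contig treated as one single-exon gene spanning its full length. The function name `single_exon_gtf`, the contig names, the gene IDs and the lengths are all placeholders, and it assumes the appended sequences are intron-less (for example cDNA or GFP coding sequence).

```python
def single_exon_gtf(contig, length, gene_id, strand="+", source="custom"):
    """Build gene/transcript/exon GTF lines covering positions 1..length."""
    transcript_id = f"{gene_id}.t1"
    gene_attrs = f'gene_id "{gene_id}"; gene_name "{gene_id}";'
    tx_attrs = (f'gene_id "{gene_id}"; transcript_id "{transcript_id}"; '
                f'gene_name "{gene_id}";')
    rows = []
    for feature, attrs in (("gene", gene_attrs),
                           ("transcript", tx_attrs),
                           ("exon", tx_attrs)):
        # GTF columns: seqname, source, feature, start, end,
        # score, strand, frame, attributes
        rows.append("\t".join([contig, source, feature, "1", str(length),
                               ".", strand, ".", attrs]))
    return rows


# Placeholder names and lengths for the two appended sequences.
new_lines = (single_exon_gtf("hGENE", 1500, "hGENE")
             + single_exon_gtf("GFP", 720, "GFP"))
```

These lines would be appended to the mouse GTF; the contig names in the first column must match the FASTA headers of the appended sequences exactly.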

Is the location of the insertion known?

written 6 months ago by genomax

Thanks, do you know of any detailed tutorial to do that?

I am working with RNAseq data from GEO, and there is no info about the location of the insertion.

written 6 months ago by Agustin Gonzalez-Vicente