Creating a FASTQ generator: how to handle the 3' ends of transcripts
7.9 years ago

I am currently investigating the different splice forms in an experimental sample. To get a better understanding of how the various splice-form-finding tools work, I wrote a program that generates fake FASTQ data.

Here is how it works:

  1. Read in GTF file
  2. Select transcript of interest from the GTF file.
  3. Generate random numbers for the start position of the read. So if my random number is 54, my read will start at position 54.

Step 3 is where I get into trouble. I'm not sure how to handle the end of the transcript. For example, say that I want 100-base reads in my FASTQ file and the transcript of interest is 2000 bases long. If I draw a start position between 1 and 1901, I am fine, because the read fits inside the transcript. However, if I draw a position between 1902 and 2000, say 1950, I get into trouble: only 51 bases of transcript remain, and I don't know what to make the remaining 49 bases of the read.
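A minimal Python sketch of step 3 and the boundary case, assuming 1-based start positions and a transcript sequence that has already been extracted into a string (GTF parsing is not shown). The helper names `read_at` and `last_full_start` are hypothetical, and the random transcript is only a stand-in for real sequence.

```python
import random


def read_at(transcript_seq, start, read_len):
    """Return (bases, missing) for a read beginning at 1-based position `start`.

    `missing` is the number of bases the read still needs after the
    transcript runs out (0 when the read fits entirely).
    """
    bases = transcript_seq[start - 1:start - 1 + read_len]
    return bases, read_len - len(bases)


def last_full_start(transcript_len, read_len):
    """Largest 1-based start position at which a full-length read still fits."""
    return transcript_len - read_len + 1


# Stand-in transcript: 2000 random bases (assumption, not real data).
rng = random.Random(0)
transcript = "".join(rng.choice("ACGT") for _ in range(2000))

# Step 3: draw a random start position anywhere in the transcript.
start = rng.randint(1, len(transcript))
bases, missing = read_at(transcript, start, 100)
print(f"start={start} transcript_bases={len(bases)} missing={missing}")
```

Any read with `missing > 0` is one that runs off the 3' end and needs a decision about how to fill the gap.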

A couple of potential solutions I thought of:

  1. Randomly add sequences to the 3' end
  2. Pretend that I read into the Illumina (or similar) adapter.
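Option 2 could be sketched roughly as follows. This is an illustration under assumptions, not a model of any particular library prep: the `ADAPTER` string is a commonly used Illumina TruSeq adapter prefix and should be swapped for whatever adapter the simulated protocol uses, the `N` padding past the end of the adapter is an arbitrary choice, and the constant quality character `I` (Phred 40 in Phred+33 encoding) is a placeholder. The function names are hypothetical.

```python
# Assumed adapter sequence; replace with the one matching the simulated kit.
ADAPTER = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"


def read_into_adapter(transcript_seq, start, read_len, adapter=ADAPTER, pad="N"):
    """Build a read at 1-based `start`; fill any 3' overhang with adapter bases.

    If the overhang is longer than the adapter, the rest is padded with
    `pad` (an arbitrary choice made for this sketch).
    """
    bases = transcript_seq[start - 1:start - 1 + read_len]
    missing = read_len - len(bases)
    filler = adapter[:missing]
    filler += pad * (missing - len(filler))
    return bases + filler


def fastq_record(name, seq, qual_char="I"):
    """Format one four-line FASTQ record with a constant placeholder quality."""
    return f"@{name}\n{seq}\n+\n{qual_char * len(seq)}\n"


# Tiny example: a 20-base transcript and a 10-base read starting at position 16.
transcript = "ACGT" * 5
read = read_into_adapter(transcript, 16, 10)
print(fastq_record("sim_read_1", read), end="")
```

Reads built this way can be passed through an adapter trimmer afterwards, which also gives a way to check that the downstream tools cope with them.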

What happens experimentally in this situation? Is there a bias against the ends of transcripts when doing size selection in RNA-Seq?

RNA-Seq transcript • 1.7k views

Actually, the bias is towards the 3' end if one is doing poly-A selection. I don't think #1 is a good idea, since it would not be biologically relevant. You could look into including the 3' UTRs (or are you already taking those?) and/or doing #2.

You could also look at published datasets where the truth is known (to some extent). Someone here may be able to provide a good example.
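If the generator should mimic the 3' bias mentioned above, one purely illustrative option is to weight the start positions toward the 3' end instead of drawing them uniformly. The power-law weighting and the `strength` parameter below are assumptions made for this sketch, not a measured bias profile; a realistic profile would have to be estimated from real data.

```python
import random


def biased_start(transcript_len, rng, strength=2.0):
    """Draw a 1-based start position, favouring positions near the 3' end.

    Weights grow as (position / length) ** strength; strength=0 gives a
    uniform draw. The functional form is an assumption for illustration.
    """
    positions = range(1, transcript_len + 1)
    weights = [(pos / transcript_len) ** strength for pos in positions]
    return rng.choices(positions, weights=weights, k=1)[0]


rng = random.Random(1)
starts = [biased_start(2000, rng) for _ in range(2000)]
print("mean start position:", sum(starts) / len(starts))
```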
