I am currently investigating the different spliceforms in an experimental sample. To get a better understanding of how the various spliceform-finding tools work, I wrote a program that generates fake FASTQ data.
Here is how it works:
1. Read in the GTF file.
2. Select the transcript of interest from the GTF file.
3. Generate random numbers for the start position of each read. So if my random number is 54, my read will start at position 54.
Step 3 is where I get into trouble. I'm not sure how to handle the end of the transcript. For example, say that I want 100-base reads in my FASTQ file and the transcript of interest is 2000 bases long. If I draw a start position between 1 and 1901, I am fine. However, if I draw one between 1902 and 2000, say 1950, I run into trouble: only 51 bases of transcript remain (1950 through 2000), and I don't know what to make the other 49 bases of the read.
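To make the boundary problem concrete, here is a minimal Python sketch. The helper name `read_at` and the randomly generated toy transcript are made up for illustration; this is a simplification, not my actual program. It assumes 1-based start positions and simply truncates the read at the 3' end of the transcript:

```python
import random

def read_at(transcript, start, read_len):
    """Return (fragment, missing) for a read starting at 1-based `start`.

    `fragment` is whatever part of the read fits inside the transcript;
    `missing` is how many bases the read would run past the 3' end.
    """
    begin = start - 1                          # convert to 0-based index
    fragment = transcript[begin:begin + read_len]  # slicing truncates at the end
    missing = read_len - len(fragment)
    return fragment, missing

rng = random.Random(0)  # fixed seed so the toy data is reproducible
transcript = "".join(rng.choice("ACGT") for _ in range(2000))  # toy 2000-base transcript

# Draw a few random start positions and report which reads overhang the end.
for _ in range(5):
    start = rng.randint(1, len(transcript))    # inclusive on both ends
    fragment, missing = read_at(transcript, start, 100)
    print(start, len(fragment), missing)
```

Any read with `missing > 0` is one where I have to decide what to put in the remaining bases.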
A couple of potential solutions I thought of:
- Randomly add sequences to the 3' end
- Pretend that I read into the Illumina (or similar) adapter.
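Roughly, the two options would look like the sketch below. The function names are hypothetical, and the `ADAPTER` string is only an assumed placeholder (a short prefix often used for Illumina adapter trimming); the real sequence depends on the library prep kit. Filling with `N` once the adapter runs out is also my own simplification, not a claim about what a sequencer produces:

```python
import random

# Assumed placeholder adapter prefix; substitute the sequence for your kit.
ADAPTER = "AGATCGGAAGAGC"

def pad_random(fragment, read_len, rng):
    """Option 1: fill the missing 3' bases with uniformly random nucleotides."""
    missing = max(0, read_len - len(fragment))
    return fragment + "".join(rng.choice("ACGT") for _ in range(missing))

def pad_adapter(fragment, read_len, adapter=ADAPTER):
    """Option 2: pretend the read ran into the adapter.

    If the gap is longer than the adapter, the rest is filled with N
    (a simplification chosen for this sketch).
    """
    missing = max(0, read_len - len(fragment))
    return fragment + (adapter + "N" * missing)[:missing]

rng = random.Random(0)
fragment = "ACGT" * 12 + "ACG"   # toy 51-base fragment from the transcript end

read_random = pad_random(fragment, 100, rng)
read_adapter = pad_adapter(fragment, 100)
print(read_random)
print(read_adapter)
```

Both functions return a read of the requested length that begins with the real transcript bases; they differ only in what follows.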
What happens experimentally in this situation? Is there a bias against the ends of transcripts when doing size selection in RNA-Seq?