The following is a very loud brainstorming for a beginner
A custom library prepration protocol, which uses 3' poly T tail primer and 5' random primer to enrich for 3' ends. and then eventually adds the illumina adaptersat the the end...and performs a Illumina HTSEq single end sequencing... The rationale is that the group is not interested in splice variants but merely in quantification of the expression for differential analysis. (organism: mouse)
I was wondering what would be the pipeline that one would use... in this case!
The reads are of length 95 and I trim the barcodes and adapters away and a fine stretch of 35 bases usually when mapped using bowtie yields 70% total reads aligned to transcriptome.
Bowtie Alignment: bowtie -v 3 --best -t ~/RNAseq/RefLibIndex/mouse -p 10 -f <inputfile.fasta> <output.map>
My output file contains entries in the following format with refseqID.
After I extract the raw read counts thru a custome script from the bowtie output file and I map the refseqID to genesymbol and extract the average/max for each genesymbol as raw counts and perform DESeq differential analysis.
I would like to know:
- Is this approach valid, do u see any loop holes here?
- To visualise this data in a genome browser to check for 3' enrichment. However, I see that there is no chromosome or index information in the file. I tried to get some information Visualizing Bowtie Output In Genome Browser. However, i couldnt succeed at that... any suggestions will be great... something like the following for visualization.. This one uses IGV
- Is there any reason why only 70% is mapped to transcriptome, if so, do u think it is normal for this approach, do you think I should also try to align the sequences to genome?
Thanks!
Thanks!!
I have used the refseq sequences to build the transcriptome.... and as much as I understand refseq sequences are basically individual transcripts/isoforms (Correct me if I am wrong).. so this is taken care isnt it...? The reason why I compute average/max of each gene symbol is to derive a better representation of each gene from the data. I shall now try to map unmapped sequences to genome for sure.. thanks for that!
for two can u suggest a little more in detail how can I use the refseq sequences as reference? I am a bit lost here...
arent these just many fasta sequences....This would make make too many references sequences right?? (number of reference sequence = number of refseq Sequences)... unlike genome which is a single long sequence (basically number of sequence here = number of chromosomes, I would assume)...
You can use tophat for alignment (http://tophat.cbcb.umd.edu/) Also take a look at Cufflinks for quantification (http://cufflinks.cbcb.umd.edu/)
Hmm..yes, I think you did the right thing. Perhaps another thing to try is to check whether your reads contain adapter sequences (which may affect your mapping percentage since you're doing an end-to-end mapping and also to play around with the
-v
value.In IGV, for example, I think you need to create a new
.genome
file (Genomes -> Create .genome file) out of your reference transcriptome. The same is possible in UCSC genome browser, although you need a local install and setup (which is considerably more complex). There will indeed be many reference sequences, but since your SAM records are indexed by these transcripts, the genome browser needs a reference build of these transcripts too.