I'm new to bioinformatics and I've been tasked with analysing RNA-seq data. Essentially, I have a large number of Illumina paired-end reads from different samples that I want to map to a reference and then use for differential expression analysis.
Since the species in question is a non-model species, there is no sequenced, annotated genome available. I also cannot assemble a transcriptome de novo.
Instead, I have a set of several thousand 'contigs' assembled from thousands of 'expressed sequence tags' (ESTs). These contigs have been annotated insofar as each one has been assigned a (probable) biological function. For example:
    Contig ID : Contig0001
    Contig Description : rrm-containing protein
    Contig Length : 1235
    1:ATACAGCTTGGAAAATTAAATACCTTTGCACTCTGTCTGATCACCTTCACAGCCTTGCTT
    61:AGATTCCTTTTCTCTTCTCTATTTCTTCTTCTTCTTTTTTTGACATGGAAGAGGAAGAGC
    [...]
    1201:TTGTAGACATCCATTTTGTATACTCGGAATTTCTA

    Contig ID : Contig0002
    Contig Description : chaperone protein dnaj 16-like isoform x1
    Contig Length : 948
    1:TTTCTATTTTGCCTTCGATTAATTTTCATCTTTCAATAAGTTTTACTTTAATTTCTTCCG
    61:ACTTCATTTTATTCACGCAACATTTCACCATTGAGTTTGCCAACTGAAGAGAACCGTAGC
    [...]
    901: CAGACAGCTTCGATGACAGTCAATAAGGATCCAGATGCGACATTTTTC
1) What exactly could these contigs represent? I initially thought they were complete transcripts, but I'm now thinking they only form part of transcripts?
2) How would I go about mapping my reads to these contigs? Is that even possible? The file doesn't appear to be in FASTA format. Would a more suitable format look something like this?
    >Contig0001|rrm-containing_protein|1235
    ATACAGCTTGGAAAATTAAATACCTTTGCACTCTGTCTGATCACCTTCACAGCCTTGCTT
    AGATTCCTTTTCTCTTCTCTATTTCTTCTTCTTCTTTTTTTGACATGGAAGAGGAAGAGC
    [...]
    TTGTAGACATCCATTTTGTATACTCGGAATTTCTA
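A conversion along those lines could probably be sketched in Python as below. This is only a rough sketch, and it assumes the real file has the same layout as the excerpt above: a `Contig ID`, `Contig Description` and `Contig Length` line for each contig, followed by sequence lines prefixed with a position and a colon. The function names `parse_contigs` and `to_fasta` are made up for illustration. The header here puts the description after a space instead of joining the fields with `|`, because many tools treat everything up to the first whitespace as the sequence name.

```python
import re

# Matches a sequence line such as "61:AGATTC..." or "901: CAGACA..."
# (assumption: a position, a colon, optional spaces, then letters only).
SEQ_LINE = re.compile(r"^\d+:\s*([A-Za-z]+)$")


def parse_contigs(text):
    """Yield (contig_id, description, sequence) tuples from the contig listing."""
    contig_id, description, chunks = None, "", []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Contig ID"):
            # A new record starts: emit the previous one, if any.
            if contig_id is not None:
                yield contig_id, description, "".join(chunks)
            contig_id = line.split(":", 1)[1].strip()
            description, chunks = "", []
        elif line.startswith("Contig Description"):
            description = line.split(":", 1)[1].strip()
        elif line.startswith("Contig Length"):
            continue  # the length can be recomputed from the sequence
        else:
            match = SEQ_LINE.match(line)
            if match and contig_id is not None:
                chunks.append(match.group(1).upper())
            # Any other line (blank lines, stray text) is silently skipped.
    if contig_id is not None:
        yield contig_id, description, "".join(chunks)


def to_fasta(records, width=60):
    """Format records as FASTA text, wrapping sequences at `width` characters."""
    lines = []
    for contig_id, description, seq in records:
        lines.append(f">{contig_id} {description}".rstrip())
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"
```

With the whole file read into a string, `to_fasta(parse_contigs(text))` would return the FASTA text to write out. Comparing each sequence length against the stated `Contig Length` would be a sensible sanity check before mapping.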
3) If I did manage to map my reads and perform quantification - what exactly am I quantifying? Would it be transcript-level quantification or gene-level quantification?
Any help would be greatly appreciated! Let me know if you need more info.