I am interested in identifying the viromes of wheat samples infected with viruses, but I am not sure which pipeline to use. I have sRNA data from Illumina and I am following these steps:
- Quality check of the reads
a. Raw reads -> trim adapters and filter reads (FastQC, cutadapt and Trimmomatic)
- Mapping on the host genome to find host-specific reads
a. Building the indexes from the whole wheat genome (bowtie2, GMAP); I am getting an error here due to the size of the genome
b. Mapping of reads to the reference genome (TopHat, SAMtools)
*. Would it be better to align them to RNA sequences from wheat instead of the whole genome?
- De novo assembly of the unmapped reads (Velvet, k-mer = 17)
- Mapping of contigs to the reference genome from step 2 (bowtie2, TopHat, SAMtools)
- BLASTN of the unmapped contigs against virus databases in NCBI/GenBank
- BLASTX against a virus protein database.
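
To make the order of the steps above concrete, here is a minimal Python sketch that only builds the command lines, without running anything. Everything not named in the list is an assumption for illustration: the function name `build_virome_commands`, all file names, the index name `wheat_idx`, the BLAST database names `viral_nt` and `viral_protein`, the adapter sequence, the 18-30 nt size window and the e-value cutoff. The sketch uses bowtie2 alone for mapping, which is a simplification. `--large-index` is a real bowtie2-build option that requests a 64-bit index, but whether it resolves the genome-size error depends on what the error actually is (for example, it will not help if memory is the limit).

```python
import shlex


def build_virome_commands(sample, adapter, host_fasta="wheat_genome.fa",
                          host_index="wheat_idx", kmer=17, threads=4):
    """Return the pipeline steps as argv lists; nothing is executed here.

    All file, index and database names are hypothetical placeholders.
    """
    trimmed = f"{sample}.trimmed.fastq"
    non_host_reads = f"{sample}.non_host.fastq"
    asm_dir = f"{sample}_velvet"
    contigs = f"{asm_dir}/contigs.fa"  # velvetg writes contigs.fa in its directory
    non_host_contigs = f"{sample}.non_host_contigs.fa"
    return [
        # Step 1: adapter trimming; the 18-30 nt window is an assumed sRNA range.
        ["cutadapt", "-a", adapter, "-m", "18", "-M", "30",
         "-o", trimmed, f"{sample}.fastq"],
        # Step 2a: host index; --large-index requests a 64-bit index.
        ["bowtie2-build", "--large-index", "--threads", str(threads),
         host_fasta, host_index],
        # Step 2b: map reads to the host; --un keeps the reads that did not align.
        ["bowtie2", "-p", str(threads), "-x", host_index, "-U", trimmed,
         "--un", non_host_reads, "-S", f"{sample}.host.sam"],
        # Step 3: de novo assembly of the non-host reads with Velvet.
        ["velveth", asm_dir, str(kmer), "-fastq", "-short", non_host_reads],
        ["velvetg", asm_dir],
        # Step 4: map contigs (FASTA input, -f) back to the host, keep unmapped ones.
        ["bowtie2", "-p", str(threads), "-f", "-x", host_index, "-U", contigs,
         "--un", non_host_contigs, "-S", f"{sample}.contigs.host.sam"],
        # Steps 5 and 6: search the remaining contigs against viral databases.
        ["blastn", "-query", non_host_contigs, "-db", "viral_nt",
         "-evalue", "1e-5", "-outfmt", "6", "-out", f"{sample}.blastn.tsv"],
        ["blastx", "-query", non_host_contigs, "-db", "viral_protein",
         "-evalue", "1e-5", "-outfmt", "6", "-out", f"{sample}.blastx.tsv"],
    ]


if __name__ == "__main__":
    # The adapter below is an assumed small RNA 3' adapter; check your library kit.
    for cmd in build_virome_commands("sample1", "TGGAATTCTCGGGTGCCAAGG"):
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to check each step by hand before passing the lists to something like `subprocess.run(cmd, check=True)`.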