I am working on Hiseq2000/2500 single end reads on RNASeq leukemia samples. I am interested in aligning all the reads beloging to the Immunoglobulin genes (Ig) for further analysis. The task is difficult for two main reasons:
- Final Ig genes are the result of a "collage" of genomic regions and alignment is difficult;
- Ig are hypermutated, meaning that there are a lot of mutations interfering with the alignment;
I decided to develop my own pipeline since those available are inappropriate. I post a brief resume of the workflow:
- Star alignment agains hg38;
- Extraction of unmapped reads;
- Extraction of reads mapping to the Ig locus;
- De novo construction of the Ig with Trinity;
- Alignment of de novo sequences with IgBlast;
- Select the "correct" Ig;
- Align sequences against the custom Ig reference with bowtie2 (--very-sensitive-local).
I've analyzed around 100 samples, 70% of those align correctly displaying a uniform coverage. The 30% of samples show a non-uniform coverage profile (See Fig. attached). As you can see, there is the common 3' bias with the coverage increasing at the end of the Ig. However, there is a huge drop in the middle of the Ig and I cannot understand whether is a kind of artifact lacking those reads or is the alignment parameters wrong.
To resume, I cannot understand how is possible to observe different coverage profiles for samples belonging to the very same experiment. Any suggestion will be helpful