A customer with twelve RNASEQ mouse samples sequenced on an Illumina HiSeq 2000 wanted to compare the expression numbers for a single gene. This is outside my realm of expertise at the moment, but after some research I came up with a strategy. I realize the following commands aren't the most efficient way to do things, but I erred on the side of caution:
I downloaded a reference transcriptome from UCSC for mm9 and indexed it with bwa and samtools.
I mapped each sample to the reference transcriptome using BWA:
bwa aln -t 10 [path to transcriptome fasta] [fastq.gz filepath]
I converted the sai to BAM:
bwa samse [path to transcriptome fasta] [path to sai file] [path to fastq] | samtools view -Shub -o [bam file name] -
samtools sort [path to bam file] [sorted bam prefix]
I generated a VCF:
samtools mpileup -f [path to transcriptome fasta] [path to sorted bam] -u | bcftools view - > [vcf output path]
When I open the VCF, the header looks like this:
##fileformat=VCFv4.1 ##samtoolsVersion=0.1.18 (r982:295) CHROM POS ID REF ALT QUAL FILTER INFO FORMAT [path to sorted bam]
And here are a few sample lines:
gi|126352347|ref|NM_028260.2| 108 . G X 0 . DP=27;I16=13,14,0,0,1040,40426,0,0,520,10400,0,0,314,4688,0,0 PL 0,81,199 gi|126352347|ref|NM_028260.2| 109 . G X 0 . DP=28;I16=13,15,0,0,1086,42418,0,0,540,10800,0,0,323,4795,0,0 PL 0,84,201 gi|126352347|ref|NM_028260.2| 110 . A X 0 . DP=29;I16=14,15,0,0,1110,42864,0,0,560,11200,0,0,333,4957,0,0 PL 0,87,201
I'm having trouble correlating this with the official VCF 4.1 specification. Does anyone have any perspective on what I'm looking at here, or did I do something wrong? Specifically, column 1, column 5, the I16 comma-separated list in column 8, and the values in the last two columns are what I'm having trouble understanding, based on the samtools and VCF documentation.