Question: How To Interpret Funky Vcf Output?
0
gravatar for Dan D
7.0 years ago by
Dan D6.8k
Tennessee
Dan D6.8k wrote:

A customer with twelve RNASEQ mouse samples sequenced on an Illumina HiSeq 2000 wanted to compare the expression numbers for a single gene. This is outside my realm of expertise at the moment, but after some research I came up with a strategy. I realize the following commands aren't the most efficient way to do things, but I erred on the side of caution:

I downloaded a reference transcriptome from UCSC for mm9 and indexed it with bwa and samtools.

I mapped each sample to the reference transcriptome using BWA:

bwa aln -t 10 [path to transcriptome fasta] [fastq.gz filepath]

I converted the sai to BAM:

bwa samse [path to transcriptome fasta] [path to sai file] [path to fastq] | samtools view -Shub -o [bam file name] -

I sorted the BAM with samtools:

samtools sort [path to bam file] [sorted bam prefix]

I generated a VCF:

samtools mpileup -f [path to transcriptome fasta] [path to sorted bam] -u | bcftools view - > [vcf output path]

When I open the VCF, the header looks like this:

    ##fileformat=VCFv4.1
    ##samtoolsVersion=0.1.18 (r982:295)
CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  [path to sorted bam]

And here are a few sample lines:

gi|126352347|ref|NM_028260.2|   108     .       G       X       0       .       DP=27;I16=13,14,0,0,1040,40426,0,0,520,10400,0,0,314,4688,0,0   PL      0,81,199

gi|126352347|ref|NM_028260.2|   109     .       G       X       0       .       DP=28;I16=13,15,0,0,1086,42418,0,0,540,10800,0,0,323,4795,0,0   PL      0,84,201

gi|126352347|ref|NM_028260.2|   110     .       A       X       0       .       DP=29;I16=14,15,0,0,1110,42864,0,0,560,11200,0,0,333,4957,0,0   PL      0,87,201

I'm having trouble correlating this with the official VCF 4.1 specification. Does anyone have any perspective on what I'm looking at here, or did I do something wrong? Specifically, column 1, column 5, the I16 comma-separated list in column 8, and the values in the last two columns are what I'm having trouble understanding, based on the samtools and VCF documentation.

ADD COMMENTlink modified 6.3 years ago by Biostar ♦♦ 20 • written 7.0 years ago by Dan D6.8k
3
gravatar for JC
7.0 years ago by
JC8.7k
Mexico
JC8.7k wrote:

First, because you are mapping your reads to the reference genome with BWA, you are missing all spliced reads, with long reads (>50b) this is really important. Second, after mapping your reads you don't need to generate a VCF file to infer the expression level, the VCF is for variants not expression.

Please, for RNAseq data consider to change your strategy, there are a lot of literature about how to analyze that, here in BioStar exists a section with Reviews, but in short I can suggest a TopHat + Cufflinks analysis that you can run on your machines or even in Galaxy.

ADD COMMENTlink modified 7.0 years ago • written 7.0 years ago by JC8.7k

JC is absolutely correct. Also, you may want to align to genomic sequence if you are variant calling.

ADD REPLYlink modified 7.0 years ago • written 7.0 years ago by Zev.Kronenberg11k

Thanks for the comments. One of the primary sources I drew from was this:

http://bsc2010.bioinformatics.ucdavis.edu/handson/rna_seq_statistics_workshop.html

But what you are saying definitely makes sense. I will try out some other approaches. Thanks!

ADD REPLYlink written 7.0 years ago by Dan D6.8k
1

Please forget my first point, I missed that your are mapping to the reference transcriptome, that's correct. After that you need to count how many reads mapped per transcripts and save that as matrix that can be used in R:Bioconductor edgeR or DESeq to normalize and identify DEGs. One tricky thing is the reads that map to more than one transcript, BWA by default select one random hit but TopHat will report more than one hit if they are in different locations in the genome, after that Cufflinks will resolve the proportions.

ADD REPLYlink modified 7.0 years ago • written 7.0 years ago by JC8.7k

Thanks so much for taking the time to follow up!

ADD REPLYlink written 7.0 years ago by Dan D6.8k
1
gravatar for Zev.Kronenberg
7.0 years ago by
United States
Zev.Kronenberg11k wrote:

you should also have the column headers:

CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  bamname

I haven't seen those 'Xs' in ALT.

You might want to try running varfilter before investigating your VCF files.

samtools-0.1.18/bcftools/vcfutils.pl varFilter
ADD COMMENTlink modified 7.0 years ago • written 7.0 years ago by Zev.Kronenberg11k

Thanks! I didn't include the headers at first, but I just added them in after reading your post.

I'll check out varFilter. Thanks for the tip!

ADD REPLYlink written 7.0 years ago by Dan D6.8k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 981 users visited in the last hour