Question: Snp Discovery With Rnaseq And Reference
gravatar for Leszek
9.4 years ago by
IIMCB, Poland
Leszek4.1k wrote:


I've data from several RNAseq experiments. I've mappend the reads onto reference and want to screen for SNPs with this. I'm able to catch them by visualising the alignment in IGV, but I cannot find them with samtools:

  samtools mpileup -ugf gem/$s.fa gem/${s}.bam | bcftools view -bvcg - > gem/${s}.raw.bcf
  bcftools view gem/${s}.raw.bcf | varFilter -D 10000 > gem/${s}.flt.vcf

I think it may be due to variable coverage and low depth for RNAseq. Could you recommend me some other tools that might be used with RNAseq reads?


I've wrote small script solving the problem. It incorporates only coverage and frequency of to call heterozygous sites. Hope someone else will benefit from this.


rna reference snp • 5.2k views
ADD COMMENTlink modified 9.2 years ago by lh332k • written 9.4 years ago by Leszek4.1k

I was trying your script, But got following error. I have no idea about python.

cat test_place/accepted_hits.mpileup | python -o test_place/het_test -f 0.2 -d 10

Traceback (most recent call last): File "", line 105, in <module> main( sys.argv[1:] ) File "", line 101, in main parse_mpileup( parameters["fpath"],parameters["out_base"],parameters["minDepth"],parameters["minFreq"] ) File "", line 53, in parse_mpileup contig,pos,base,cov,alg,quals=line.split('\t')#; contig,pos,base,cov,alg,quals ValueError: need more than 1 value to unpack

Could you help me ?

ADD REPLYlink modified 7.6 years ago • written 7.6 years ago by Gvj440

something is wrong with you mpileup file. each line should contain 6 columns separated by tabs. check that.

ADD REPLYlink written 7.6 years ago by Leszek4.1k

It has 6 columns; head testplace/acceptedhits.mpileup

contig_10 86 C 1 ^S. C

contig_10 87 A 2 G^SG C@

contig_10 88 C 2 .. CC

contig_10 89 C 2 .. FC

contig_10 90 A 2 .. FF

I have "><"values in 5th Column, What does it mean ? From the pileup format I can't find a explanation for this.

ADD REPLYlink modified 7.6 years ago • written 7.6 years ago by Gvj440

I know what may be the problem. The script is updated. Let me know if it works.

ADD REPLYlink written 7.6 years ago by Leszek4.1k

Thanks for your time. But unfortunately it doesn't work. This time every line is printed as an STD error and nothing inside output file.

Err: wrong line: contig10, 86, C, 1, ^S., C Err: wrong line: contig10, 87, A, 2, G^SG, C@ Err: wrong line: contig_10, 88, C, 2, .., CC

ADD REPLYlink written 7.6 years ago by Gvj440
gravatar for Michael Dondrup
9.4 years ago by
Bergen, Norway
Michael Dondrup47k wrote:

I don't know if that helps, it's more a wild guess, but did you try to lower the -D 10000 parameter. I don't fully understand what it does yet, but the documentation says:

The -D option of varFilter controls the maximum read depth, which should be adjusted to about twice the average read depth. One may consider to add -C50 to mpileup if mapping quality is overestimated for reads containing excessive mismatches. Applying this option usually helps BWA-short but may not other mappers.

This one sentence is the only documentation I found on this. But for your example that would mean you had an average genome base coverage of 5000x. That would really surprise me, so how high is your read depth in relation to the genome? Also, even if you had extremely high coverage locally, you are right with your assumption that the coverage is more variable in RNA-seq than in DNA sequencing. One solution might be to remove the largest coverage outliers (e.g. ribosomal RNA, other stable RNA genes) to calculate the average coverage to adjust the -D parameter. I'd try this.

ADD COMMENTlink written 9.4 years ago by Michael Dondrup47k

it's counter intuitive, but I'm getting less SNP if I decrease -D to 100 or 200 (genome coverage vary between 100 and 200x).

ADD REPLYlink written 9.4 years ago by Leszek4.1k
gravatar for Steve Lianoglou
9.4 years ago by
Steve Lianoglou5.1k
Steve Lianoglou5.1k wrote:

I'm not sure if it helps, as I haven't looked at the tools/pipeline/resources that are available from the methods described in this paper:

Haplotype and isoform specific expression estimation using multi-mapping RNA-seq reads

If you look at Figure 1, you can see that part of their pipeline (if you follow from Start(b)) involves calling genotypes from rna-seq data.

Maybe you can get a hold of their tools(?).

ADD COMMENTlink written 9.4 years ago by Steve Lianoglou5.1k
gravatar for lh3
9.4 years ago by
United States
lh332k wrote:

You may try to add "-A" to mpileup. If fail, you may implement a naive SNP caller based on the pileup output.

It would be more helpful if you have the link to a screenshot or something.

ADD COMMENTlink written 9.4 years ago by lh332k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 729 users visited in the last hour