Question

Finding SNPs in population data: help with pipeline

1

Entering edit mode

8.5 years ago

Alex ▴ 10

I am struggling to identify SNPs in population genomic data that I have. So far my pipeline follows:

Download fastq files from EBI-ENA
Trim reads to match quality threshold of 20
Map reads to reference sequence (BWA aln/sampe)
Convert SAM to pileup (view -q 20, sort, mpileup)

Ultimately, I want to be able to query the pileup file and retrieve SNP information at each base eg:

chr pos        ref  cov A   T   C   G   N
3L  12798928    C   34  0   0   29  5   0

As I'm very new to bioinformatics and processing this kind of data I don't know exactly where I've gone wrong or if there are better/more efficient ways to complete this.

Problems I have struck so far:

All reference bases in my pileup file are N
How can I convert pileup entries to something similar to above? Can this be done with the BAM files as they're much smaller?

Any advice is greatly appreciated, I am reading/referring to the Biostar handbook but this seems like quite a niche problem.

SNP population pileup genome • 2.7k views

ADD COMMENT • link updated 7.9 years ago by Pierre Lindenbaum 166k • written 8.5 years ago by Alex ▴ 10

0

Entering edit mode

Did you specify the reference fasta genome (option -f in samtools mpileup)?
You can parse the output of samtools mpileup, it shouldn't be too difficult: https://en.wikipedia.org/wiki/Pileup_format Alternatively, if you are using python you can use the pysam module, it is very nice and easy to use, and does all the parsing for you. More information here: http://pysam.readthedocs.io/en/latest/

Regarding the huge size of pileup files, it would be better if you don't save all that to a file bu just connect the output to your script with a pipe (or use pysam).

ADD REPLY • link 8.5 years ago by abascalfederico ★ 1.2k

score 0 · Answer 1 · 2017-12-01

Ultimately, I want to be able to query the pileup file and retrieve SNP information at each base

I've writtern FindAllCoverageAtPosition http://lindenb.github.io/jvarkit/FindAllCoverageAtPosition.html (see also How to generate alignment report from bam files )

it will output the bases at each position:

$ find ./testdata/ -type f -name "*.bam" | \
 java -jar dist/findallcoverageatposition.jar -p rotavirus:100
#File              CHROM      POS  SAMPLE  DEPTH  M    I  D  N  S   H  P  EQ  X  Base(A)  Base(C)  Base(G)  Base(T)  Base(N)  Base(^)  Base(-)
./testdata/S4.bam  rotavirus  100  S4      126    126  0  0  0  29  0  0  0   0  5        0        0        121      0        0        0
./testdata/S1.bam  rotavirus  100  S1      317    317  1  0  0  50  0  0  0   0  27       0        1        289      0        1        0
./testdata/S2.bam  rotavirus  100  S2      311    311  0  1  0  60  0  0  0   0  29       1        0        281      0        0        1
./testdata/S3.bam  rotavirus  100  S3      446    446  1  0  0  86  0  0  0   0  39       0        1        406      0        1        0