Question: Finding SNPs in population data: help with pipeline
gravatar for Alex
3.2 years ago by
Alex0 wrote:

I am struggling to identify SNPs in population genomic data that I have. So far my pipeline follows:

  1. Download fastq files from EBI-ENA
  2. Trim reads to match quality threshold of 20
  3. Map reads to reference sequence (BWA aln/sampe)
  4. Convert SAM to pileup (view -q 20, sort, mpileup)

Ultimately, I want to be able to query the pileup file and retrieve SNP information at each base eg:

chr pos        ref  cov A   T   C   G   N
3L  12798928    C   34  0   0   29  5   0

As I'm very new to bioinformatics and processing this kind of data I don't know exactly where I've gone wrong or if there are better/more efficient ways to complete this.

Problems I have struck so far:

  1. All reference bases in my pileup file are N
  2. How can I convert pileup entries to something similar to above? Can this be done with the BAM files as they're much smaller?

Any advice is greatly appreciated, I am reading/referring to the Biostar handbook but this seems like quite a niche problem.

pileup snp population genome • 1.2k views
ADD COMMENTlink modified 2.6 years ago by Pierre Lindenbaum129k • written 3.2 years ago by Alex0
  1. Did you specify the reference fasta genome (option -f in samtools mpileup)?
  2. You can parse the output of samtools mpileup, it shouldn't be too difficult: Alternatively, if you are using python you can use the pysam module, it is very nice and easy to use, and does all the parsing for you. More information here:

Regarding the huge size of pileup files, it would be better if you don't save all that to a file bu just connect the output to your script with a pipe (or use pysam).

ADD REPLYlink written 3.2 years ago by abascalfederico1.1k
gravatar for Pierre Lindenbaum
2.6 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum129k wrote:

Ultimately, I want to be able to query the pileup file and retrieve SNP information at each base

I've writtern FindAllCoverageAtPosition (see also How to generate alignment report from bam files )

it will output the bases at each position:

$ find ./testdata/ -type f -name "*.bam" | \
 java -jar dist/findallcoverageatposition.jar -p rotavirus:100
#File              CHROM      POS  SAMPLE  DEPTH  M    I  D  N  S   H  P  EQ  X  Base(A)  Base(C)  Base(G)  Base(T)  Base(N)  Base(^)  Base(-)
./testdata/S4.bam  rotavirus  100  S4      126    126  0  0  0  29  0  0  0   0  5        0        0        121      0        0        0
./testdata/S1.bam  rotavirus  100  S1      317    317  1  0  0  50  0  0  0   0  27       0        1        289      0        1        0
./testdata/S2.bam  rotavirus  100  S2      311    311  0  1  0  60  0  0  0   0  29       1        0        281      0        0        1
./testdata/S3.bam  rotavirus  100  S3      446    446  1  0  0  86  0  0  0   0  39       0        1        406      0        1        0
ADD COMMENTlink written 2.6 years ago by Pierre Lindenbaum129k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1017 users visited in the last hour