Question: Finding SNPs in population data: help with pipeline
0
gravatar for Alex
12 months ago by
Alex0
Alex0 wrote:

I am struggling to identify SNPs in population genomic data that I have. So far my pipeline follows:

  1. Download fastq files from EBI-ENA
  2. Trim reads to match quality threshold of 20
  3. Map reads to reference sequence (BWA aln/sampe)
  4. Convert SAM to pileup (view -q 20, sort, mpileup)

Ultimately, I want to be able to query the pileup file and retrieve SNP information at each base eg:

chr pos        ref  cov A   T   C   G   N
3L  12798928    C   34  0   0   29  5   0

As I'm very new to bioinformatics and processing this kind of data I don't know exactly where I've gone wrong or if there are better/more efficient ways to complete this.

Problems I have struck so far:

  1. All reference bases in my pileup file are N
  2. How can I convert pileup entries to something similar to above? Can this be done with the BAM files as they're much smaller?

Any advice is greatly appreciated, I am reading/referring to the Biostar handbook but this seems like quite a niche problem.

pileup snp population genome • 591 views
ADD COMMENTlink modified 5 months ago by Pierre Lindenbaum107k • written 12 months ago by Alex0
  1. Did you specify the reference fasta genome (option -f in samtools mpileup)?
  2. You can parse the output of samtools mpileup, it shouldn't be too difficult: https://en.wikipedia.org/wiki/Pileup_format Alternatively, if you are using python you can use the pysam module, it is very nice and easy to use, and does all the parsing for you. More information here: http://pysam.readthedocs.io/en/latest/

Regarding the huge size of pileup files, it would be better if you don't save all that to a file bu just connect the output to your script with a pipe (or use pysam).

ADD REPLYlink written 12 months ago by abascalfederico1.0k
0
gravatar for Pierre Lindenbaum
5 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum107k wrote:

Ultimately, I want to be able to query the pileup file and retrieve SNP information at each base

I've writtern FindAllCoverageAtPosition http://lindenb.github.io/jvarkit/FindAllCoverageAtPosition.html (see also How to generate alignment report from bam files )

it will output the bases at each position:

$ find ./testdata/ -type f -name "*.bam" | \
 java -jar dist/findallcoverageatposition.jar -p rotavirus:100
#File              CHROM      POS  SAMPLE  DEPTH  M    I  D  N  S   H  P  EQ  X  Base(A)  Base(C)  Base(G)  Base(T)  Base(N)  Base(^)  Base(-)
./testdata/S4.bam  rotavirus  100  S4      126    126  0  0  0  29  0  0  0   0  5        0        0        121      0        0        0
./testdata/S1.bam  rotavirus  100  S1      317    317  1  0  0  50  0  0  0   0  27       0        1        289      0        1        0
./testdata/S2.bam  rotavirus  100  S2      311    311  0  1  0  60  0  0  0   0  29       1        0        281      0        0        1
./testdata/S3.bam  rotavirus  100  S3      446    446  1  0  0  86  0  0  0   0  39       0        1        406      0        1        0
ADD COMMENTlink written 5 months ago by Pierre Lindenbaum107k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 839 users visited in the last hour