I want to reproduce the results that people achieved in the following Nature paper: Transcriptome genetics using second generation sequencing in a Caucasian population http://www.nature.com/nature/journal/vaop/ncurrent/full/nature08903.html
I downloaded a reference fasta and fai file from: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/pilot_data/technical/reference/
The main problems seem to exist that I'm not able to convert these SAM files into proper "working" BAM files so that I can get BED files that is the input format for FluxCapacitor (http://flux.sammeth.net/). I tried using the following steps (as there is no "proper" header in the SAM files I've to do some additional steps):
- samtools view -bt human_b36_male.fa.gz.fai first.sam> first.bam
- samtools sort first.bam first.bam.sorted
- samtools index first.bam.sorted
- samtools index aln-sorted.bam
When I then run the following command to get BED files I end up with an empty bed file and errors:
bamToBed -i reads.bam > reads.bed *this is using BEDtools Gives the following error:
terminate called after throwing an instance of 'std::out_of_range' what(): vector::_M_range_check
samtools pileup -cf human_b36_male.fa reads.bam
To make sure the downloaded SAM files are correctly I tried to load the original SAM files into IGV I don't see any reads aligning anywhere to the genome. I already reconstructed another genome set using the fasta files that are closest to the mentioned Ensembl release (54) they mention in the paper (fasta file from the 1000 Genomes FTP again - but maybe this is not correct, though cannot find any better suiting Fasta files that might represent the correct Ensembl release/genome build) but this doesn't help as still none of the reads seem to align to the reference genome (or at least no reads show up in the viewer).
Anyone with some tips about the issues I've generating BED files from the published SAM files. Also any comments about why the IGV isn't showing any reads might be helpful for me understanding what's going wrong.
So now I can generate directly BED files from the SAM files but I'm still wondering which exact reference genome they used (fasta file) so that I can generate BAM files and just because I'm curious now.