Question: Converting Sam Files To Bam Files - Reproduce Results Nature Paper: Transcriptome Genetics Using Second Generation Sequencing In A Caucasian Population
7
gravatar for Sander Timmer
6.8 years ago by
Sander Timmer700
United Kingdom
Sander Timmer700 wrote:

I want to reproduce the results that people achieved in the following Nature paper: Transcriptome genetics using second generation sequencing in a Caucasian population http://www.nature.com/nature/journal/vaop/ncurrent/full/nature08903.html

I downloaded their SAM files from the groups website: http://funpopgen.unige.ch/data/ceu60

I downloaded a reference fasta and fai file from: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/pilot_data/technical/reference/

The main problems seem to exist that I'm not able to convert these SAM files into proper "working" BAM files so that I can get BED files that is the input format for FluxCapacitor (http://flux.sammeth.net/). I tried using the following steps (as there is no "proper" header in the SAM files I've to do some additional steps):

  1. samtools view -bt human_b36_male.fa.gz.fai first.sam> first.bam
  2. samtools sort first.bam first.bam.sorted
  3. samtools index first.bam.sorted
  4. samtools index aln-sorted.bam

When I then run the following command to get BED files I end up with an empty bed file and errors:

  1. bamToBed -i reads.bam > reads.bed *this is using BEDtools Gives the following error:

    terminate called after throwing an instance of 'std::out_of_range' what(): vector::_M_range_check

As the BAM to BED file conversation seems not to be working properly I decided to check if I can use samtools to get basepair abundances using samtools pileup. I run the following but get no output:

samtools pileup -cf human_b36_male.fa reads.bam

To make sure the downloaded SAM files are correctly I tried to load the original SAM files into IGV I don't see any reads aligning anywhere to the genome. I already reconstructed another genome set using the fasta files that are closest to the mentioned Ensembl release (54) they mention in the paper (fasta file from the 1000 Genomes FTP again - but maybe this is not correct, though cannot find any better suiting Fasta files that might represent the correct Ensembl release/genome build) but this doesn't help as still none of the reads seem to align to the reference genome (or at least no reads show up in the viewer).

Anyone with some tips about the issues I've generating BED files from the published SAM files. Also any comments about why the IGV isn't showing any reads might be helpful for me understanding what's going wrong.

UPDATE Put 2 SAM files into my public directory so that people don't have to download the full 50GB: http://www.ebi.ac.uk/~swtimmer/RNAseq/

UPDATE2

So now I can generate directly BED files from the SAM files but I'm still wondering which exact reference genome they used (fasta file) so that I can generate BAM files and just because I'm curious now.

bedtools bam samtools sam • 5.5k views
ADD COMMENTlink modified 6.8 years ago by David Langenberger8.4k • written 6.8 years ago by Sander Timmer700
2

Is it possible they aligned to contigs or something? If I google for AL662826.11, it appears to be something like this: http://www.ncbi.nlm.nih.gov/nuccore/20870241

Maybe you could post a single bam file somewhere for people to look at, because nobody wants to download 50 gigs.

ADD REPLYlink written 6.8 years ago by Madelaine Gogol5.0k
1

@brentp, sorry should have mentioned that this is how I started: [swtimmer@ebi-001 RNAseq]$ samtools view -bS 22087.sam.gz | bedtools bamtobed > reads.bed -bash: bedtools: command not found [samopen] no @SQ lines in the header. [samread1] missing header? Abort!

But have to use "a" fasta reference file to get this working. Paper states they used NCBI36 but I think this is where things go wrong as I think I use somehow a different reference genome (the BAM files tells me that none of the reads are aligned to any of the annotations.

ADD REPLYlink written 6.8 years ago by Sander Timmer700

How about posting the SAM header (or at least a few lines)? Do the sequence names in the fasta file match those used in the SAM file?

ADD REPLYlink written 6.8 years ago by Sean Davis25k

you can shorten your commands to: samtools view -bS first.sam | bedtools bamtobed > reads.bed does that work for you?

ADD REPLYlink written 6.8 years ago by brentp22k

Did you try 'samtools view aln.bam | less'? Check that everything looks okay, that you get the correct chromosome names (matching the fasta file). Also, check that the fasta file doesn't use too long lines, many tools use a fixed buffer for line length.

ADD REPLYlink written 6.8 years ago by Ketil3.9k

@Sean Davis I'm looking into this, I have the feeling that this is the source of my problem. When looking at my BAM files: samtools view 3125_8.sam.gz.bam | awk '{print $3}' | sort | uniq -c 27653784 *

So that is not what we expected to see....

First 2 lines of the SAM file: [swtimmer@ebi-001 RNAseq]$ more 31258.sam IL63125:8:58:1625:1479 67 clone::AL662826.11:1:145431:1 1261 0 37M * 0 247 AAAAGGAGTAGGCAGGAAAACAGTC AATTATGGATTC ?BBCBBBB@<BBBBCB@A@?B?B>@A@B@BABB?B@? MF:i:18 Aq:i:0 NM:i:0 UQ:i:0 H0:i:4 H1:i:0 IL6_3125:8:37:57:1851 131 clone::AL662826.11:1:145431:1 1458 0 37M * 0 262 GTGAA

ADD REPLYlink written 6.8 years ago by Sander Timmer700

Is that fai file gzipped? Do you need to unzip that first, or did you already?

ADD REPLYlink written 6.8 years ago by Madelaine Gogol5.0k

@Madelaine, updated the question with a link to 2 SAM files: http://www.ebi.ac.uk/~swtimmer/RNAseq/

ADD REPLYlink written 6.8 years ago by Sander Timmer700
5
gravatar for brentp
6.8 years ago by
brentp22k
Salt Lake City, UT
brentp22k wrote:

I checked one of the files from their site. The problem is they have no header. Bamtools, used by bedtools expects a header.

You can just make a BED file your self via the following:

zless 1184_1.sam.gz |  awk 'BEGIN{OFS=FS="\t"}(! and($2, 4)){ print $3,$4 - 1, $4 + length($10), $0 }' > out.bed

adding columns as you need them. Currently it prints chrom, start, end for aligned reads.

They did align to contigs because the chromosome names look like clone::AL662826.11:1:145431:1

ADD COMMENTlink written 6.8 years ago by brentp22k
1

If you can get a hold of the genome they used, you can use picard CreateSequenceDictionary to use as a header.

ADD REPLYlink written 6.8 years ago by Madelaine Gogol5.0k
1

When using the awk script above, assure that you don't have any indels, otherwise these loci will be wrong. $10 is the read sequence and not the reference sequence, it maps to.

ADD REPLYlink written 6.8 years ago by David Langenberger8.4k
1

@brentp and @madelaine I just cannot really find out what specific genome build/contigs they used. I tried now several ncbi36 versions but they all seem to fail. Maybe I should just ask the author for a more specific description of the reference they used

ADD REPLYlink written 6.8 years ago by Sander Timmer700

Will try to run this to get BED files. Thanks, looks like a solution. But still I'd be interested to know if its possible to get proper working BAM files out of these SAM files.

ADD REPLYlink written 6.8 years ago by Sander Timmer700

right, or samtools -t genome.fasta

ADD REPLYlink written 6.8 years ago by brentp22k

When using the awk script above, assure that you don't have any indels, otherwise these loci will be wrong. $10 is the read sequence and the reference sequence, it maps to.

ADD REPLYlink written 6.8 years ago by David Langenberger8.4k

@brentp -- samtools -t genome.fasta? I thought there might be a way to do this in samtools, but I couldn't remember. Are you missing a word in there?

ADD REPLYlink written 6.8 years ago by Madelaine Gogol5.0k

Oh, hey, I just found it: samtools view -bt genome.fa.fai sample.sam

Much easier.

ADD REPLYlink written 6.8 years ago by Madelaine Gogol5.0k
2
gravatar for David Langenberger
6.8 years ago by
Deutschland
David Langenberger8.4k wrote:
 zcat 1184_8.sam.gz | \
 perl -F'\t' -ane \
    '$strand=($F[1]&16)?"-":"+";\
     $length=1;\
     $tmp=$F[5];\
     $tmp =~ s/(\d+)[MD]/$length+=$1/eg;\
     print "$F[2]\t$F[3]\t".($F[3]+$length)."\t$F[0]\t0\t$strand\n";'\
 > 1184_8.bed
ADD COMMENTlink modified 5.7 years ago by Sukhdeep Singh9.5k • written 6.8 years ago by David Langenberger8.4k
3

But nevertheless.... to get a bam file, you'll need the header.

ADD REPLYlink written 6.8 years ago by David Langenberger8.4k
2

Can you make this a little uglier? Heh.

ADD REPLYlink written 6.8 years ago by Madelaine Gogol5.0k
1

Sorry... first version didn't check the strand correctly. Now it should work correctly. :)

ADD REPLYlink written 6.8 years ago by David Langenberger8.4k
1

Well, it works, right? Of course you can map the reads again to get the header, or spend hours in creating the header. I just prefer the most painless way of creating a bed file out of the sam file.

Even, if it is ugly. ;)

I was just writing it down, without spending a lot of time with writing it nicely. BUT... I named the variables 'length' and 'strand', so it should be possible to understand it.... perhaps.

ADD REPLYlink written 6.8 years ago by David Langenberger8.4k
1
gravatar for Line Skotte
6.7 years ago by
Line Skotte10
Line Skotte10 wrote:

I generated a header the following way:

cut -f3 1382_1.sam | uniq > tmp_refnames cut -d':' -f5 tmp_refnames > tmp_reflengths paste -d 'n' $(yes headerstart|head -201)>tmp_headerstart2 paste -d ':' tmp_headerstart2 tmp_refnames > tmp_refnames2 paste -d 'n' $(yes headerinsert|head -201)>tmp_headerinsert2 paste tmp_refnames2 tmp_headerinsert2 > tmp_refnames3 paste -d: tmp_refnames3 tmp_reflengths > headerfile

where headerstart is a textfile containing '@SQ SN' and headerinsert is a textfile containing 'LN'. Not very elegant, but gives you a header list with the refnames and reflengths. It turns out that they mapped to 201 different ref sequences.

Among them is e.g.

@SQ SN:chromosome:NCBI36:15:1:100338915:1 LN:100338915 @SQ SN:chromosome:NCBI36:Y:2709521:57443437:1 LN:57443437 @SQ SN:chromosome:NCBI36:6:1:170896992:1 LN:170896992 @SQ SN:chromosome:NCBI36:4:1:191263063:1 LN:191263063 @SQ SN:chromosome:NCBI36:5:1:180837866:1 LN:180837866

etc.

which i guess is the build36 but chopped up into chrs.

to add the created header to the sam files: cat headerfile 1382_1.sam > 1382_1h.sam

looking at the data: samtools view 1382_1h.sam -S -b >1382_1.bam samtools sort 1382_1.bam 1382_1sort samtools index 1382_1sort.bam

samtools pileup 1382_1sort.bam | less

looking with tview gives some problems if you wish to use the goto function due to the long refnames!!!

Hope this helps you with your problem.

Best, Line Skotte

ADD COMMENTlink written 6.7 years ago by Line Skotte10

ups i forgot some newlines above:

cut -f3 13821.sam | uniq > tmprefnames cut -d':' -f5 tmprefnames > tmpreflengths paste -d '\n' $(yes headerstart|head -201)>tmpheaderstart2 paste -d ':' tmpheaderstart2 tmprefnames > tmprefnames2 paste -d '\n' $(yes headerinsert|head -201)>tmpheaderinsert2 paste tmprefnames2 tmpheaderinsert2 > tmprefnames3 paste -d: tmprefnames3 tmpreflengths > headerfile

ADD REPLYlink written 6.7 years ago by Line Skotte10

cut -f3 13821.sam | uniq > tmprefnames

cut -d':' -f5 tmprefnames > tmpreflengths

paste -d '\n' $(yes headerstart|head -201)>tmp_headerstart2

paste -d ':' tmpheaderstart2 tmprefnames > tmp_refnames2

paste -d '\n' $(yes headerinsert|head -201)>tmp_headerinsert2

paste tmprefnames2 tmpheaderinsert2 > tmp_refnames3

paste -d: tmprefnames3 tmpreflengths > headerfile

ADD REPLYlink written 6.7 years ago by Line Skotte10

Sorry about the missing newlines above, hope that you can put them in youself. Before every 'cut' and 'paste'.

ADD REPLYlink written 6.7 years ago by Line Skotte10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1852 users visited in the last hour