Question

sam2bed processing paired-end data and producing unexpectedly short intervals in bed file....

0

Entering edit mode

6.7 years ago

chrisclarkson100 ▴ 150

I have downloaded the 2 paired end fastq files from ENA- documenting MNase Seq data for nucleosome occupancy across the mouse genome: http://www.ebi.ac.uk/ena/data/view/SRR2034494 I mapped the files with bowtie2:

bowtie2 -x mm9 --very-sensitive -p 8 -1 SRR1.fastq -2 SRR2.fastq -S out.sam

94.83% overall alignment rate

everything seems normal here but the problem is that the bed file that results from sam2bed

sam2bed < out.sam > out.bed

has resulted in a bed file with interval sizes that are typically in the 50-70 bp range... This is impossible as nucleosomes are 147 bp long and hence reads of <147 are not possible.

After receiving some advice I have found that the intervals may be too short due to 'sam2bed'. Does anyone know of a way to account for paired ends in sam files when converting to a bed file?

NOTE: I have also tried this with other independent nucleosome datasets and the same thing happens so it is not a question of the data

alignment sam2bed • 2.5k views

ADD COMMENT • link updated 6.6 years ago by ATpoint 82k • written 6.7 years ago by chrisclarkson100 ▴ 150

0

Entering edit mode

--solexa-quals

Are you sure about this? Unless data is old (5+ years) it is unlikely to be in solexa quality format.

ADD REPLY • link 6.7 years ago by GenoMax 141k

0

Entering edit mode

Wait sorry I should have specified: it is the bowtie2 (which doesn't have any solexa quals component) command that I am having problems with rather than bowtie 1

ADD REPLY • link 6.7 years ago by chrisclarkson100 ▴ 150

0

Entering edit mode

The intervals represent individual reads, not fragments, so if the input was 50-70 base long reads then what you're seeing is entirely expected.

ADD REPLY • link 6.7 years ago by Devon Ryan 104k

0

Entering edit mode

Thanks for the reply, please note that I have updated the question

ADD REPLY • link 6.6 years ago by chrisclarkson100 ▴ 150

0

Entering edit mode

What is your end goal with all of this? I suspect there's a better way to go about things.

ADD REPLY • link 6.6 years ago by Devon Ryan 104k

0

Entering edit mode

Hi, I contacted the address given on the pubmed page. And I have been assured that this is paired-end data documenting nucleosome occupancy genome wide. They confirmed that the intervals in my bed file are too small to be nucleosomes. Instead one should expect 100-200 bp ranges rather than 50-70 bp. They told me that they originally mapped the data with bowtie 1 rather than 2 and this produced much larger, expected intervals.... I am considering just moving back to bowtie1 but I am still attracted to the superior alignment rates given by bowtie2 and would like to pursue it until I am sure that it can not produce same output as bowtie 1 with greater alignment. My downstream analysis requires for the data to be representative of the nucleosome sizes.

ADD REPLY • link 6.6 years ago by chrisclarkson100 ▴ 150

0

Entering edit mode

Did you even read my answer, explaining why you get these short intervals? You are converting single end reads instead of paired-end fragments.

I have also tried this with other independent nucleosome datasets and the same thing happens so it is not a question of the data.

This should have been the alarm for you that your pipeline is wrong and not the data.

ADD REPLY • link 6.6 years ago by ATpoint 82k

0

Entering edit mode

Hi, I should have replied to your answer I will do so now in a comment on your answer below

ADD REPLY • link 6.6 years ago by chrisclarkson100 ▴ 150

0

Entering edit mode

chrisclarkson100 : bowtie v.1 does ungapped alignments so keep that in mind when thinking about "superior" bowtie2 alignments which may be gapped.

ADD REPLY • link 6.6 years ago by GenoMax 141k

0

Entering edit mode

Right, so you haven't answered my question. What exactly are you trying to do with the data? If you run bamPEFragmentSize on the BAM file you'll probably see a more meaningful distribution. But even if you can make a BED file that matches that it doesn't mean that that's a reasonable way or proceeding.

ADD REPLY • link 6.6 years ago by Devon Ryan 104k

0

Entering edit mode

My job is to document the position of each nucleosome in the genome- using various datasets from MNAse experiments. My job will then be to calculate the nucleosome repeat length (NRL) in different regions. I have a script that does this by analysing the consecutive distances between the starts of each positioned nucleosome. The script requires the interval distances documenting the start and end of each nucleosome to be of a realistic size i.e. not < 147 in order to give viable statistics.... I have previously used a script to account for the paired end read data: https://github.com/homeveg/nuctools/extend_PE_reads.pl but this did not seem to make a difference to the intervals, the vast majority of which are < 147. Hence I was worried that it might be a step prior to taking PEs into account.

ADD REPLY • link 6.6 years ago by chrisclarkson100 ▴ 150

0

Entering edit mode

I imagine your job would be much easier if you skipped BED files and instead did bamCoverage --MNase -bs 1, which will give you a bigWig file of nucleosome density. I'd just Fourier transform that and get the repeat length that way.

ADD REPLY • link 6.6 years ago by Devon Ryan 104k

score 0 · Answer 1 · 2017-09-04

0

Entering edit mode

6.6 years ago

ATpoint 82k

There are some things you have to address:

Any bamtobed application, be it bedops or bedtools expects name-sorted input for paired-end data. Also you have to specify in the command that your file is PE. As Devon said, your output contains individual reads, but not pairs/fragments. Sort your file by name, then pipe this into bedtools with the -bedpe option. Extracting then the start of the forward read and the end of the reverse read together with chr and strand information will give you a "normal" BED file representing your paired-end data You can get this by:

samtools sort -n -l 0 -O bam in.sam | bedtools bamtobed -i - -bedpe | cut -f1,2,6,7,8,9 | sort -k1,1 -k2,2n > output.bed

I am not too familiar with MNase-seq but aren't these files typically pretty big, so conversion will take time and space? What are you planning to do with the BEDs? Probably there is a way to do your downstream without leaving the binary format.

ADD COMMENT • link 6.6 years ago by ATpoint 82k

0

Entering edit mode

Hi, I tried this on my own data and it gave a very strange output that was not in bed format:

samtools sort -n -l 0 -O bam Ishii.sam | bedtools bamtobed -i - -bedpe | cut -f1,2,6,7,8,9 | sort -k1,1 -k2,2n > sorted_new.bed

.       -1      100000009       SRR1781827.163167397    0       .
.       -1      100000009       SRR1781827.37847207     0       .
.       -1      10000000        SRR1781827.87602083     0       .
.       -1      100000011       SRR1781827.118245882    0       .
.       -1      100000011       SRR1781827.143216953    0       .
.       -1      100000011       SRR1781827.56370058     0       .
.       -1      10000001        SRR1781827.39763233     0       .

ADD REPLY • link 6.6 years ago by chrisclarkson100 ▴ 150

0

Entering edit mode

The command has been working perfectly fine for me. Maybe an issue with your file. Please post a small subset of your bam.

samtools view in.sam | head -n 10

Anyway, this is what you should get as a result from the bedpe command:

chr1 629909 630006 M03229:13:000000000-B889M:1:1119:18672:4690 51 +

ADD REPLY • link 6.6 years ago by ATpoint 82k

0

Entering edit mode

for me samtools view Ishii.sam | head -n 10 returns

SRR1781827.7    99  chr13   88806847    1   50M =   88806979    182 AGTTCCTTTGTGAATGTGTTACCTCACTCAGGATGATGCCCTCCAGGTCC  BCCFDFFFHGHHFHIJIJJJIJJIJGJJJIIIHIJHIJJJJIJJIJIJJJ  AS:i:0  XS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:50 YS:i:0  YT:Z:CP
SRR1781827.7    147 chr13   88806979    1   50M =   88806847    -182    GTATCCATTCCTCTGTTGAGGGGCATCTGGGTTCTTTCCAGCTTCTGGCT  JIIEJIIHGJIGIJIJIHJJJJJJIIIGJJJJIJJJJHHHHHFFFFFCCC  AS:i:0  XS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:50 YS:i:0  YT:Z:CP
SRR1781827.3    83  chr16   11175374    42  50M =   11175334    -90 GGTGGTGGCACGGCTTTTAAATCACAGCATTCGGGAGGTAGGGGAAAGCA  JGIGJJIJIJJJIJIHFDIIIHIIJGIIGGFJJJJIJHHHHHDFFFFCCB  AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:50 YS:i:0  YT:Z:CP
SRR1781827.3    163 chr16   11175334    42  50M =   11175374    90  TCTTTCAAATCAGACTTTAAAACTGAAACCGGAGTGGGGCGGTGGTGGCA  BBBFFFFFGHHHHIJIIJHHHJJJIJJHJJJJ?HCHIIJEHF>BD>BDBA  AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:50 YS:i:0  YT:Z:CP

ADD REPLY • link 6.6 years ago by chrisclarkson100 ▴ 150

0

Entering edit mode

I should also mention that I have previously used a script to account for the paired end read data: https://github.com/homeveg/nuctools/extend_PE_reads.pl but this did not seem to make a difference to the intervals, the vast majority of which are < 147. Hence I was worried that it might be a step prior to taking PEs into account.

ADD REPLY • link 6.6 years ago by chrisclarkson100 ▴ 150

0

Entering edit mode

I don't actually have this file in BAM format... I mapped it directly to sam format with:

bowtie2 -x mm9 --very-sensitive -p 8 -1 SRR1.fastq -2 SRR2.fastq -S out.sam

so the only file here is in sam format

ADD REPLY • link 6.6 years ago by chrisclarkson100 ▴ 150

0

Entering edit mode

Look, it is pretty simple. Take the sam directly from bowtie without further manipulation and then use the bedpe script I posted. It will get the job done.

ADD REPLY • link 6.6 years ago by ATpoint 82k