Question: sam2bed processing paired-end data and producing unexpectedly short intervals in bed file....
0
gravatar for chrisclarkson100
17 months ago by
European Union
chrisclarkson10050 wrote:

I have downloaded the 2 paired end fastq files from ENA- documenting MNase Seq data for nucleosome occupancy across the mouse genome: http://www.ebi.ac.uk/ena/data/view/SRR2034494 I mapped the files with bowtie2:

bowtie2 -x mm9 --very-sensitive -p 8 -1 SRR1.fastq -2 SRR2.fastq -S out.sam

94.83% overall alignment rate

everything seems normal here but the problem is that the bed file that results from sam2bed

sam2bed < out.sam > out.bed

has resulted in a bed file with interval sizes that are typically in the 50-70 bp range... This is impossible as nucleosomes are 147 bp long and hence reads of <147 are not possible.

After receiving some advice I have found that the intervals may be too short due to 'sam2bed'. Does anyone know of a way to account for paired ends in sam files when converting to a bed file?

NOTE: I have also tried this with other independent nucleosome datasets and the same thing happens so it is not a question of the data

sam2bed alignment • 723 views
ADD COMMENTlink modified 17 months ago by ATpoint13k • written 17 months ago by chrisclarkson10050

--solexa-quals

Are you sure about this? Unless data is old (5+ years) it is unlikely to be in solexa quality format.

ADD REPLYlink written 17 months ago by genomax62k

Wait sorry I should have specified: it is the bowtie2 (which doesn't have any solexa quals component) command that I am having problems with rather than bowtie 1

ADD REPLYlink written 17 months ago by chrisclarkson10050

The intervals represent individual reads, not fragments, so if the input was 50-70 base long reads then what you're seeing is entirely expected.

ADD REPLYlink written 17 months ago by Devon Ryan88k

Thanks for the reply, please note that I have updated the question

ADD REPLYlink written 17 months ago by chrisclarkson10050

What is your end goal with all of this? I suspect there's a better way to go about things.

ADD REPLYlink written 17 months ago by Devon Ryan88k

Hi, I contacted the address given on the pubmed page. And I have been assured that this is paired-end data documenting nucleosome occupancy genome wide. They confirmed that the intervals in my bed file are too small to be nucleosomes. Instead one should expect 100-200 bp ranges rather than 50-70 bp. They told me that they originally mapped the data with bowtie 1 rather than 2 and this produced much larger, expected intervals.... I am considering just moving back to bowtie1 but I am still attracted to the superior alignment rates given by bowtie2 and would like to pursue it until I am sure that it can not produce same output as bowtie 1 with greater alignment. My downstream analysis requires for the data to be representative of the nucleosome sizes.

ADD REPLYlink modified 17 months ago • written 17 months ago by chrisclarkson10050

Did you even read my answer, explaining why you get these short intervals? You are converting single end reads instead of paired-end fragments.

I have also tried this with other independent nucleosome datasets and the same thing happens so it is not a question of the data.

This should have been the alarm for you that your pipeline is wrong and not the data.

ADD REPLYlink modified 17 months ago • written 17 months ago by ATpoint13k

Hi, I should have replied to your answer I will do so now in a comment on your answer below

ADD REPLYlink written 17 months ago by chrisclarkson10050

chrisclarkson100 : bowtie v.1 does ungapped alignments so keep that in mind when thinking about "superior" bowtie2 alignments which may be gapped.

ADD REPLYlink modified 17 months ago • written 17 months ago by genomax62k

Right, so you haven't answered my question. What exactly are you trying to do with the data? If you run bamPEFragmentSize on the BAM file you'll probably see a more meaningful distribution. But even if you can make a BED file that matches that it doesn't mean that that's a reasonable way or proceeding.

ADD REPLYlink written 17 months ago by Devon Ryan88k

My job is to document the position of each nucleosome in the genome- using various datasets from MNAse experiments. My job will then be to calculate the nucleosome repeat length (NRL) in different regions. I have a script that does this by analysing the consecutive distances between the starts of each positioned nucleosome. The script requires the interval distances documenting the start and end of each nucleosome to be of a realistic size i.e. not < 147 in order to give viable statistics.... I have previously used a script to account for the paired end read data: https://github.com/homeveg/nuctools/extend_PE_reads.pl but this did not seem to make a difference to the intervals, the vast majority of which are < 147. Hence I was worried that it might be a step prior to taking PEs into account.

ADD REPLYlink written 17 months ago by chrisclarkson10050

I imagine your job would be much easier if you skipped BED files and instead did bamCoverage --MNase -bs 1, which will give you a bigWig file of nucleosome density. I'd just Fourier transform that and get the repeat length that way.

ADD REPLYlink written 17 months ago by Devon Ryan88k
0
gravatar for ATpoint
17 months ago by
ATpoint13k
Germany
ATpoint13k wrote:

There are some things you have to address:

Any bamtobed application, be it bedops or bedtools expects name-sorted input for paired-end data. Also you have to specify in the command that your file is PE. As Devon said, your output contains individual reads, but not pairs/fragments. Sort your file by name, then pipe this into bedtools with the -bedpe option. Extracting then the start of the forward read and the end of the reverse read together with chr and strand information will give you a "normal" BED file representing your paired-end data You can get this by:

samtools sort -n -l 0 -O bam in.sam | bedtools bamtobed -i - -bedpe | cut -f1,2,6,7,8,9 | sort -k1,1 -k2,2n > output.bed

I am not too familiar with MNase-seq but aren't these files typically pretty big, so conversion will take time and space? What are you planning to do with the BEDs? Probably there is a way to do your downstream without leaving the binary format.

ADD COMMENTlink modified 17 months ago • written 17 months ago by ATpoint13k

Hi, I tried this on my own data and it gave a very strange output that was not in bed format:

samtools sort -n -l 0 -O bam Ishii.sam | bedtools bamtobed -i - -bedpe | cut -f1,2,6,7,8,9 | sort -k1,1 -k2,2n > sorted_new.bed

.       -1      100000009       SRR1781827.163167397    0       .
.       -1      100000009       SRR1781827.37847207     0       .
.       -1      10000000        SRR1781827.87602083     0       .
.       -1      100000011       SRR1781827.118245882    0       .
.       -1      100000011       SRR1781827.143216953    0       .
.       -1      100000011       SRR1781827.56370058     0       .
.       -1      10000001        SRR1781827.39763233     0       .
ADD REPLYlink written 17 months ago by chrisclarkson10050

The command has been working perfectly fine for me. Maybe an issue with your file. Please post a small subset of your bam.

samtools view in.sam | head -n 10

Anyway, this is what you should get as a result from the bedpe command:

chr1 629909 630006 M03229:13:000000000-B889M:1:1119:18672:4690 51 +

ADD REPLYlink modified 17 months ago • written 17 months ago by ATpoint13k

for me samtools view Ishii.sam | head -n 10 returns

SRR1781827.7    99  chr13   88806847    1   50M =   88806979    182 AGTTCCTTTGTGAATGTGTTACCTCACTCAGGATGATGCCCTCCAGGTCC  BCCFDFFFHGHHFHIJIJJJIJJIJGJJJIIIHIJHIJJJJIJJIJIJJJ  AS:i:0  XS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:50 YS:i:0  YT:Z:CP
SRR1781827.7    147 chr13   88806979    1   50M =   88806847    -182    GTATCCATTCCTCTGTTGAGGGGCATCTGGGTTCTTTCCAGCTTCTGGCT  JIIEJIIHGJIGIJIJIHJJJJJJIIIGJJJJIJJJJHHHHHFFFFFCCC  AS:i:0  XS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:50 YS:i:0  YT:Z:CP
SRR1781827.3    83  chr16   11175374    42  50M =   11175334    -90 GGTGGTGGCACGGCTTTTAAATCACAGCATTCGGGAGGTAGGGGAAAGCA  JGIGJJIJIJJJIJIHFDIIIHIIJGIIGGFJJJJIJHHHHHDFFFFCCB  AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:50 YS:i:0  YT:Z:CP
SRR1781827.3    163 chr16   11175334    42  50M =   11175374    90  TCTTTCAAATCAGACTTTAAAACTGAAACCGGAGTGGGGCGGTGGTGGCA  BBBFFFFFGHHHHIJIIJHHHJJJIJJHJJJJ?HCHIIJEHF>BD>BDBA  AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:50 YS:i:0  YT:Z:CP
ADD REPLYlink written 17 months ago by chrisclarkson10050

I should also mention that I have previously used a script to account for the paired end read data: https://github.com/homeveg/nuctools/extend_PE_reads.pl but this did not seem to make a difference to the intervals, the vast majority of which are < 147. Hence I was worried that it might be a step prior to taking PEs into account.

ADD REPLYlink written 17 months ago by chrisclarkson10050

I don't actually have this file in BAM format... I mapped it directly to sam format with:

bowtie2 -x mm9 --very-sensitive -p 8 -1 SRR1.fastq -2 SRR2.fastq -S out.sam

so the only file here is in sam format

ADD REPLYlink modified 17 months ago • written 17 months ago by chrisclarkson10050

Look, it is pretty simple. Take the sam directly from bowtie without further manipulation and then use the bedpe script I posted. It will get the job done.

ADD REPLYlink written 17 months ago by ATpoint13k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 845 users visited in the last hour