1. What do the "L" and "R" here refer to? Left/right part of the single reads spanning exon-intron junction? Or Left reads and Right reads of the paired-end reads?
I'm going to infer that indeed those are the "left" and "right" paired-end reads, given that the BED name entry seems to indicate a flowcell coordinate, and said coordinate is shared within sets of two reads in your example.
2. How can I convert such bed format into fastq? The Epigenome Roadmap doesn't provide sra for this dataset.
If you truly wanted FASTQ and not FASTA, and the only source you have for the data is this BED file, then you would have to fake the quality scores. But you could construct the rest of the FASTQ like this:
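- For the first line of each FASTQ read, use the fourth column of the BED file, prefixed with '@'.
- For the second line of each FASTQ read, you would need to extract the portion of the reference genome given by the first three columns of the BED file. So for the first line of your BED, you would want the sequence between bases 24,291,630 and 24,291,704 on chromosome 1. Keep in mind that BED start coordinates are 0-based and end coordinates are exclusive, so check whether those endpoints need shifting by one before extracting. Also, a read on the minus strand (sixth column, if present) corresponds to the reverse complement of the reference.
- For the third line of each FASTQ read, just put a '+' (optionally followed by the read name again).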
- For the fourth line, you would need to create fake quality scores, the number of which would correspond to the number of bases you extracted from the reference genome for that read.
This might be made easier by using the getfasta tool from BEDTools, which extracts the reference sequence for each BED interval.
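The steps above can be sketched in Python as follows. This is only a minimal illustration, not a tested pipeline: the function names, the toy reference dictionary, the read names, and the choice of 'I' as the fake quality character are all assumptions made for the example. It also assumes tab-delimited BED with the strand in the sixth column, and it reverse-complements minus-strand intervals.

```python
# Sketch: build FASTQ records from BED intervals plus a reference sequence.
# All names and data here are hypothetical; a real run would load the
# reference genome from a FASTA file instead of a small dict.

COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def bed_line_to_fastq(bed_line, reference, fake_quality="I"):
    """Turn one tab-delimited BED line into a four-line FASTQ record."""
    fields = bed_line.rstrip("\n").split("\t")
    chrom, start, end, name = fields[0], int(fields[1]), int(fields[2]), fields[3]
    strand = fields[5] if len(fields) > 5 else "+"
    # BED is 0-based and half-open, which matches Python slicing directly.
    seq = reference[chrom][start:end]
    if strand == "-":
        seq = reverse_complement(seq)
    # One fake quality character per extracted base.
    qual = fake_quality * len(seq)
    return f"@{name}\n{seq}\n+\n{qual}\n"


# Toy data (assumed for illustration only).
reference = {"chrToy": "ACGTACGTTTGACCA"}
bed_lines = [
    "chrToy\t2\t8\tread1_L\t0\t+",
    "chrToy\t5\t11\tread1_R\t0\t-",
]

for line in bed_lines:
    print(bed_line_to_fastq(line, reference), end="")
```

Note that the sequence recovered this way is the reference sequence at the aligned position, so any mismatches in the original reads are lost.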
EDIT: The subject asks for conversion to BAM format, but the question body asks for conversion to FASTQ.
To convert to BAM, there's a tool suite called BEDTools, which has a tool, bedToBam (bedtools bedtobam in newer releases), that should do the job for you if you supply a genome file listing the chromosome names and their sizes.
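As a rough sketch of how that call could be assembled, the Python below builds the tab-delimited genome file contents and the bedtools command line. The file names and the chromosome size are placeholders, and the helper names are hypothetical; only the -i and -g options of bedtools bedtobam are taken from the tool itself.

```python
import shlex

# Sketch: prepare the inputs for `bedtools bedtobam`.
# File names and chromosome sizes below are placeholders, not real values.


def genome_file_text(chrom_sizes):
    """Format chromosome sizes as the tab-delimited genome file BEDTools expects."""
    return "".join(f"{chrom}\t{size}\n" for chrom, size in chrom_sizes.items())


def bedtobam_command(bed_path, genome_path):
    """Return the argument list for converting a BED file to BAM."""
    return ["bedtools", "bedtobam", "-i", bed_path, "-g", genome_path]


genome_text = genome_file_text({"chrToy": 15})
cmd = bedtobam_command("reads.bed", "genome.txt")

# The tool writes BAM to standard output, so redirect it to a file.
print(shlex.join(cmd) + " > reads.bam")
```

The genome text would be saved to genome.txt before running the printed command in a shell.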
modified 5.9 years ago
5.9 years ago by
Dan D ♦ 7.1k