Question: How can I add sequences to my BED file?
0
gravatar for saygingulec
3 months ago by
saygingulec0 wrote:

I have a text file formatted like BED and I want to add the sequences as a column on to it. Bedtools getfasta should be doing that with -bedOut option but as explained in this post (https://github.com/arq5x/bedtools2/issues/571), it doesn't work as it is supposed to. I also tried to download the sequences using UCSC's REST API with requests in Python but it is too slow for the number of lines in my file. My last resort will be to download the whole genome as a FASTA file and try to parse that but it tedious to do and introduces extra steps, files and scripts to my pipeline.

Does anybody know an alternative to bedtools getfasta -bedOut or any suggestions to speed up the downloads from UCSC's REST API?

Thanks in advance.

bed bedtools • 180 views
ADD COMMENTlink modified 3 months ago • written 3 months ago by saygingulec0

Maybe if you give us a sample of your input we could try to reproduce the phenomena.

ADD REPLYlink written 3 months ago by tiago2112871.2k

The input looks like this:

chrI       11140   11240   1       NM_058259.4     +
chrI       11240   11340   2       NM_058259.4     +
chrI       11340   11440   3       NM_058259.4     +
chrI       11440   11540   4       NM_058259.4     +
chrI       11540   11640   5       NM_058259.4     +

And I am using this code:

bedtools getfasta -fi chrI.fa -bed genes_divided.bed -fo genes_divided_seqs.bed -bedOut
ADD REPLYlink written 3 months ago by saygingulec0

Perhaps modify this answer for your assembly:

https://bioinformatics.stackexchange.com/questions/14015/how-to-retrieve-the-dna-sequence-from-a-particular-species-and-region-through-co/14030#14030

You'll need to pass in a region file to samtools, where coordinates are in chr:start-stop format. That's easy to do with awk.

The output from samtools can be written to a column of your BED file with paste.

ADD REPLYlink written 3 months ago by Alex Reynolds31k

I tried that but I get this error:

[W::fai_get_val] Reference I 11140 11240 not found in FASTA file, returning empty sequence
[faidx] Failed to fetch sequence in I 11140 11240

P.S. Chromosome names in my file look like "I" instead of "chrI". I fixed the fasta files for that.

ADD REPLYlink modified 3 months ago • written 3 months ago by saygingulec0

The chromosome naming scheme you choose will need to be consistent when doing queries. If your BED file and FASTA use different schemes, queries won't work.

ADD REPLYlink written 3 months ago by Alex Reynolds31k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1252 users visited in the last hour