Question: How can I add sequences to my BED file?
0
gravatar for saygingulec
26 days ago by
saygingulec0 wrote:

I have a text file formatted like BED and I want to add the sequences as a column on to it. Bedtools getfasta should be doing that with -bedOut option but as explained in this post (https://github.com/arq5x/bedtools2/issues/571), it doesn't work as it is supposed to. I also tried to download the sequences using UCSC's REST API with requests in Python but it is too slow for the number of lines in my file. My last resort will be to download the whole genome as a FASTA file and try to parse that but it tedious to do and introduces extra steps, files and scripts to my pipeline.

Does anybody know an alternative to bedtools getfasta -bedOut or any suggestions to speed up the downloads from UCSC's REST API?

Thanks in advance.

bed bedtools • 107 views
ADD COMMENTlink modified 26 days ago • written 26 days ago by saygingulec0

Maybe if you give us a sample of your input we could try to reproduce the phenomena.

ADD REPLYlink written 26 days ago by tiago2112871.2k

The input looks like this:

chrI       11140   11240   1       NM_058259.4     +
chrI       11240   11340   2       NM_058259.4     +
chrI       11340   11440   3       NM_058259.4     +
chrI       11440   11540   4       NM_058259.4     +
chrI       11540   11640   5       NM_058259.4     +

And I am using this code:

bedtools getfasta -fi chrI.fa -bed genes_divided.bed -fo genes_divided_seqs.bed -bedOut
ADD REPLYlink written 26 days ago by saygingulec0

Perhaps modify this answer for your assembly:

https://bioinformatics.stackexchange.com/questions/14015/how-to-retrieve-the-dna-sequence-from-a-particular-species-and-region-through-co/14030#14030

You'll need to pass in a region file to samtools, where coordinates are in chr:start-stop format. That's easy to do with awk.

The output from samtools can be written to a column of your BED file with paste.

ADD REPLYlink written 26 days ago by Alex Reynolds30k

I tried that but I get this error:

[W::fai_get_val] Reference I 11140 11240 not found in FASTA file, returning empty sequence
[faidx] Failed to fetch sequence in I 11140 11240

P.S. Chromosome names in my file look like "I" instead of "chrI". I fixed the fasta files for that.

ADD REPLYlink modified 25 days ago • written 25 days ago by saygingulec0

The chromosome naming scheme you choose will need to be consistent when doing queries. If your BED file and FASTA use different schemes, queries won't work.

ADD REPLYlink written 25 days ago by Alex Reynolds30k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1548 users visited in the last hour