Question

How To Extract Specific Coordinates From Multifasta File

6

Entering edit mode

13.1 years ago

Florianino ▴ 70

Hi, I have a genome file in multifasta format

>chr1
ATATATACCGA

>chr2
TGGCGTATTAT

...

and I would like to extract specific coordinates from specific sequences. Coordinates are stored in a three columns file table

chr1 2 4
chr2 4 6
chr2 3 6
...

Ideally the output would compile the extracted sequences in a single file indicating the chromosome and coordinates in the header:

>chr1_2_4
TAT
>chr2_4_6
CGT
...

Could anyone help me with this?

Thanks a lot in advance Florianino

extraction fasta coordinates • 11k views

ADD COMMENT • link updated 13.1 years ago by Florianino ▴ 30 • written 13.1 years ago by Florianino ▴ 70

Ram · Answer 1 · 2011-03-12

Here is my solution. Please have the seqinr package installed and don't forget to chmod +x

#!/usr/bin/env Rscript

library(seqinr)

seqs <- read.fasta("a001.fasta")
locs <- read.delim("a001.coords", header=F, sep=" ") # Tabs? Put sep="\t"
outf <- file("a001-results.fasta", 'w')

doitall <- function(x) {
  seq_id <- x[[1]]; start <- x[[2]]; stop <- x[[3]]; seq <- seqs[[seq_id]]
  seqv   <- seq[start:stop]
  header <- paste(sep="", ">", attr(seq, "name"), "_", start, "_", stop, "\n")
  cat(file=outf, header)
  cat(file=outf, toupper(paste(sep="", collapse="", seqv)), "\n")
}
apply(locs, 1, doitall)
close(outf)

==> a001.coords <==

chr1 2 4
chr2 4 6
chr2 3 6

==> a001.fasta <==

>chr1
ATATATACCGA

>chr2
TGGCGTATTAT

==> a001-results.fasta <==

>chr1_2_4
TAT 
>chr2_4_6
CGT 
>chr2_3_6
GCGT

Ram · Answer 2 · 2011-03-11

If you want to work with a plain text FASTA file, the fastaFromBed program in BEDTools will do this for you. The two inputs are your genome file in FASTA format and the coordinates in BED format. I would suggest using the unreleased version in the repository, as it uses Erik Garrison's nice FastaHack library for speed. If you already heave a .fai file from samtools, it will use that. Otherwise, it will create an index the first time, and reuse it thereafter.

fastaFromBed -fi in.fasta -bed regions.bed -fo out.regions.fa

Ram · Answer 3 · 2011-03-11

2

Entering edit mode

13.1 years ago

Pierre Lindenbaum 161k

You could try to use:

fastacmd (NCBI): see http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/formatdb_fastacmd.html
nibFrag or twoBitToFa: http://hgwdev.cse.ucsc.edu/~kent/src/unzipped/utils

See also: Fastest Way Of Extracting Millions Of Short Sequences From The Human Genome

ADD COMMENT • link updated 4.6 years ago by Ram 43k • written 13.1 years ago by Pierre Lindenbaum 161k

Ram · Answer 4 · 2011-03-12

1

Entering edit mode

13.1 years ago

Florianino ▴ 30

Hi thanks a lot guys, I'm very new in the field, I really appreciate your help, bla bla bla

I have installed bedtool and tried fastafromBED but it looks like when I ask for positions 1 to 25, it gives me 2 to 25 instead in the output. How come?

For the other solution, I had to install R. I think there may a problem with the script because I use tab delimited file for coords and there are only three fields (BED format).

Cheers

Florianino

ADD COMMENT • link 13.1 years ago by Florianino ▴ 30

1

Entering edit mode

Hi Florianino. Welcome to Biostar. I'm glad you installed R - it's the front-line tool in Bioinformatics. For tab delimited cords files, simply change sep=" " to sep="\t" on the read.delim line.

ADD REPLY • link 13.1 years ago by Aleksandr Levchuk 3.2k

0

Entering edit mode

BED format uses zero-based, half-open coordinates, so the first 25 bases of a sequence are in the range 0-25 (those bases being numbered 0 to 24).

ADD REPLY • link 13.1 years ago by biobot 0.0.77.a.1099 6.2k

0

Entering edit mode

This would be better posed as a new question.

ADD REPLY • link 13.1 years ago by biobot 0.0.77.a.1099 6.2k

0

Entering edit mode

ok thanks, well I guess that's obvious for everyone.. BEDtools looks very handy! Florianino

ADD REPLY • link 13.1 years ago by Florianino ▴ 30

0

Entering edit mode

@Florianino - I don't think this is obvious at all, see the following question inspired by your post: What Are The Advantages/Disadvantages Of One-Based Vs. Zero-Based Genome Coordinate Systems

ADD REPLY • link updated 4.6 years ago by Ram 43k • written 13.1 years ago by Casey Bergman 18k

score 0 · Answer 5 · 2011-04-01

0

Entering edit mode

13.1 years ago

Florianino ▴ 30

Ok, I'll post the BED issue right now as new question. Thanks Aleksandr, i'll try that and let you know. Thank you every one. Things are much easier when people are connected Cheers

ADD COMMENT • link 13.1 years ago by Florianino ▴ 30