Question: How To Use Coordinates In Order To Extract Sequences In Fasta File?
1
gravatar for srpatel144
6.0 years ago by
srpatel14410
srpatel14410 wrote:

I have obtained a .coords file from MUMmer after aligning my assembly to the reference genome.

I have been asked to write a script that outputs 2 fasta files,

1 which will contain the sequences that were aligned to the reference and 1 which will contain the sequences that were not aligned to the sequence.

I'm not sure how to go about this. How am I supposed to use the coordinates for the sequences that aligned in order to do this?

Also, won't there be parts of sequences that don't align? What/how do I go about collecting those pieces and putting it in the 2nd fasta file?

genome fasta alignment • 11k views
ADD COMMENTlink modified 4.0 years ago by syxbestmayer10 • written 6.0 years ago by srpatel14410

Error: cannot construct subsequence with negative offset or length < 1

why I have this error? somebody can help me?

ADD REPLYlink written 4.0 years ago by syxbestmayer10

Probably because your end < start, but what are you trying to do? 

ADD REPLYlink written 4.0 years ago by Zaag720

I have found my error,it is my question of format.Thank you!

ADD REPLYlink written 4.0 years ago by syxbestmayer10
4
gravatar for Aaronquinlan
6.0 years ago by
Aaronquinlan11k
United States
Aaronquinlan11k wrote:

There is also bedtools getfasta. It uses Erik's code under the hood.

$ cat test.fa
>chr1
AAAAAAAACCCCCCCCCCCCCGCTACTGGGGGGGGGGGGGGGGGG

$ cat test.bed
chr1 5 10

$ bedtools getfasta -fi test.fa -bed test.bed -fo test.fa.out

$ cat test.fa.out
>chr1:5-10
AAACC

Docs: http://bedtools.readthedocs.org/en/latest/content/tools/getfasta.html

And it is wrapped in pybedtools as well: http://pythonhosted.org/pybedtools/autodocs/pybedtools.BedTool.sequence.html?highlight=fasta

ADD COMMENTlink modified 6.0 years ago • written 6.0 years ago by Aaronquinlan11k
1

I wish we could once and for all decide between 0 and 1-based numbering and then whatever we choose, use that universally. Here, I would have said that 5-10 is AAAACC because there's no zeroth nucleotide. I would say that most biologists would have thought the same..

 

ADD REPLYlink modified 4.0 years ago • written 4.0 years ago by 5heikki8.5k

The fasta header coordinates are definitely confusing, but here 0-based makes sense, as BED files are using 0-based half-open coordinates. "Use that universally" won't ever be practical because there are too many standards already in use.

ADD REPLYlink written 4.0 years ago by Matt Shirley9.1k
2
gravatar for Ashutosh Pandey
6.0 years ago by
Philadelphia
Ashutosh Pandey11k wrote:

I think Samtools faidx feature exactly does what you need. If you give a reference genome and coordinates as input it will extract the sequence. Check samtools See below:

faidx samtools faidx <ref.fasta> [region1 [...]] Index reference sequence in the FASTA format or extract subsequence from indexed reference sequence. If no region is specified, faidx will index the file and create <ref.fasta>.fai on the disk. If regions are speficified, the subsequences will be retrieved and printed to stdout in the FASTA format.

You will have to first create the fasta indexes of the reference genome fasta file and then use this command.

ADD COMMENTlink written 6.0 years ago by Ashutosh Pandey11k
1
gravatar for Matt Shirley
6.0 years ago by
Matt Shirley9.1k
Cambridge, MA
Matt Shirley9.1k wrote:

This is a shameless plug, but I just uploaded a pure python implementation of faidx to GitHub. I'm calling it pyfaidx. It will also work for you. You might also check out:

  1. FastaHack
  2. brentp's fastahack-python, which inspired my implementation

Both of which require compiled C code, but aren't actually much faster than the pure python version.

ADD COMMENTlink written 6.0 years ago by Matt Shirley9.1k

nice! I have a Fai pure-python one here: https://gist.github.com/brentp/1842507 that I never polished up.

ADD REPLYlink modified 6.0 years ago • written 6.0 years ago by brentp23k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1787 users visited in the last hour