How To Use Coordinates In Order To Extract Sequences In Fasta File?
3
3
Entering edit mode
10.6 years ago
srpatel144 ▴ 30

I have obtained a .coords file from MUMmer after aligning my assembly to the reference genome.

I have been asked to write a script that outputs 2 fasta files,

1 which will contain the sequences that were aligned to the reference and 1 which will contain the sequences that were not aligned to the sequence.

I'm not sure how to go about this. How am I supposed to use the coordinates for the sequences that aligned in order to do this?

Also, won't there be parts of sequences that don't align? What/how do I go about collecting those pieces and putting it in the 2nd fasta file?

genome alignment fasta • 23k views
ADD COMMENT
0
Entering edit mode
Error: cannot construct subsequence with negative offset or length < 1

Why do I get this error? Can somebody help me?

ADD REPLY
0
Entering edit mode

Probably because your end < start, but what are you trying to do?

ADD REPLY
0
Entering edit mode

I have found my error, it is my question of format. Thank you!

ADD REPLY
5
Entering edit mode
10.6 years ago

There is also bedtools getfasta. It uses Erik's code under the hood.

$ cat test.fa
>chr1
AAAAAAAACCCCCCCCCCCCCGCTACTGGGGGGGGGGGGGGGGGG

$ cat test.bed
chr1 5 10

$ bedtools getfasta -fi test.fa -bed test.bed -fo test.fa.out

$ cat test.fa.out
>chr1:5-10
AAACC

Docs: http://bedtools.readthedocs.org/en/latest/content/tools/getfasta.html

And it is wrapped in pybedtools as well: http://pythonhosted.org/pybedtools/autodocs/pybedtools.BedTool.sequence.html?highlight=fasta

ADD COMMENT
1
Entering edit mode

I wish we could once and for all decide between 0 and 1-based numbering and then whatever we choose, use that universally. Here, I would have said that 5-10 is AAAACC because there's no zeroth nucleotide. I would say that most biologists would have thought the same..

ADD REPLY
0
Entering edit mode

The fasta header coordinates are definitely confusing, but here 0-based makes sense, as BED files are using 0-based half-open coordinates. "Use that universally" won't ever be practical because there are too many standards already in use.

ADD REPLY
2
Entering edit mode
10.6 years ago

I think samtools faidx feature exactly does what you need. If you give a reference genome and coordinates as input it will extract the sequence. Check samtools See below:

faidx samtools faidx <ref.fasta> [region1 [...]]
Index reference sequence in the FASTA format or extract subsequence from indexed reference sequence. If no region is specified, faidx will index the file and create <ref.fasta>.fai on the disk. If regions are speficified, the subsequences will be retrieved and printed to stdout in the FASTA format.

You will have to first create the fasta indexes of the reference genome fasta file and then use this command.

ADD COMMENT
2
Entering edit mode
10.6 years ago

This is a shameless plug, but I just uploaded a pure python implementation of faidx to GitHub. I'm calling it pyfaidx. It will also work for you. You might also check out:

  1. FastaHack
  2. brentp's fastahack-python, which inspired my implementation

Both of which require compiled C code, but aren't actually much faster than the pure python version.

ADD COMMENT
0
Entering edit mode

nice! I have a Fai pure-python one here: https://gist.github.com/brentp/1842507 that I never polished up.

ADD REPLY

Login before adding your answer.

Traffic: 1300 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6