Question

How to work with FASTA files and build phylogenetic trees in R?

5.3 years ago

rimgubaev ▴ 330

Could someone suggest a tool for manipulating FASTA files, calculating distances, and building phylogenetic trees in R?

My specific task is the following: I have a VCF recoded as a multi-FASTA file, where each header corresponds to an individual and each SNP is represented by a nucleotide (N if the position was not read, and an ambiguity code such as R, M, or S for a heterozygous site). The sequences have the same length for every individual, so the FASTA is effectively already "aligned". I would like to load the FASTA as a data frame with individuals as row names and one nucleotide per column cell, so that I can operate on them, for example remove all heterozygous SNPs or positions with N. After that, I would like to calculate distances between the samples (trying different methods) and build an NJ tree with bootstrap support.

I tried to do this with ape/phangorn but without success (I tried to load the FASTA as a data frame to operate on it, but failed). Maybe my idea is totally wrong and I should choose another tool or approach. If somebody could suggest some tutorials, I would be grateful.

populational genetics R ape phangorn VCF • 4.3k views

If you needed individual variant information, why recode the VCF to a multi-fasta? Why not work directly on the VCF file?

And why do you wish to use R to get from what you have to a phylogenetic tree? Why not look for available tools that could go from VCF to tree?

Reply • 5.3 years ago by Ram 44k

Yeah, I understand that you mean tools like SNPhylo, and I agree that they would work. But the key thing in my case is the ability to manipulate the SNPs, namely to remove some of them (heterozygous or unread SNPs) and see how the tree changes.

Reply • 5.3 years ago by rimgubaev ▴ 330

So why not work directly on the VCF? You can manipulate the VCF to get variants to a ./. state and use that to re-generate the tree.

Reply • 5.3 years ago by Ram 44k

What software should I use in this case?

Reply • 5.3 years ago by rimgubaev ▴ 330

score 1 · Accepted Answer · 2019-04-18

Here is a partial answer for my case: how to read a FASTA file into a matrix of letters (one row per sequence, one column per position) using the ape R package.

library(ape)

# Read the FASTA file as a character matrix (rows = sequences, columns = positions)
seqdata <- read.dna(file = "your.fasta",
                    format = "fasta", as.character = TRUE)

# Perform some manipulations here (removing positions, selecting/renaming samples, etc.)
# NOTE: the letters in the matrix are lowercase, so this command can be useful
seqdata <- toupper(seqdata)

# Write the output
# NOTE: set nbcol = -1 to write each sequence on a single line,
# and colsep = "" to avoid splitting the sequence into blocks
write.dna(seqdata, "your.new.fasta", format = "fasta", nbcol = -1, colsep = "")
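
The remaining steps from the question (dropping heterozygous or unread positions, calculating distances, and building an NJ tree with bootstrap support) could be sketched roughly as follows. This is only a minimal sketch and not part of the tested answer above: the toy matrix stands in for the output of read.dna(..., as.character = TRUE), `tree_fun` is a hypothetical helper name, and model = "raw" and B = 100 are arbitrary choices that can be swapped for other settings.

```r
library(ape)

# Toy alignment standing in for read.dna(..., as.character = TRUE) output
# (assumption: rows are individuals, columns are SNP positions, lowercase letters).
seqdata <- rbind(
  ind1 = c("a", "c", "g", "t", "n", "a", "r", "c"),
  ind2 = c("a", "c", "a", "t", "g", "a", "g", "t"),
  ind3 = c("g", "c", "a", "c", "g", "a", "a", "t"),
  ind4 = c("g", "t", "a", "c", "g", "a", "a", "t")
)

# Keep only positions where every individual has an unambiguous base;
# this drops columns containing n or ambiguity codes such as r, m, s.
keep <- apply(seqdata, 2, function(col) all(col %in% c("a", "c", "g", "t")))
filtered <- seqdata[, keep, drop = FALSE]

# Convert the character matrix back to ape's DNAbin class.
dna <- as.DNAbin(filtered)

# Hypothetical helper: distance matrix followed by neighbor joining.
# The distance model is the part to experiment with (see ?dist.dna).
tree_fun <- function(x) nj(dist.dna(x, model = "raw"))
tree <- tree_fun(dna)

# Bootstrap support: resample alignment columns and rebuild the tree B times.
set.seed(1)
boot <- boot.phylo(tree, dna, tree_fun, B = 100, quiet = TRUE)
tree$node.label <- boot
```

The supports can then be displayed with plot(tree, show.node.label = TRUE). Converting with as.data.frame(seqdata) is possible if a data frame is preferred for the manipulation step, but the character matrix is what as.DNAbin() takes here.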