Question: How to work with fasta files and make phylogenetics in R?
4 weeks ago by rimgubaev (120), Russia/Moscow/Skoltech
rimgubaev wrote:

I wonder if someone could suggest a tool for manipulating FASTA files, as well as for calculating distances and building phylogenetic trees in R.

My specific task is the following. I have a VCF recoded as a multi-FASTA file, so that each header corresponds to an individual and each SNP is represented by a nucleotide (N if the position was not read, and R, M, S, etc. for heterozygous sites). The sequences have the same length for every individual (in other words, the FASTA is effectively already "aligned"). I would like to load the FASTA as a data frame in which the individuals are the row names and the nucleotides are in the column cells, so that I can operate on them, for example to remove all heterozygous SNPs or positions with N. After that, I would like to calculate the distances between the samples (trying different methods here) and build an NJ tree with bootstrap support.

I tried to do this with ape/phangorn, but with no success so far (I tried to load the FASTA as a data frame to operate on it, but failed). Maybe my idea is totally wrong and I should choose another tool or approach. If somebody could suggest some tutorials, I would be grateful.

modified 4 weeks ago • written 4 weeks ago by rimgubaev (120)

If you needed individual variant information, why recode the VCF to a multi-fasta? Why not work directly on the VCF file?

And why do you wish to use R to get from what you have to a phylogenetic tree? Why not look for available tools that could go from VCF to tree?

Reply written 4 weeks ago by RamRS (21k)

Yeah, I understand that you mean tools like SNPhylo, and I agree that they are fine. But the key thing in my case is the ability to manipulate the SNPs, namely to remove some of them (heterozygous or unread SNPs) and see how the tree changes.

Reply written 4 weeks ago by rimgubaev (120)

So why not work directly on the VCF? You can manipulate the VCF to get variants to a ./. state and use that to re-generate the tree.

Reply written 4 weeks ago by RamRS (21k)

What software should I use in this case?

Reply written 4 weeks ago by rimgubaev (120)
4 weeks ago by rimgubaev (120), Russia/Moscow/Skoltech
rimgubaev wrote:

Here is a partial answer for my case: a way to read a FASTA file into a character matrix (one letter per cell) using the ape R package.

library(ape)

# Read the FASTA file as a character matrix (rows = samples, columns = positions)
seqdata <- read.dna(file = "your.fasta",
                    format = "fasta", as.character = TRUE)

# Perform some manipulations here (removing positions, selecting/renaming samples, etc.)
# NOTE: the letters in the matrix are lowercase, so this command can be useful
seqdata <- toupper(seqdata)

# Write the output
# NOTE: set nbcol = -1 to write each sequence on a single line,
# and colsep = "" to write the sequence without separators
write.dna(seqdata, "your.new.fasta", format = "fasta", nbcol = -1, colsep = "")
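
The remaining steps from the question (dropping positions with N or heterozygous codes, computing distances, and building an NJ tree with bootstrap support) could be sketched roughly as below. This is only an illustration under some assumptions: the toy matrix stands in for what read.dna(..., as.character = TRUE) returns (lowercase letters, before any toupper call), the "raw" distance model and B = 100 replicates are arbitrary choices, and the names keep, filtered, tree_fun and support are made up for the example.

```r
library(ape)

# Toy alignment (assumption): rows are individuals, columns are SNP positions,
# in the same lowercase form that read.dna(..., as.character = TRUE) produces.
seqdata <- rbind(
  ind1 = c("a", "c", "g", "t", "a", "n", "r", "c"),
  ind2 = c("a", "c", "g", "a", "a", "c", "a", "c"),
  ind3 = c("t", "c", "a", "a", "g", "c", "a", "t"),
  ind4 = c("t", "g", "a", "a", "g", "c", "g", "t")
)

# Keep only the columns where every individual has an unambiguous base.
# This drops unread positions (n) and heterozygous IUPAC codes (r, m, s, ...).
keep <- apply(seqdata, 2, function(col) all(col %in% c("a", "c", "g", "t")))
filtered <- seqdata[, keep, drop = FALSE]

# Convert the character matrix to ape's DNAbin class.
dna <- as.DNAbin(filtered)

# Distance + NJ wrapped in one function so the bootstrap can reuse it.
# model = "raw" is just one option; other dist.dna models can be swapped in.
tree_fun <- function(x) nj(dist.dna(x, model = "raw"))
tree <- tree_fun(dna)

# Bootstrap support: resample alignment columns and rebuild the tree B times.
set.seed(1)
support <- boot.phylo(tree, dna, FUN = tree_fun, B = 100, quiet = TRUE)
tree$node.label <- support

# plot(tree, type = "unrooted", show.node.label = TRUE)
```

Changing the model argument of dist.dna is one way to try different distance methods. Note that nj() cannot handle missing values in the distance matrix, so with heavily filtered real data it is worth checking the result of dist.dna for NA or NaN before building the tree.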
written 4 weeks ago by rimgubaev (120)