Question: Prepare file for Multiple alignment
4.6 years ago, Tark (50, Japan) wrote:

Hi everyone,

I need to prepare a file for multiple sequence alignment.

I converted the VCF files to consensus FASTA and used cat to join all the consensus sequences into one multi-FASTA file.

In the concatenated file, the sequences from a.fasta have headers >1 ... >10000, and so do the sequences from b.fasta. When I run ClustalW, it gives an error that sequences have the same header, which makes sense. So my plan is to convert the headers >1 ... >10000 in a.fasta to >1, those in b.fasta to >2, and likewise for all 5 of my samples.

Please guide me with a command to do that. Any help is appreciated.

Regards

 

next-gen • 1.7k views
modified 4.6 years ago by 5heikki (8.4k) • written 4.6 years ago by Tark (50)

So a.fasta contains 10000 sequences with headers 1, 2, 3, ... 10000, and b.fasta contains 10000 sequences with the same headers? You have 5 FASTA files like that (a, b, c, d, e) and you want to combine them into a single file while keeping their identity, right? That is easy: change the headers in each file, then combine the files with cat:

sed -i 's/^>/>A_/g' a.fasta

sed -i 's/^>/>B_/g' b.fasta

sed -i 's/^>/>C_/g' c.fasta

sed -i 's/^>/>D_/g' d.fasta

sed -i 's/^>/>E_/g' e.fasta

cat a.fasta b.fasta c.fasta d.fasta e.fasta >> combined.fasta

(If you want to use numbers instead of letters, substitute letters with numbers in the above sed commands)
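
The same idea can be sketched as a loop that leaves the input files untouched and writes the prefixed records straight into the combined file. This is only an illustration: the two tiny FASTA files are made-up demo data, and the label is assumed to be the upper-cased file name without its extension.

```shell
# Demo data (hypothetical): two small FASTA files with identical headers.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '>1\nAAAT\n>2\nACCC\n' > a.fasta
printf '>1\nGGGT\n>2\nTTTC\n' > b.fasta

: > combined.fasta                       # start from an empty output file
for f in a.fasta b.fasta; do
    # a.fasta -> A, b.fasta -> B
    label=$(basename "$f" .fasta | tr '[:lower:]' '[:upper:]')
    # Prefix each header line; without -i the input file is not modified.
    sed "s/^>/>${label}_/" "$f" >> combined.fasta
done

grep '^>' combined.fasta                 # list the new headers
```

Because sed runs without -i here, the loop can be re-run safely; with -i, running it twice would add the prefix twice.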

modified 4.6 years ago • written 4.6 years ago by arnstrm (1.7k)

Thank you, let me try it and see whether it solves my problem.

written 4.6 years ago by Tark (50)

Can you suggest how I can combine the 10000 sequences in a.fasta (headers 1, 2, 3, ... 10000) under just one header? For example, from this:

>1

AAATTTTGGGGCCC

>2

ACCCCGGGTTT

..........

>10000

ATGCCCCCCCCCC

to this:

>1

AAATTTTGGGGCCCACCCCGGGTTTATGCCCCCCCCCC

Please suggest.

written 4.6 years ago by Tark (50)

cat <(head -n 1 a.fasta) <(grep -v ">" a.fasta | tr -d "\n") > output.fasta

 

This concatenates:

1. the first line of a.fasta (the header), and

2. all the lines of a.fasta that do not contain ">", with the line breaks deleted.
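
A variant of the two steps above, sketched with a command group instead of process substitution, is shown here. It adds one extra step that is not in the original command: a final echo, so that the merged sequence line ends with a newline. The demo a.fasta is made up from the example sequences in the question.

```shell
# Demo data (hypothetical): three records taken from the example in the question.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '>1\nAAATTTTGGGGCCC\n>2\nACCCCGGGTTT\n>10000\nATGCCCCCCCCCC\n' > a.fasta

{
    head -n 1 a.fasta                     # 1. the first header line
    grep -v '>' a.fasta | tr -d '\n'      # 2. all sequence lines, joined into one
    echo                                  # extra step: end the sequence with a newline
} > output.fasta

cat output.fasta
```

The trailing newline matters only when the result is later concatenated with other files, since a file that does not end in a newline runs straight into whatever follows it.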

modified 4.6 years ago • written 4.6 years ago by 5heikki (8.4k)

Thanks, let me try this.

I understand the first part, but what does this part mean?

tr -d "\n"

written 4.6 years ago by Tark (50)

It removes the newline characters.
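
A minimal illustration, using a made-up three-line input: tr -d deletes every occurrence of the given character, so the three lines come out as one unbroken string.

```shell
# Three short lines go in; tr -d '\n' deletes each newline character.
joined=$(printf 'AAA\nCCC\nGGG\n' | tr -d '\n')
echo "$joined"
```

Note that the newline after the last line is deleted as well, so the output of tr itself does not end with a newline.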

written 4.6 years ago by Sam (2.3k)

Thank you.

I did as directed, and now I have a file with a single header, e.g.:

>A_1
GGATGCCCAGCTAGTTTGAATTTTAGATAAACAACGAATAATTTCGTAGCATAAATATGTCCCAAGCTTAGTTTGGGACATACTTATGCTAAAAAACATTATTGGT

I did this for two files; the second one is:

>B_1
GGATGCCCAGCTAGTTTGAATTTTAGATAAACAACGAATAATTTCGTAGCATAAATATGTCCCAAGCTTAGTTTGGGACATACTTATGCTAAAAAACATTATTGGT

Now I want to combine them, so I ran:

cat a.fasta b.fasta >> combined.fasta

To check whether both sequences ended up in one file, I ran:

grep -c "^>" combined.fasta

It gives 1. My question is: why not 2?

Also, when I upload this multi-FASTA file to Galaxy for multiple alignment, it gives a "badly formatted file" error.

Please help.

modified 4.6 years ago • written 4.6 years ago by Tark (50)

4.6 years ago, 5heikki (8.4k, Finland) wrote:

It's because there is no "\n" at the end of the sequences. In your combined FASTA, ">B_1" directly follows the last nucleotide of the ">A_1" sequence, so only one line begins with ">", and that is what grep -c "^>" counts. grep -c ">" would return 2. To avoid this you can, for example, use awk '{print}' a.fasta b.fasta > combined.fasta. Also, it is generally a good idea to use ">" instead of ">>", because ">>" appends to an existing file instead of overwriting it.
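
The effect can be reproduced with two made-up single-record files that lack a final newline (shortened sequences, for illustration only):

```shell
# Demo data (hypothetical): each file ends without a trailing newline.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '>A_1\nGGATGCCC' > a.fasta
printf '>B_1\nGGATGCCC' > b.fasta

# cat copies bytes as they are, so ">B_1" lands on the same line as A's sequence.
cat a.fasta b.fasta > cat_combined.fasta
# awk prints each input record followed by a newline, including the last one of each file.
awk '{print}' a.fasta b.fasta > awk_combined.fasta

cat_headers=$(grep -c '^>' cat_combined.fasta)
awk_headers=$(grep -c '^>' awk_combined.fasta)
echo "cat: $cat_headers header line(s), awk: $awk_headers header line(s)"
```

With this demo data the cat version reports 1 header line and the awk version reports 2.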

Edit: I meant to post this as a comment.

modified 4.6 years ago • written 4.6 years ago by 5heikki (8.4k)

OK, thank you.

written 4.6 years ago by Tark (50)

Thank you, awk '{print}' a.fasta b.fasta > combined.fasta worked well.

Can I ask one more question, not related to this?

I have to count the number of bases in a FASTA file, and I am using this command:

grep -v ">" a.fasta | wc | awk '{print $3-$1}'

I know this command is correct because it gives the same result as Galaxy.

Can you explain what $3-$1 means?

I saw in the awk manual that $ refers to a column and - is subtraction, but a FASTA file has no columns, so how should I read this?

written 4.6 years ago by Tark (50)
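
On the last question, a short sketch may help. In that pipeline awk never reads the FASTA file; it reads the single line printed by wc, which has three whitespace-separated fields: line count, word count, and byte count. $3-$1 is therefore bytes minus lines, i.e. the total size of the sequence lines minus one newline per line. The demo file is made up, and the result equals the base count only under the assumptions that the file uses Unix line endings and that its last line ends with a newline.

```shell
# Demo data (hypothetical): 6 + 3 + 4 = 13 bases on three sequence lines.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '>1\nAAATTT\nGGC\n>2\nACGT\n' > a.fasta

# wc prints: lines, words, bytes (here 3 lines, 3 words, 16 bytes).
grep -v '>' a.fasta | wc

# bytes - lines = 16 - 3 = 13, because each line contributes one newline byte.
bases=$(grep -v '>' a.fasta | wc | awk '{print $3 - $1}')
echo "$bases"
```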