Question

Prepare file for Multiple alignment

10.9 years ago

Tark ▴ 50

Hi everyone,

I need to prepare a multiple sequence alignment file.

I converted each VCF file to a consensus FASTA and concatenated all the consensus sequences into one multi-FASTA file.

In the concatenated file, the sequences from a.fasta have headers >1 ... >10000, and the sequences from b.fasta also have headers >1 ... >10000. When I run clustalw, it reports an error that sequences have the same header, which makes sense. So what I am planning to do is change all the headers >1 ... >10000 in a.fasta to >1, all the headers in b.fasta to >2, and likewise for all 5 of my samples.

Please guide me with a command to do that.

Any help is appreciated.

Regards

next-gen • 4.4k views

ADD COMMENT • link updated 3.5 years ago by Ram 45k • written 10.9 years ago by Tark ▴ 50

So a.fasta contains 10000 sequences with headers 1, 2, 3, ..., 10000, and b.fasta contains 10000 sequences with the same headers? You have 5 FASTA files like that (a, b, c, d, e) and you want to combine them into a single file while keeping track of which file each sequence came from, right? That's easy: change the headers in each file and then combine the files with cat:

sed -i 's/^>/>A_/g' a.fasta
sed -i 's/^>/>B_/g' b.fasta
sed -i 's/^>/>C_/g' c.fasta
sed -i 's/^>/>D_/g' d.fasta
sed -i 's/^>/>E_/g' e.fasta

cat a.fasta b.fasta c.fasta d.fasta e.fasta > combined.fasta

(If you want to use numbers instead of letters, substitute letters with numbers in the above sed commands)
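
The five sed calls above can also be written as one loop. This is only a sketch: it assumes the files are named a.fasta to e.fasta as in the question, and the scratch directory and tiny demo sequences are made up so the snippet runs on its own. It writes the renamed records to a new file instead of editing in place, so the inputs stay untouched.

```shell
# Hypothetical demo inputs in a scratch directory (replace with your real files).
workdir=$(mktemp -d) && cd "$workdir" || exit 1
for s in a b c d e; do
    printf '>1\nACGT\n>2\nGGCC\n' > "$s.fasta"
done

# Prefix every header with the upper-cased sample letter and write all
# records to one new file; '>' overwrites, so re-running is safe.
for s in a b c d e; do
    prefix=$(printf '%s' "$s" | tr 'a-z' 'A-Z')
    sed "s/^>/>${prefix}_/" "$s.fasta"
done > combined.fasta

# Sanity checks: number of headers, and any header that occurs more than once.
header_count=$(grep -c '^>' combined.fasta)
duplicate_headers=$(grep '^>' combined.fasta | sort | uniq -d)
echo "headers: $header_count"
echo "duplicates: ${duplicate_headers:-none}"
```

If the duplicates line prints anything other than "none", clustalw is likely to complain about repeated headers again.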

ADD REPLY • link updated 3.5 years ago by Ram 45k • written 10.9 years ago by arnstrm ★ 1.9k

Thank you, let me try it and see whether it solves my problem.

ADD REPLY • link updated 3.5 years ago by Ram 45k • written 10.9 years ago by Tark ▴ 50

Can you suggest how I can combine a.fasta, which has 10000 sequences with headers 1, 2, 3, ..., 10000, into a single sequence under just one header? For example, from this:

>1
AAATTTTGGGGCCC
>2
ACCCCGGGTTT
..........
>10000
ATGCCCCCCCCCC

to this:

>1
AAATTTTGGGGCCCACCCCGGGTTTATGCCCCCCCCCC

Please suggest.

ADD REPLY • link updated 3.5 years ago by Ram 45k • written 10.9 years ago by Tark ▴ 50

cat <(head -n 1 a.fasta) <(grep -v ">" a.fasta | tr -d "\n") > output.fasta

This concatenates:

1. the first line of a.fasta (the header), and
2. all the lines of a.fasta that don't contain a ">", with the line breaks deleted.
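
A variant of the same idea, as a sketch: the command above leaves output.fasta without a final newline, so this version adds one with echo. The demo input is made up; only the file names a.fasta and output.fasta come from the thread.

```shell
# Made-up demo input in a scratch directory.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '>1\nAAATTTTGGGGCCC\n>2\nACCCCGGGTTT\n>10000\nATGCCCCCCCCCC\n' > a.fasta

# Keep the first header, join all sequence lines, then terminate the record
# with a newline so the file can be safely concatenated with others.
{
    head -n 1 a.fasta
    grep -v '^>' a.fasta | tr -d '\n'
    echo
} > output.fasta

cat output.fasta
```

The braces group the three commands so that their combined output goes to output.fasta through a single redirection.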

ADD REPLY • link updated 3.5 years ago by Ram 45k • written 10.8 years ago by 5heikki 11k

Thanks, let me try this.

I understand the first part, but what does this mean?

tr -d "\n"

ADD REPLY • link updated 3.5 years ago by Ram 45k • written 10.8 years ago by Tark ▴ 50

It removes the newline characters.
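
To see it in action, here is a tiny sketch with a made-up two-line input: tr -d '\n' deletes every newline character, so the two lines come out as one string.

```shell
# Two made-up sequence lines, each ending in a newline.
joined=$(printf 'AAATTTT\nGGGGCCC\n' | tr -d '\n')
printf '%s\n' "$joined"   # prints AAATTTTGGGGCCC
```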

ADD REPLY • link updated 3.5 years ago by Ram 45k • written 10.8 years ago by Sam ★ 4.8k

Thank you.

I did as directed, and now I have a file with a single header, e.g.:

>A_1
GGATGCCCAGCTAGTTTGAATTTTAGATAAACAACGAATAATTTCGTAGCATAAATATGTCCCAAGCTTAGTTTGGGACATACTTATGCTAAAAAACATTATTGGT

I did this for two files; the second one looks like this:

>B_1
GGATGCCCAGCTAGTTTGAATTTTAGATAAACAACGAATAATTTCGTAGCATAAATATGTCCCAAGCTTAGTTTGGGACATACTTATGCTAAAAAACATTATTGGT

Now I want to combine them, so I ran:

cat a.fasta b.fasta >> combined.fasta

To check whether both sequences ended up in the combined file, I ran:

grep -c "^>" combined.fasta

It gives 1. My question is: why not 2?

Also, when I upload this multi-FASTA file to Galaxy for multiple alignment, it gives a "badly formatted file" error.

Please help.

ADD REPLY • link updated 3.5 years ago by Ram 45k • written 10.8 years ago by Tark ▴ 50

Ram · Answer 1 · 2014-09-10

10.8 years ago

5heikki 11k

That's because there is no \n at the end of the sequences. In your combined FASTA, >B_1 directly follows the last nucleotide of the >A_1 sequence, so only one line begins with > (which is what grep -c "^>" counts); grep -c ">" would return 2. To avoid this you can, for example, run awk '{print}' a.fasta b.fasta > combined.fasta, which ends every line it prints with a newline. Also, it is generally a good idea to use > instead of >>, because >> appends to an existing file.
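
The effect can be reproduced with two made-up files that, like yours, lack a final newline. This is a sketch; the short sequences and the name bad.fasta are only for illustration.

```shell
# Made-up inputs without a trailing newline, in a scratch directory.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '>A_1\nGGATGCCC' > a.fasta
printf '>B_1\nGGATGCCC' > b.fasta

# Plain cat glues >B_1 onto the end of the first sequence line.
cat a.fasta b.fasta > bad.fasta

# awk prints every input line followed by a newline, so the header
# of the second file starts on its own line.
awk '{print}' a.fasta b.fasta > combined.fasta

bad_count=$(grep -c '^>' bad.fasta)
good_count=$(grep -c '^>' combined.fasta)
echo "cat: $bad_count, awk: $good_count"   # prints: cat: 1, awk: 2
```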

Edit. meant to post as a comment

ADD COMMENT • link updated 3.5 years ago by Ram 45k • written 10.8 years ago by 5heikki 11k

OK, thank you.

ADD REPLY • link 10.8 years ago by Tark ▴ 50

Thank you, it (awk '{print}' a.fasta b.fasta > combined.fasta) worked well.

Can I ask one question not related to this?

I have to count the number of bases in a FASTA file, and I am using this command:

grep -v ">" a.fasta | wc | awk '{print$3-$1}'

I know this command is correct, as it gives me the same result as Galaxy.

Can you explain what $3-$1 means?

I saw in the awk manual that $ refers to a column and - is subtraction, but a FASTA file doesn't have columns, so how should I understand this?
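
For context, the columns here come from the output of wc, not from the FASTA file: wc prints three whitespace-separated counts (lines, words, bytes), which awk sees as $1, $2 and $3. Each sequence line ends with one newline, so bytes minus lines leaves the number of bases. A sketch with a made-up two-record file, assuming plain Unix line endings and no other whitespace in the sequences:

```shell
# Made-up demo file: 6 + 4 = 10 bases.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '>1\nAAATTT\n>2\nGGCC\n' > demo.fasta

# wc reports lines, words and bytes for the sequence lines only.
grep -v '>' demo.fasta | wc

# $3 (bytes) minus $1 (lines, i.e. newline characters) = number of bases.
bases=$(grep -v '>' demo.fasta | wc | awk '{print $3-$1}')
echo "$bases"   # prints 10
```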

ADD REPLY • link updated 3.5 years ago by Ram 45k • written 10.8 years ago by Tark ▴ 50