Prepare file for Multiple alignment
9.6 years ago
Tark ▴ 50

Hi everyone,

I need to prepare a file for multiple sequence alignment.

I converted the VCF files to consensus FASTA and concatenated (cat) all of these consensus sequences into one multi-FASTA file.

In the concatenated file, the sequences from a.fasta have headers >1 ... >10000, and so do the sequences from b.fasta. When I run ClustalW it gives an error that sequences have the same header, which makes sense. So what I am planning to do is convert >1 ... >10000 in a.fasta to >1, convert >1 ... >10000 in b.fasta to >2, and do the same for all 5 of my samples.

Please guide me with a command to do that.

Any help would be appreciated.

Regards

next-gen • 3.6k views

So a.fasta contains 10000 sequences with headers 1, 2, 3, .... 10000 and b.fasta contains 10000 sequences with headers 1, 2, 3, .... 10000? You have 5 fasta files like that (a, b, c, d, e) and you want to combine them to a single file keeping their identity, right? It is easy, just change the header in each file and combine them using cat:

sed -i 's/^>/>A_/g' a.fasta
sed -i 's/^>/>B_/g' b.fasta
sed -i 's/^>/>C_/g' c.fasta
sed -i 's/^>/>D_/g' d.fasta
sed -i 's/^>/>E_/g' e.fasta

cat a.fasta b.fasta c.fasta d.fasta e.fasta > combined.fasta

(If you want to use numbers instead of letters, substitute letters with numbers in the above sed commands)
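
The same renaming can be sketched as a loop that writes prefixed copies instead of editing the files in place, so running it twice does not stack prefixes. This is only a sketch under assumptions: the tiny a.fasta and b.fasta created here are made-up stand-ins for the real files, and the `prefixed_` file names are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Work in a scratch directory so no real data is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Made-up stand-ins for the real per-sample files.
printf '>1\nAAAT\n>2\nACCC\n' > a.fasta
printf '>1\nAAAT\n>2\nACCG\n' > b.fasta

# Prefix every header with the upper-cased sample name (A_, B_, ...),
# writing a new copy of each file rather than using sed -i.
for f in a.fasta b.fasta; do
    sample=$(basename "$f" .fasta | tr '[:lower:]' '[:upper:]')
    sed "s/^>/>${sample}_/" "$f" > "prefixed_$f"
done

# Use > (overwrite) so repeated runs do not append duplicates.
cat prefixed_a.fasta prefixed_b.fasta > combined.fasta

# List the headers, which should now be unique across samples.
grep '^>' combined.fasta
```

Extending the list in the `for` line to c.fasta, d.fasta and e.fasta would cover all five samples.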


Thank you, let me try it and see whether it solves my problem.


Can you suggest how I can combine the 10000 sequences in a.fasta (headers 1, 2, 3, .... 10000) under just one header? E.g., going from

>1
AAATTTTGGGGCCC
>2
ACCCCGGGTTT
..........
>10000
ATGCCCCCCCCCC

to

>1
AAATTTTGGGGCCCACCCCGGGTTTATGCCCCCCCCCC

Please suggest.

cat <(head -n 1 a.fasta) <(grep -v ">" a.fasta | tr -d "\n") > output.fasta

This concatenates:

1. the first line of a.fasta, and
2. all the lines of a.fasta that don't contain a ">", with line breaks deleted.
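
A variant of the command above, sketched on a made-up three-record file, that also ends the joined sequence with a newline (the original leaves the last line unterminated, because tr deletes every newline, including the final one). The file contents and the scratch directory are assumptions for illustration only.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)
cd "$workdir"

# Made-up stand-in for a.fasta.
printf '>1\nAAATTTTGGGGCCC\n>2\nACCCCGGGTTT\n>3\nATGCCCCCCCCCC\n' > a.fasta

{
    head -n 1 a.fasta                  # keep only the first header
    grep -v '>' a.fasta | tr -d '\n'   # join all sequence lines into one
    echo                               # terminate the joined sequence with a newline
} > output.fasta

cat output.fasta
```

Grouping the three commands in braces sends their combined output to one file, which plays the same role as the two process substitutions in the original one-liner.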


Thanks, let me try this.

I understand the first part, but what does this mean?

tr -d "\n"

It removes the newline characters.
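
A tiny illustration, using a made-up two-line input rather than a real FASTA file:

```shell
# Two sequence lines go in; tr -d deletes every newline character,
# so a single unbroken string comes out.
joined=$(printf 'AAATTTT\nGGGGCCC\n' | tr -d '\n')
echo "$joined"   # prints AAATTTTGGGGCCC
```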


Thank you.

I did as directed, and now I have a file with a single header, e.g.

>A_1
GGATGCCCAGCTAGTTTGAATTTTAGATAAACAACGAATAATTTCGTAGCATAAATATGTCCCAAGCTTAGTTTGGGACATACTTATGCTAAAAAACATTATTGGT

I did the same with a second file:

>B_1
GGATGCCCAGCTAGTTTGAATTTTAGATAAACAACGAATAATTTCGTAGCATAAATATGTCCCAAGCTTAGTTTGGGACATACTTATGCTAAAAAACATTATTGGT

Now I want to combine them, so I ran:

cat a.fasta b.fasta >> combined.fasta

To check whether both sequences ended up in the combined file, I ran:

grep -c "^>" combined.fasta

It gives 1. My question is: why not 2?

Also, when I upload this multi-FASTA file to Galaxy for multiple alignment, it gives a "badly formatted file" error.

Please help.

9.6 years ago
5heikki 11k

It's because there is no \n at the end of the sequences. In your combined FASTA, >B_1 sits right after the last nucleotide of the >A_1 sequence, so only one line begins with > and grep -c "^>" returns 1; grep -c ">" would return 2. To avoid this you can, for example, run awk '{print}' a.fasta b.fasta > combined.fasta. Also, it is generally a good idea to use > instead of >>, because >> appends to an existing file.

Edit: meant to post this as a comment.
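
The effect can be reproduced on two made-up files whose last line lacks a trailing newline. This is a sketch only: the short sequences and the glued.fasta name are assumptions, not the poster's data.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)
cd "$workdir"

# Made-up files whose last line has no trailing newline,
# which is what tr -d "\n" produces.
printf '>A_1\nGGATGCC' > a.fasta
printf '>B_1\nGGATGCC' > b.fasta

# Plain cat glues >B_1 onto the end of the A_1 sequence line.
cat a.fasta b.fasta > glued.fasta
anchored=$(grep -c '^>' glued.fasta)   # lines that START with >
anywhere=$(grep -c '>' glued.fasta)    # lines that CONTAIN >

# awk prints every record followed by a newline, restoring the line break.
awk '{print}' a.fasta b.fasta > combined.fasta
fixed=$(grep -c '^>' combined.fasta)

echo "cat: anchored=$anchored anywhere=$anywhere; awk: anchored=$fixed"
# prints: cat: anchored=1 anywhere=2; awk: anchored=2
```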


ok thank you


Thank you, it (awk '{print}' a.fasta b.fasta > combined.fasta) worked well.

Can I ask you one question not related to this?

I have to count the number of bases in a FASTA file, and I am using this command:

grep -v ">" a.fasta | wc | awk '{print$3-$1}'

I know this command is correct, as it gives me the same result as Galaxy.

Can you explain what $3-$1 means?

I saw in the awk manual that $ refers to a column and - is subtraction, but a FASTA file doesn't have columns, so how should I understand it?
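
For what it is worth, the columns in that pipeline belong to the output of wc, not to the FASTA file: wc prints three numbers (newline count, word count, byte count), so $3-$1 is bytes minus newlines, which leaves only the sequence characters. A minimal sketch on a made-up file (the file contents are an assumption for illustration):

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)
cd "$workdir"

# Made-up FASTA: 14 + 11 = 25 bases on two newline-terminated sequence lines.
printf '>1\nAAATTTTGGGGCCC\n>2\nACCCCGGGTTT\n' > a.fasta

# wc prints <lines> <words> <bytes>; here 2, 2 and 27 (25 bases + 2 newlines).
grep -v '>' a.fasta | wc

# awk splits that single line of wc output on whitespace:
# $1 is the line count and $3 is the byte count.
bases=$(grep -v '>' a.fasta | wc | awk '{print $3 - $1}')
echo "$bases"   # prints 25
```

Note that Windows-style line endings (\r\n) would inflate the result, since each carriage return is counted as a byte but not as a line.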
