How to add the sample name to the end of read headers
2.8 years ago

I need to append the sample name to every read header in each FASTA file. For example, I have:

#Sample1
#>read1
#ATGC
#Sample2
#>read1
#ATGC

Desired output:

#Sample1
#>read1/Sample1
#ATGC
#Sample2
#>read1/Sample2
#ATGC

I can do it one file at a time using sed:

sed 's/read1/read1\/Sample1/g' Sample1.fasta > Sample1_tagged.fasta

However, I have hundreds of FASTA files. Any tips on how to process them all at once would be highly appreciated.

fasta relablel header • 771 views
ADD COMMENT

Are Sample1 and Sample2 file names? Without sufficient information this becomes an XY problem, and the solutions posted here will be of no use, juan.galarza. If they are separate files:

$ awk -v OFS="\n" '/^>/ {getline seq} {print $0"/"FILENAME,seq}' Sample*

or

$ sed -e ' />/ F' Sample* | paste  - - - | awk '{print $2"/"$1"\n"$3}'

>read1/Sample1
ATGC
>read1/Sample2
ATGC

input files (Sample1 and Sample2)

$ tail -n+1 Sample*
==> Sample1 <==
>read1
ATGC

==> Sample2 <==
>read1
ATGC
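
The awk one-liner above assumes every record is exactly two lines (a header and a single sequence line). Below is a minimal sketch of a variant that only touches header lines, so wrapped (multi-line) sequences pass through unchanged. It assumes a POSIX awk; `tag_headers` is a hypothetical helper name, not something from this thread or from any tool.

```shell
# tag_headers FILE...: hypothetical helper, not part of the original answer.
# Appends "/<file name>" to every FASTA header line and prints all other
# lines unchanged, so sequences wrapped over several lines are preserved.
tag_headers() {
    awk '/^>/ { print $0 "/" FILENAME; next } { print }' "$@"
}
```

Calling `tag_headers Sample*` writes all relabelled records to standard output, like the commands above.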
ADD REPLY
sed '/^>/s/$/\/SAMPLE/' in.fa > out.fa
ADD REPLY

This appends the fixed string "SAMPLE" to every FASTA header, which is not the output the OP wants. The OP wants to append the sample name (Sample1, Sample2, Sample3, etc.) to each sequence header. From the OP's post, it seems there are several files named Sample1, Sample2, etc., and each file contains FASTA sequences.
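
For illustration, the same sed idea can take the tag from the file name instead of a fixed string. This is a minimal sketch, assuming the sample name is the file name with its directory and extension removed, and that it contains no `|`, `&` or backslash (they would be special in the sed replacement). `tag_with_sample` is a hypothetical helper name.

```shell
# tag_with_sample FILE: hypothetical helper, not part of the original reply.
# Derives the sample name from the file name (directory and extension
# stripped) and appends "/<sample>" to every header line of FILE.
tag_with_sample() {
    sample=$(basename "$1")
    sample=${sample%%.*}
    # "|" is used as the s-command delimiter so "/" needs no escaping.
    sed "/^>/ s|\$|/${sample}|" "$1"
}
```

For example, `tag_with_sample Sample1.fa > Sample1_tagged.fa` would tag one file; a loop over files is still needed to process many.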

ADD REPLY

Thank you for your answers, cpad0112 and Pierre Lindenbaum. Indeed, I have several files named Sample1.fa, Sample2.fa, etc., each with sequences in FASTA format. I would like to append the file name to the sequence IDs within those files. For example, the sequence IDs from file Sample1.fa would be

>read1/Sample1
>read2/Sample1

and IDs from file Sample2.fa would be

>read1/Sample2
>read2/Sample2

The awk solution does this, but it writes everything to a single output. Ideally, the relabelled sequences would be written to one file per sample, i.e. all sequences from Sample1.fa to Sample1_relabel.fa, all sequences from Sample2.fa to Sample2_relabel.fa, and so on.

ADD REPLY

juan.galarza :

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.

Thank you!

ADD REPLY
2.8 years ago

Try this, juan.galarza:

$ for i in *.fa; do awk -v OFS="\n" '/^>/ {getline seq} {print $0"/"FILENAME,seq}' "$i" > "${i%%.*}_relabel.fa"; done

Note: As a precaution, back up your files and test the script on a few samples first.

If you have GNU parallel on your machine, you can try:

$ parallel  "awk -v OFS=\"\n\" '/^>/ {getline seq} {print \$0\"/\"FILENAME,seq}' {} > {.}_relabel.fa" ::: *.fa

You can also dry-run the command:

$ parallel  --dry-run "awk -v OFS=\"\n\" '/^>/ {getline seq} {print \$0\"/\"FILENAME,seq}' {} > {.}_relabel.fa" ::: *.fa
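
Note that awk's `FILENAME` includes the extension, so the commands above tag headers as `/Sample1.fa` rather than `/Sample1`, and a second run of the loop would also pick up the `_relabel.fa` outputs. Below is a minimal sketch that strips the extension and skips earlier outputs. It assumes input files are named `<sample>.fa`; `relabel_all` and its optional directory argument are hypothetical, not from this thread.

```shell
# relabel_all [DIR]: hypothetical helper, not part of the original answer.
# For every DIR/<sample>.fa, writes DIR/<sample>_relabel.fa with
# "/<sample>" appended to each header line; other lines are unchanged.
relabel_all() {
    dir=${1:-.}
    for f in "$dir"/*.fa; do
        [ -e "$f" ] || continue                      # glob matched nothing
        case $f in *_relabel.fa) continue ;; esac    # skip earlier outputs
        sample=$(basename "$f" .fa)
        awk -v s="$sample" '/^>/ { print $0 "/" s; next } { print }' \
            "$f" > "$dir/${sample}_relabel.fa"
    done
}
```

Running `relabel_all` with no argument processes the current directory. The earlier advice still applies: try it on a copy of a few samples first.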
ADD COMMENT

Thank you! The for loop did the trick. I didn't try the parallel options since I don't have GNU parallel on my machine.

ADD REPLY

Can you elaborate on why you do not have it? Is your reason covered at https://oletange.wordpress.com/2018/03/28/excuses-for-not-installing-gnu-parallel/

ADD REPLY
