Question: How to add the sample name to the end of read headers
7 months ago by
juan.galarza0 wrote:

I need to add the sample name to the end of every read header in each FASTA sample. For example, I have:

#Sample1
#>read1
#ATGC
#Sample2
#>read1
#ATGC

Desired output:

#Sample1
#>read1/Sample1
#ATGC
#Sample2
#>read1/Sample2
#ATGC

I can do it one file at a time using sed:

sed 's/read1/read1\/Sample1/g' Sample1.fasta > Sample1_tagged.fasta

However, I have hundreds of FASTA files. Any tips on how to do them all at once would be highly appreciated.

ADD COMMENTlink modified 5 months ago by Biostar ♦♦ 20 • written 7 months ago by juan.galarza0

Are Sample1 and Sample2 file names? Without sufficient information this becomes an XY problem, and the solutions posted here will be of no use, juan.galarza. If they are in different files:

$ awk -v OFS="\n" '/^>/ {getline seq} {print $0"/"FILENAME,seq}' Sample*

or

$ sed -e ' />/ F' Sample* | paste  - - - | awk '{print $2"/"$1"\n"$3}'

>read1/Sample1
ATGC
>read1/Sample2
ATGC

input files (Sample1 and Sample2)

$ tail -n+1 Sample*
==> Sample1 <==
>read1
ATGC

==> Sample2 <==
>read1
ATGC
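
Both one-liners above read a header plus exactly one sequence line per record, so they fit the two-line files shown here. If a sequence is wrapped over several lines, a variant that rewrites only the header lines may be safer. The following is a minimal sketch under that assumption; `tag_headers` is a hypothetical helper name, not something from this thread.

```shell
#!/bin/sh
# tag_headers FILE...: append "/<file name>" to every FASTA header line.
# Only lines starting with ">" are changed; sequence lines pass through
# untouched, so wrapped (multi-line) records keep their layout.
# (tag_headers is an illustrative name, not an existing tool.)
tag_headers() {
    awk '/^>/ { $0 = $0 "/" FILENAME } { print }' "$@"
}
```

Called as `tag_headers Sample*`, it prints all relabelled records to standard output, like the commands above.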
ADD REPLYlink modified 7 months ago • written 7 months ago by cpad011211k
sed '/^>/s/$/\/SAMPLE/' in.fa > out.fa
ADD REPLYlink written 7 months ago by Pierre Lindenbaum118k

This would append the string "SAMPLE" to each FASTA header, which is different from the OP's intended output. The OP wants to append the sample name (Sample1, Sample2, Sample3, etc.) to each sequence header. From the OP's post, it seems there are several files named Sample1, Sample2, etc., and each file contains FASTA sequences.
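
To illustrate the difference, here is a rough sketch of the same sed expression with the fixed string swapped for the file's own name. It assumes the file name contains no `|`, `&` or backslash characters, and `tag_with_sed` is a hypothetical function name used only for this example.

```shell
#!/bin/sh
# tag_with_sed FILE: print FILE with "/<file name>" appended to each header.
# "|" is used as the sed delimiter so the "/" in the replacement needs no
# escaping. Assumption: the file name has no "|", "&" or backslash in it.
# (tag_with_sed is an illustrative name, not an existing tool.)
tag_with_sed() {
    name=$(basename "$1")
    sed "/^>/s|\$|/${name}|" "$1"
}
```

For instance, `tag_with_sed Sample1 > tagged_Sample1` would write the relabelled copy to a new file; the `tagged_` prefix is just a placeholder.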

ADD REPLYlink modified 7 months ago • written 7 months ago by cpad011211k

Thank you for your answers, cpad0112 and Pierre Lindenbaum. Indeed, I have several files named Sample1.fa, Sample2.fa, etc., each with sequences in FASTA format. I would like to append the file name to the sequence IDs within those files. For example, the sequence IDs from file Sample1.fa would be

>read1/Sample1
>read2/Sample1

and IDs from file Sample2.fa would be

>read1/Sample2
>read2/Sample2

The awk solution does this; however, it produces a single output. Ideally, I would like the relabelled sequences written to a separate file for each input, i.e. all sequences from Sample1.fa written to Sample1_relabel.fa, sequences from Sample2.fa written to Sample2_relabel.fa, etc.

ADD REPLYlink modified 7 months ago by genomax65k • written 7 months ago by juan.galarza0

juan.galarza :

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.

Thank you!

ADD REPLYlink written 7 months ago by genomax65k
7 months ago by
cpad011211k
India
cpad011211k wrote:

Try this, juan.galarza:

$ for i in *.fa; do awk -v OFS="\n" '/^>/ {getline seq} {print $0"/"FILENAME,seq}' "$i" > "${i%%.*}_relabel.fa"; done

Note: As a precaution, back up your files and run the script on a few samples first.
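
One detail worth checking in the loop above: awk's FILENAME holds the name exactly as it was passed on the command line, so for Sample1.fa the headers would end in /Sample1.fa rather than /Sample1. The getline approach also assumes two-line records. Below is a minimal sketch that passes the extension-stripped name to awk, leaves wrapped sequence lines untouched, and skips files produced by an earlier run. The function name `relabel_all` and the variable `tag` are hypothetical, and the sketch assumes sample names contain no backslashes (awk's `-v` interprets escape sequences).

```shell
#!/bin/sh
# relabel_all DIR: for every DIR/<sample>.fa, write DIR/<sample>_relabel.fa
# with "/<sample>" appended to each header line.
# (relabel_all and tag are illustrative names, not from the thread.)
relabel_all() {
    for f in "$1"/*.fa; do
        [ -e "$f" ] || continue                      # glob matched nothing
        case "$f" in *_relabel.fa) continue ;; esac  # skip earlier outputs
        sample=$(basename "$f" .fa)                  # Sample1.fa -> Sample1
        # Rewrite header lines only; sequence lines are printed as they are.
        awk -v tag="$sample" '/^>/ { $0 = $0 "/" tag } { print }' "$f" \
            > "$1/${sample}_relabel.fa"
    done
}
```

Running `relabel_all .` in the directory holding the .fa files would produce one `_relabel.fa` file per sample.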

If you have GNU parallel on your machine, you can try:

$ parallel  "awk -v OFS=\"\n\" '/^>/ {getline seq} {print \$0\"/\"FILENAME,seq}' {} > {.}_relabel.fa" ::: *.fa

You can also dry-run the command first:

$ parallel  --dry-run "awk -v OFS=\"\n\" '/^>/ {getline seq} {print \$0\"/\"FILENAME,seq}' {} > {.}_relabel.fa" ::: *.fa
ADD COMMENTlink modified 7 months ago • written 7 months ago by cpad011211k

Thank you! The for loop did the trick. I didn't try the parallel options since I don't have GNU parallel on my machine.

ADD REPLYlink written 7 months ago by juan.galarza0

Can you elaborate on why you do not have that? Is your reason covered at https://oletange.wordpress.com/2018/03/28/excuses-for-not-installing-gnu-parallel/ ?

ADD REPLYlink written 7 months ago by ole.tange3.4k