Question: Rename fasta headers from CSV
SaltedPork wrote:

I have a CSV file that looks like this:

   201200175,A/name1/175/2012
   201200287,A/name2/287/2012
   201200845,A/name3/845/2012

Currently my fasta headers look like:

>201200175_AA
>201200175_AB
>201200175_BB

and I want to change it to:

>A/name1/175/2012_AA
>A/name1/175/2012_AB
>A/name1/175/2012_BB

I want to preserve the suffix (_AA, etc.). I have multiple FASTA files, and they are all multi-FASTA files.

I was wondering if there is a quicker way in bash rather than writing out some Perl...

bash python perl • 124 views
modified 26 days ago by SMK • written 26 days ago by SaltedPork

Do the fasta files have linebreaks in seqs? If not, then perhaps e.g.

join -1 1 -2 1 -t $'\t' \
    <(sed 's/,/\t/' file.csv | sort -k1,1) \
    <(paste - - <file.fa | sed -e 's/_/\t/' -e 's/>//' | sort -k1,1) \
    | awk 'BEGIN{FS="\t"}{print ">"$2"_"$3"\n"$4}'
modified 26 days ago • written 26 days ago by 5heikki
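
If the sequences might wrap across several lines, the two-lines-per-record assumption behind paste - - no longer holds. Below is a minimal Python sketch of a line-by-line alternative. It assumes the CSV holds ID,name pairs (possibly with leading whitespace) and that the suffix starts at the first underscore of each header. The names load_mapping and rename_headers are illustrative only and do not come from any library.

```python
import csv
import io


def load_mapping(csv_text):
    """Build {old_id: new_name} from CSV text, stripping stray whitespace."""
    mapping = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) >= 2:  # skip blank or malformed rows
            mapping[row[0].strip()] = row[1].strip()
    return mapping


def rename_headers(fasta_lines, mapping):
    """Yield FASTA lines with the ID part of each header replaced.

    Headers whose ID is not in the mapping are passed through unchanged.
    """
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Split ">201200175_AA" into "201200175", "_", "AA"
            old_id, sep, suffix = line[1:].partition("_")
            if old_id in mapping:
                line = ">" + mapping[old_id] + sep + suffix
        yield line


# Example data modelled on the question (note the leading spaces in the CSV)
csv_text = "   201200175,A/name1/175/2012\n   201200287,A/name2/287/2012\n"
fasta = [">201200175_AA", "AGCTAGCT", "AGCT", ">999_ZZ", "GGCC"]
renamed = list(rename_headers(fasta, load_mapping(csv_text)))
print("\n".join(renamed))
```

For real files, the same two functions could be fed from open() handles, writing one output file per input FASTA.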

Thanks, no they don't.

written 26 days ago by SaltedPork
SMK (Ghent, Belgium) wrote:

Hi SaltedPork,

Try seqkit:

$ cat changes.csv
201200175,A/name1/175/2012
201200287,A/name2/287/2012
201200845,A/name3/845/2012

$ cat sample.fasta
>201200175_AA
AGCTAGCTAGCTGCATGCTGCATGCTACG
>201200175_AB
AGCTGCATGCTAGCTGATCGTAGCTAGCT
>201200175_BB
GCTAGCTAGCTGCATGCTAGCTAGCTGCT

$ seqkit replace --kv-file <(sed "s/,/\t/g" changes.csv) --pattern "^(\d+)_(\w+)" --replacement "{kv}_\${2}" sample.fasta
[INFO] read key-value file: /dev/fd/63
[INFO] 3 pairs of key-value loaded
>A/name1/175/2012_AA
AGCTAGCTAGCTGCATGCTGCATGCTACG
>A/name1/175/2012_AB
AGCTGCATGCTAGCTGATCGTAGCTAGCT
>A/name1/175/2012_BB
GCTAGCTAGCTGCATGCTAGCTAGCTGCT
written 26 days ago by SMK

Thanks for this answer, very clear.

I am getting fasta headers like >_AA, >_BB. The bit from the CSV is not there.

Could you explain what the sed and regex bits are doing? Sed is replacing commas with tabs in the csv, is this because seqkit doesn't handle the commas? Also, what is the _ doing in the pattern bit?

written 26 days ago by SaltedPork

Hi SaltedPork,

Sed is replacing commas with tabs in the csv, is this because seqkit doesn't handle the commas?

Yes, as seqkit replace -h shows:

  -k, --kv-file string   tab-delimited key-value file for replacing key with value when using "{kv}" in -r (--replacement) (only for sequence name)

"^(\d+)_(\w+)" is the regex for 201200175_AA, 201200175_AB, and 201200175_BB. It's the part that you want to replace. _ is the symbol between 201200175 and AA.

written 26 days ago by SMK
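
To make the capture groups concrete, here is a small Python sketch that applies the same regular expression to one header. It only imitates the lookup: it assumes the first capture group is used as the key (which, as far as I can tell, is seqkit's default for {kv}) and that the second group is kept as the suffix. It is not seqkit's implementation, and the dictionary kv is a stand-in for the tab-delimited key-value file.

```python
import re

pattern = re.compile(r"^(\d+)_(\w+)")
kv = {"201200175": "A/name1/175/2012"}  # stand-in for the key-value file

m = pattern.match("201200175_AA")
key = m.group(1)      # digits before the underscore: the lookup key
suffix = m.group(2)   # word characters after the underscore: the suffix to keep

# Mirrors the replacement string '{kv}_${2}': looked-up value, "_", second group
new_header = kv[key] + "_" + suffix
print(new_header)
```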

Thanks, but my headers don't have the second value in them, just the _AA. Any ideas? I've tried playing with the regex but no luck.

written 26 days ago by SaltedPork

Ah... Sorry, I didn't realize that you have leading spaces in your CSV. Try this:

$ seqkit replace --kv-file <(sed -r "s/^\s+//g; s/,/\t/g" changes.csv) --pattern "^(\d+)_(\w+)" --replacement '{kv}_${2}' sample.fasta

written 26 days ago by SMK
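
The >_AA headers reported earlier are consistent with a key mismatch: with leading spaces, the key read from the file would be "   201200175", which never equals the captured "201200175". The Python sketch below illustrates that mismatch and what the added sed expression s/^\s+// changes. The empty-string fallback on a missed lookup is an assumption chosen to mimic the observed headers, not a statement about seqkit's internals.

```python
import re

raw_line = "   201200175,A/name1/175/2012"

# Without stripping: the key keeps its leading spaces, so the lookup misses.
key, value = raw_line.replace(",", "\t").split("\t")
kv_bad = {key: value}
bad_header = kv_bad.get("201200175", "") + "_AA"
print(repr(bad_header))

# Rough equivalent of: sed -r "s/^\s+//g; s/,/\t/g"
cleaned = re.sub(r"^\s+", "", raw_line).replace(",", "\t")
key, value = cleaned.split("\t")
kv_good = {key: value}
good_header = kv_good.get("201200175", "") + "_AA"
print(good_header)
```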