I want to rename sequence headers
12 months ago
Riyad • 0

I have assembled transcript files containing thousands of sequences, with headers like:

>TRINITY_DN50_c0_g2_i1 len=1961 path=[0:0-1960]
>TRINITY_DN59_c0_g1_i2 len=1961 path=[0:0-1960]

But I want to rename them like this:

>TRINITY_1
>TRINITY_2

All sequences should keep the TRINITY prefix, followed by a sequential number. The total number of sequences is 40000.

Fasta • 1.5k views
  1. What file format?

  2. What programming language?

  3. Have you stored the information in the header in a separate location?


In addition to what Mensur said, I would also state that renaming is not recommended because the string carries meaning. You will, for example, not be able to extract the longest isoform per gene from the edited file, and it will make reproducing subsequent analysis harder. Most tools should be able to deal with the Trinity identifiers. Unless a tool definitely does not support them, I would leave them as they are.
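
To make the point about the identifiers concrete, here is a minimal Python sketch (not part of the original reply) that reads the gene and isoform out of Trinity-style headers and picks the longest isoform per gene. It assumes the layout shown in the question, where everything up to `_g<number>` identifies the gene and `_i<number>` the isoform; the function names and the regular expression are illustrative and do not come from Trinity itself.

```python
import re

# Assumed header layout, taken from the examples in the question:
#   >TRINITY_DN50_c0_g2_i1 len=1961 path=[0:0-1960]
HEADER_RE = re.compile(
    r"^>?(?P<gene>\S+_g\d+)_(?P<isoform>i\d+)\s+len=(?P<len>\d+)"
)


def parse_trinity_header(header):
    """Return (gene_id, isoform_id, length) for one Trinity-style header."""
    m = HEADER_RE.match(header)
    if m is None:
        raise ValueError(f"not a Trinity-style header: {header!r}")
    return m.group("gene"), m.group("isoform"), int(m.group("len"))


def longest_isoform_per_gene(headers):
    """Map each gene ID to the (isoform_id, length) of its longest isoform."""
    best = {}
    for header in headers:
        gene, isoform, length = parse_trinity_header(header)
        # Keep the first isoform seen when lengths tie.
        if gene not in best or length > best[gene][1]:
            best[gene] = (isoform, length)
    return best
```

Once the headers are reduced to TRINITY_1, TRINITY_2 and so on, this grouping can no longer be done from the headers alone.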


Thanks, Michael. I appreciate these suggestions.

12 months ago
Mensur Dlakic ★ 27k

Many posters think that their problems are unique, but in most cases that's not true. Yours, in particular, is one of the most frequently discussed problems. That means that searching for "rename fasta header" from the main page will give you numerous solutions.

https://www.biostars.org/post/search/?query=rename+fasta+header


Thanks, Mensur.

12 months ago
Mark ★ 1.5k

Use seqkit replace, assuming your file name is trinity.fasta:

seqkit replace trinity.fasta -p "(.+)" -r "TRINITY_{nr}" > trinity.renamed.fasta

Where:

  • -p "(.+)" is the match pattern to match the whole header text
  • -r "TRINITY_{nr}" is the replacement pattern, where {nr} adds the record number.

See https://bioinf.shenwei.me/seqkit/usage/#replace for more information
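
If seqkit is not available, the same renaming can be sketched in plain Python. This only illustrates what the {nr} placeholder does (numbering records in file order) and is not how seqkit is implemented; `rename_fasta` is a hypothetical helper name. It also returns the old headers, so they can be saved as a lookup table and the information in the original identifiers is not lost.

```python
def rename_fasta(lines, prefix="TRINITY_"):
    """Rename FASTA headers to prefix + record number.

    `lines` is any iterable of FASTA lines (for example an open file).
    Returns (renamed_lines, mapping), where mapping is a list of
    (new_id, old_header) pairs in file order.
    """
    renamed = []
    mapping = []
    record_number = 0
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A header line starts a new record: number it and
            # remember the original header text (without the ">").
            record_number += 1
            new_id = f"{prefix}{record_number}"
            mapping.append((new_id, line[1:]))
            renamed.append(">" + new_id)
        else:
            # Sequence lines are passed through unchanged.
            renamed.append(line)
    return renamed, mapping
```

For a real file, you could pass an open file handle such as `open("trinity.fasta")` as `lines`, then write `renamed` to the new FASTA file and `mapping` to a tab-separated table.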


Thanks, Mark. It's working nicely now with seqkit.


Please mark the answer as correct.

12 months ago

R version

library(Biostrings)
# Read the assembled transcripts
fa <- readDNAStringSet('your.fasta')
# Replace each header with TRINITY_<record number>
names(fa) <- paste0('TRINITY_', seq_along(fa))
writeXStringSet(fa, 'your_new.fasta', format = 'fasta')
12 months ago
Hugo ▴ 380

You can also use SEDA (https://www.sing-group.org/seda). Specifically, you would use the "Rename header" operation first, to keep the "TRINITY" part of the headers using the "Multipart header" rename type.

Then, you would use the "Rename header" operation again, but this time with the "Add prefix/suffix" rename type to add the indexes.

We will soon release a new SEDA version that comes with a CLI.
