How do I remove certain sequences from a FASTA file based on the header?
3.8 years ago
tianshenbio ▴ 170

I have a fasta file like this:

>XM_0000001.1 
actact
>XR_0000001.1
atcatc

How do I remove all the sequences with an XR header?

I only want to keep:

>XM_0000001.1
actact
RNA-Seq sequence fasta • 3.4k views
3.8 years ago

If you are on Linux, this is easy.

  1. grep '^>' file.fa | sed 's/^>//' > file.fa.id
  2. grep -v '^XR_' file.fa.id > file.fa.id.final
  3. seqtk subseq file.fa file.fa.id.final > final.fa

PS: seqtk is a separate tool that you need to install.
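
Whichever route you take, it may be worth checking the result before using it. A minimal sketch, assuming accessions look like XM_... / XR_... as in the question; count_header_prefixes is a made-up helper name, not part of seqtk:

```shell
# Tally FASTA headers by accession prefix (the text before the first "_").
# count_header_prefixes is a hypothetical helper name used for illustration.
count_header_prefixes() {
    grep '^>' "$1" | sed 's/^>//' | cut -d_ -f1 | sort | uniq -c
}

# Usage: count_header_prefixes final.fa
```

If the filtering worked, the tally for final.fa should contain no XR line.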

edit:formatting.

3.8 years ago

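Try GNU sed (e.g. on Ubuntu/Mint). This deletes each XR header plus the one line after it, so it assumes every sequence sits on a single line:

$ sed -e '/^>XR/,+1d' test.fa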

If you have a multi-line FASTA, use seqkit:

$ seqkit grep -rvip "^XR" test.fa
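
If installing seqkit is not an option, plain awk can probably handle the multi-line case too. A rough sketch, not a tested replacement for the commands above; filter_fasta_by_prefix is a made-up helper name, and the prefix is matched literally at the start of the header (case-sensitive, unlike the -i flag above):

```shell
# Drop every FASTA record whose header starts with a given prefix.
# Works for single-line and wrapped (multi-line) sequences.
# filter_fasta_by_prefix is a hypothetical helper name used for illustration.
filter_fasta_by_prefix() {
    prefix="$1"   # accession prefix to remove, e.g. XR_
    fasta="$2"    # path to the input FASTA file
    awk -v p=">$prefix" '
        # On a header line, decide whether the whole record is kept.
        /^>/ { keep = (index($0, p) != 1) }
        # Print header and sequence lines only while inside a kept record.
        keep
    ' "$fasta"
}

# Usage: filter_fasta_by_prefix XR_ test.fa > final.fa
```

The flag set on each header line carries over to the sequence lines that follow, which is why wrapped sequences are removed together with their header.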
3.8 years ago
Hugo ▴ 380

You can try SEDA (https://www.sing-group.org/seda/). The Pattern filtering operation (https://www.sing-group.org/seda/manual/operations.html#pattern-filtering) would allow you to do this if you configure a Not contains pattern with the "^XR_" text.
