How do I remove certain sequences from a FASTA file based on the header?
16 months ago
tianshenbio ▴ 120

I have a fasta file like this:

>XM_0000001.1 
actact
>XR_0000001.1
atcatc

How do I remove all the sequences with an XR header?

I only want to keep:

>XM_0000001.1
actact
RNA-Seq sequence fasta • 678 views
16 months ago

If you are on Linux, this is easy.

  1. Step 1: grep '^>' file.fa | sed 's/^>//' > file.fa.id
  2. Step 2: grep -v '^XR_' file.fa.id > file.fa.id.final
  3. Step 3: seqtk subseq file.fa file.fa.id.final > final.fa

PS: seqtk is a separate tool that you need to install.

Edit: formatting.

16 months ago

Try GNU sed on Ubuntu/Mint (this assumes each sequence sits on a single line):

$ sed  -e '/^>XR/,+1d' test.fa

If you have a multi-line FASTA, use seqkit:

$ seqkit grep -rvip "^XR" test.fa
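
If neither seqtk nor seqkit is installed, plain awk should also do the job, including for multi-line records. This is only a minimal sketch: it assumes the headers to drop start with ">XR_", and the file names test.fa and filtered.fa are just examples, not anything from the original post.

```shell
# Build a small multi-line example file (test.fa is a made-up name).
cat > test.fa <<'EOF'
>XM_0000001.1 example mRNA
actact
gggccc
>XR_0000001.1 example ncRNA
atcatc
tttaaa
>XM_0000002.1
ttgttg
EOF

# On every header line, set keep to 0 if the header starts with ">XR_",
# otherwise to 1. The bare "keep" pattern then prints each line (header
# or sequence) only while keep is non-zero.
awk '/^>/ { keep = ($0 !~ /^>XR_/) } keep' test.fa > filtered.fa

cat filtered.fa
```

Because the flag is only updated on header lines, all sequence lines that follow an XR header are skipped until the next header, however many lines the record spans.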
16 months ago
Hugo ▴ 340

You can try SEDA (https://www.sing-group.org/seda/). The Pattern filtering operation (https://www.sing-group.org/seda/manual/operations.html#pattern-filtering) would allow you to do this if you configure a Not contains pattern with the "^XR_" text.
