grep command for fasta header
2.5 years ago
harry ▴ 30

I used this command:

grep -Fw -A 1 -f header.txt test.fa >test_result.fa

But it extracts only one header, not all of the headers that are listed in my header.txt file.

My header.txt file looks like this:

hsa_circ_0000006
hsa_circ_0000014
hsa_circ_0000015
hsa_circ_0000042
hsa_circ_0000070
hsa_circ_0000072
hsa_circ_0000131
hsa_circ_0000133
hsa_circ_0000160
hsa_circ_0000175
hsa_circ_0000211
hsa_circ_0000219
hsa_circ_0000231
hsa_circ_0000233
hsa_circ_0000236
hsa_circ_0000258

My test.fa file looks like this:

>hsa_circ_0000001|chr1:1080738-1080845-|None|None
GGTCGGCCATGAAGGTGGTGGGGGTCATGAGGTCACAAGGGGGTCGGCCATGTGATGGGGTTGGGTCAGCCGTGCGGTCAGGTCAGGTCGGCCATGAGGTCAGGTGG
>hsa_circ_0000002|chr1:1158623-1159348-|NM_016176|SDF4
CTGACGGGGACGGTCACGTGTCTTGGGACGAGTATAAGGTGAAGTTTTTGGCGAGTAAAGGCCATAGCGAGAAGGAGGTTGCCGACGCCATCAGGCTCAACGAGGAACTCAAAGTGGATGAGGAAAGGTGGATGTGAACACTGACCGGAAGATCAGTGCCAAGGAGATGCAGCGCTGGATCATGGAGAAGACGGCCGAGCACTTCCAGGAGGCCATGGAGGAGAGCAAGACACACTTCCGCGCCGTGGACC

>hsa_circ_0000014|chr1:9991948-9994918-|NM_032368|LZIC
CTGTACACTCAACAGAAAGTGGAGATACTAACAGCTCTTAGGAAACTTGGAGAGAAGCTGACTGCAGATGATGAGGCCTTCTTGTCAGCAAATGCAGGTGCTATACTCAGCCAGTTTGAGAAAGTCTCTACAGACCTTGGCTATTCAGGCAGCTATCAGCCAGGCCTTTAAAACCCCAGAGGTCATCAGATTGTTTGCAAAGAAACAACCAGGTCAGCTTCGGACAAGGTTAGCAGAGATGGATAGAGATCTGATGGTAGGAAAGCTGGAAAGAGAC

Please tell me what I am doing wrong. Thanks in advance.

fasta • 2.4k views

Try

$ seqkit -w 0 grep -f header.txt -irp "(.*)|chr" test.fa

Thanks, it works for me.


But it extracts only 1 header, not the whole which are present in my header.txt file.

What does that mean?


It means I get only one FASTA sequence for my whole header.txt file.


And is there supposed to be more than one match? (In your test file, only one matches ;) )


Yes, but I got only one sequence.


It should not make a difference (in theory), but can you try with >> instead of > in your command line?


Your original command should work fine. There must be something else that is odd with your file.

$ more t.fa
>hsa_circ_0000001|chr1:1080738-1080845-|None|None
GGTCGGCCATGAAGGTGGTGGGGGTCATGAGGTCACAAGGGGGTCGGCCATGTGATGGGGTTGGGTCAGCCGTGCGGTCAGGTCAGGTCGGCCATGAGGTCAGGTGG
>hsa_circ_0000002|chr1:1158623-1159348-|NM_016176|SDF4
CTGACGGGGACGGTCACGTGTCTTGGGACGAGTATAAGGTGAAGTTTTTGGCGAGTAAAGGCCATAGCGAGAAGGAGGTTGCCGACGCCATCAGGCTCAACGAGGAACTCAAAGTGGATGAGGAAAGGTGGATGTGAACACTGACCGGAAGATCAGTGCCAAGGAGATGCAGCGCTGGATCATGGAGAAGACGGCCGAGCACTTCCAGGAGGCCATGGAGGAGAGCAAGACACACTTCCGCGCCGTGGACC
>hsa_circ_0000001|chr1:1080738-1080845-|None|None
GGTCGGCCATGAAGGTGGTGGGGGTCATGAGGTCACAAGGGGGTCGGCCATGTGATGGGGTTGGGTCAGCCGTGCGGTCAGGTCAGGTCGGCCATGAGGTCAGGTGG
>hsa_circ_0000002|chr1:1158623-1159348-|NM_016176|SDF4
CTGACGGGGACGGTCACGTGTCTTGGGACGAGTATAAGGTGAAGTTTTTGGCGAGTAAAGGCCATAGCGAGAAGGAGGTTGCCGACGCCATCAGGCTCAACGAGGAACTCAAAGTGGATGAGGAAAGGTGGATGTGAACACTGACCGGAAGATCAGTGCCAAGGAGATGCAGCGCTGGATCATGGAGAAGACGGCCGAGCACTTCCAGGAGGCCATGGAGGAGAGCAAGACACACTTCCGCGCCGTGGACC

$ more head.txt
hsa_circ_0000002

$ grep -Fw -A 1 -f head.txt t.fa
>hsa_circ_0000002|chr1:1158623-1159348-|NM_016176|SDF4
CTGACGGGGACGGTCACGTGTCTTGGGACGAGTATAAGGTGAAGTTTTTGGCGAGTAAAGGCCATAGCGAGAAGGAGGTTGCCGACGCCATCAGGCTCAACGAGGAACTCAAAGTGGATGAGGAAAGGTGGATGTGAACACTGACCGGAAGATCAGTGCCAAGGAGATGCAGCGCTGGATCATGGAGAAGACGGCCGAGCACTTCCAGGAGGCCATGGAGGAGAGCAAGACACACTTCCGCGCCGTGGACC
--
>hsa_circ_0000002|chr1:1158623-1159348-|NM_016176|SDF4
CTGACGGGGACGGTCACGTGTCTTGGGACGAGTATAAGGTGAAGTTTTTGGCGAGTAAAGGCCATAGCGAGAAGGAGGTTGCCGACGCCATCAGGCTCAACGAGGAACTCAAAGTGGATGAGGAAAGGTGGATGTGAACACTGACCGGAAGATCAGTGCCAAGGAGATGCAGCGCTGGATCATGGAGAAGACGGCCGAGCACTTCCAGGAGGCCATGGAGGAGAGCAAGACACACTTCCGCGCCGTGGACC
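
One side note on the output above: when two hits are not adjacent, `grep -A 1` prints a `--` separator line between them, and that line is not valid FASTA. A minimal sketch of one way to drop it follows. The records and IDs (`id1`, `id3`) are made up for illustration, and the temporary directory is a placeholder. GNU grep also offers `--no-group-separator` for the same purpose, but the filter below does not depend on that option being available. Like the original command, this assumes each sequence sits on a single line.

```shell
# Sketch with made-up records; t.fa and head.txt mirror the file names used above.
tmp=$(mktemp -d)
printf '>id1|a\nAAAA\n>id2|b\nCCCC\n>id3|c\nGGGG\n' > "$tmp/t.fa"
printf 'id1\nid3\n' > "$tmp/head.txt"

# The two hits are not adjacent, so grep -A 1 puts a "--" line between them.
# Dropping lines that consist only of "--" leaves plain FASTA.
grep -Fw -A 1 -f "$tmp/head.txt" "$tmp/t.fa" | grep -v '^--$' > "$tmp/result.fa"

cat "$tmp/result.fa"
```

A real sequence line never consists of just `--`, so the filter should not remove any data.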

Indeed (it would have been strange otherwise). Anyway:

Could it be a disk space problem?


I understand that only the headers are displayed but not the DNA sequence(?). The very same command works on my machine.

What is the output of

file header.txt  test.fa

If it's not a plain ASCII file but one with CR/LF line terminators, then you're working with Windows files. https://en.wikipedia.org/wiki/Newline#Issues_with_different_newline_formats
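
To illustrate the line-ending issue, here is a minimal sketch using made-up two-record files (the IDs `id1` and `id2` and the temporary directory are placeholders, not the poster's data). With CR/LF endings, each pattern read by `grep -f` carries a trailing carriage return, so it cannot match a header. A final line without any terminator still matches, which would be consistent with getting only the last ID. Whether that is what happened here is an assumption until the `file` output is posted.

```shell
# Sketch: reproduce the CR/LF problem with tiny made-up files in a temp dir.
tmp=$(mktemp -d)
printf '>id1|x\nAAAA\n>id2|y\nCCCC\n' > "$tmp/test.fa"
# ID list with Windows line endings; the last line has no terminator at all.
printf 'id1\r\nid2' > "$tmp/header.txt"

# "id1" followed by a carriage return never occurs in test.fa,
# so only the last ID matches.
crlf_hits=$(grep -cFw -f "$tmp/header.txt" "$tmp/test.fa")

# Strip the carriage returns, then retry with the cleaned list.
tr -d '\r' < "$tmp/header.txt" > "$tmp/header.unix.txt"
unix_hits=$(grep -cFw -f "$tmp/header.unix.txt" "$tmp/test.fa")

echo "with CR/LF: $crlf_hits, after tr: $unix_hits"
```

If `file header.txt` reports CRLF line terminators, running the same `tr -d '\r'` step on the real header.txt before the original grep command may be all that is needed.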


I get just one FASTA sequence for my whole header.txt file, and it is the one listed last in header.txt.


Can you execute the command Pierre Lindenbaum asked for and post its output here? Thanks.


On top of GenoMax's comment: if only one, then which one? The first one? The last one?


The last one.
