fasta files: keep only lines containing certain characters + the respective sequences
3
0
Entering edit mode
6.8 years ago

Hi. I have a fasta file containing 18S and 16S sequences. Can I use sed or grep to find all headers containing 18S and print a new fasta file that includes only these headers and the respective sequences?

Edit: What I have:

>GQ118297.1 Tricorythus sp. BYU IGCEP206 16S ribosomal RNA gene, partial sequence; mitochondrial
TTGTCTCTTGGAAATATAGGAGATAGGGCCTGCCCAATGAAATTTCAATGGCCGCAGTAATTTGACTGTG
CAAAGGTAGCATAATCATTAGTTTTTTAATTGAGGACTGGTATGAAAGGCATAATGAGGTACTTGTTTTC
TTAAATAAAAGAATAAAATTTTACTTTTTAGTTAAAAGGCTAAAGTAATATAAAGGGACGAGAAGACCCT
ATAGAGTTTTATAAATTAAATTAATTTATTTTAGTAAAATTAAAGAATTGATGGAATTTATTTAGTTGGG
GAGATTTTGTAATAAAACTTATAATTATATAAACATTTATATATGATTTTAAGATCCATAAATTGATTAA
AAAATTAAATTACCTTAGGGATAACAGCGTTATTTTCTTGGAGAGTTCTTATCAATAGGAAAGTTTGCGA
CCTCGATGTTGGATTAAGAAAATAGTTAAATGAAGCCGTTTAATTAATAGGTCTGTTCGACCTTTAAAAT
CTTACA
>GQ118277.1 Teloganella sp. BYU IGCEP161 18S ribosomal RNA gene, partial sequence
GCTGTCTCAGTGCAAGCCTAATTAAAGTGAAACCGCAAATGGCTCATTAAATCAGTTATGGTTCATTAGA
TGATACATTATTTTACTTGGATAACTGTGGTAATTCTAGAGCTAATACATGCAAATTGAGTTCCCATCGG
TGACGGTAGGAACGCTTTTATTAGATCAAAACCAATACGTTTGCTTCGGCATACGATTTCAATGGTGATT
CTGAATAACTTTTTGATGATCGTACGGTCCTTGTATCGACGACAAATCTTTCAAATGTCTGCCTTATCAA
CTGTCGATGGTAGGCTCTGCGCCTACCATGGTTGTAACGGGTAACGGGGAATCAGGGTTCGATTCCGGAG
AGGGAGCCTGAGAAACGGCTACCACATCCAAGGAAGGCAGCAGGCGCGCAAATTACCCACTCCCGGCACG
GGGAGGTAGTGACGAAAAATAACGATACGGGACTCATCCGAGGCCCCGTAATCGGAATGAGMACACTTTA
AATMYTTTAACRAGTAYCYAWTGGAGGGCAAGTCTGGTGCCAGCAGCCGCGGTAATTCCAGCTCCATTGG
CGTATATTAAAGTTGTTGCGGTTAAAAAGCTCGGTAGTTGGTATCATGTGTMTCGGACGGTCGGTTCGCC
NCNCNCGGTGTTCAACTGACCGGTCCGGACGTCCTGCCGGTGGGACCCGGTTCGCGCCGGGCCCCGT

What I would like to have after processing by identifying all sequences wit "18S" in the header and printing them into a new file:

>GQ118277.1 Teloganella sp. BYU IGCEP161 18S ribosomal RNA gene, partial sequence
GCTGTCTCAGTGCAAGCCTAATTAAAGTGAAACCGCAAATGGCTCATTAAATCAGTTATGGTTCATTAGA
TGATACATTATTTTACTTGGATAACTGTGGTAATTCTAGAGCTAATACATGCAAATTGAGTTCCCATCGG
TGACGGTAGGAACGCTTTTATTAGATCAAAACCAATACGTTTGCTTCGGCATACGATTTCAATGGTGATT
CTGAATAACTTTTTGATGATCGTACGGTCCTTGTATCGACGACAAATCTTTCAAATGTCTGCCTTATCAA
CTGTCGATGGTAGGCTCTGCGCCTACCATGGTTGTAACGGGTAACGGGGAATCAGGGTTCGATTCCGGAG
AGGGAGCCTGAGAAACGGCTACCACATCCAAGGAAGGCAGCAGGCGCGCAAATTACCCACTCCCGGCACG
GGGAGGTAGTGACGAAAAATAACGATACGGGACTCATCCGAGGCCCCGTAATCGGAATGAGMACACTTTA
AATMYTTTAACRAGTAYCYAWTGGAGGGCAAGTCTGGTGCCAGCAGCCGCGGTAATTCCAGCTCCATTGG
CGTATATTAAAGTTGTTGCGGTTAAAAAGCTCGGTAGTTGGTATCATGTGTMTCGGACGGTCGGTTCGCC
NCNCNCGGTGTTCAACTGACCGGTCCGGACGTCCTGCCGGTGGGACCCGGTTCGCGCCGGGCCCCGT
fasta sed grep • 3.7k views
ADD COMMENT
1
Entering edit mode

Can I use sed or grep to find all headers containing 18S and print a new fasta file that includes only these headers and the respective sequences?

no, you need something like awk.

ADD REPLY
1
Entering edit mode

Whether or not you can do this depends on the contents of the file, and how precise you need to be. In the general case, no, it is not possible. I suggest you post the headers of some of your sequences, along with as much other data as possible. And clarify whether you wish to separate sequences based on their header or contents.

ADD REPLY
1
Entering edit mode

if you know a little bit of python, Biopython SeqIO would be an easy and intuitive solution.

ADD REPLY
0
Entering edit mode

Unfortunately, I don't (yet), so Python does not help me at the moment

ADD REPLY
0
Entering edit mode

Agreed, BioPython is totally the way to go for this

ADD REPLY
0
Entering edit mode

Tmatamatamas : You have multiple options of correct answers below. Please "accept" (green check mark) as many as you wish to provide closure to this thread.

ADD REPLY
0
Entering edit mode

Hi, if you want to use grep, just try:

grep -w "18S" -A 1 input.fasta > 18S.fasta

parameter A means, that grep print one lines of trailing context. Parameter w match only whole word -in our case 18S.

EDIT: This works only if you have in header 18S and no-where else. Also you need correct format of FASTA file - first line started with > is header and second line nucleotides.

ADD REPLY
0
Entering edit mode

OP wants to keep the entire sequence where the header contains 18S word.

Edit: Moved to a comment since it is not an appropriate answer for the original question.

ADD REPLY
0
Entering edit mode

This is a Downvote:)

ADD REPLY
0
Entering edit mode

This would work if the fasta was a single-line fasta and not multi-line like the OP has.

ADD REPLY
0
Entering edit mode

st.ph.n, thank you for the clarity on the above answer. Is there any way to do this with grep?

ADD REPLY
1
Entering edit mode

Unless the fasta's are linearised, as in the above comment, then no not really. I might be possible, but would be ugly and clunky I'm sure.

ADD REPLY
4
Entering edit mode
6.8 years ago
awk '/^>/ { ok=index($0,"18S")!=0;} {if(ok) print;}'  input.fa
ADD COMMENT
0
Entering edit mode

That's it, thanks a lot!

ADD REPLY
0
Entering edit mode

Hi, I am sorry to ask after so long, but, could be possible to explain a little how it works. Thanks

ADD REPLY
1
Entering edit mode

/^>/

for lines starting with '>' ...

index($0,"18S") check the index of "18S" in the whole line

ok=index($0,"18S")!=0; set the flag ok=true if 18S is present in the whole line

{if(ok) print;} for all lines print the line if ok==true

ADD REPLY
3
Entering edit mode
6.8 years ago

Please Pierre Lindenbaum, don't underestimate sed :-)

sed -n '/^>.*18S/,/^>/ {/^>.*18S/p ; /^>/! p}' input.fa
ADD COMMENT
0
Entering edit mode

excellent ! :-)

ADD REPLY
0
Entering edit mode
6.8 years ago

my fasta:

 cat test2.fa 
>GQ118297.1 Tricorythus sp. BYU IGCEP206 16S ribosomal RNA gene, partial sequence; mitochondrial
TTGTCTCTTGGAAATATAGGAGATAGGGCCTGCCCAATGAAATTTCAATGGCCGCAGTAATTTGACTGTG
CAAAGGTAGCATAATCATTAGTTTTTTAATTGAGGACTGGTATGAAAGGCATAATGAGGTACTTGTTTTC
TTAAATAAAAGAATAAAATTTTACTTTTTAGTTAAAAGGCTAAAGTAATATAAAGGGACGAGAAGACCCT
ATAGAGTTTTATAAATTAAATTAATTTATTTTAGTAAAATTAAAGAATTGATGGAATTTATTTAGTTGGG
GAGATTTTGTAATAAAACTTATAATTATATAAACATTTATATATGATTTTAAGATCCATAAATTGATTAA
AAAATTAAATTACCTTAGGGATAACAGCGTTATTTTCTTGGAGAGTTCTTATCAATAGGAAAGTTTGCGA
CCTCGATGTTGGATTAAGAAAATAGTTAAATGAAGCCGTTTAATTAATAGGTCTGTTCGACCTTTAAAAT
CTTACA
>GQ118277.1 Teloganella sp. BYU IGCEP161 18S ribosomal RNA gene, partial sequence
GCTGTCTCAGTGCAAGCCTAATTAAAGTGAAACCGCAAATGGCTCATTAAATCAGTTATGGTTCATTAGA
TGATACATTATTTTACTTGGATAACTGTGGTAATTCTAGAGCTAATACATGCAAATTGAGTTCCCATCGG
TGACGGTAGGAACGCTTTTATTAGATCAAAACCAATACGTTTGCTTCGGCATACGATTTCAATGGTGATT
CTGAATAACTTTTTGATGATCGTACGGTCCTTGTATCGACGACAAATCTTTCAAATGTCTGCCTTATCAA
CTGTCGATGGTAGGCTCTGCGCCTACCATGGTTGTAACGGGTAACGGGGAATCAGGGTTCGATTCCGGAG
AGGGAGCCTGAGAAACGGCTACCACATCCAAGGAAGGCAGCAGGCGCGCAAATTACCCACTCCCGGCACG
GGGAGGTAGTGACGAAAAATAACGATACGGGACTCATCCGAGGCCCCGTAATCGGAATGAGMACACTTTA
AATMYTTTAACRAGTAYCYAWTGGAGGGCAAGTCTGGTGCCAGCAGCCGCGGTAATTCCAGCTCCATTGG
CGTATATTAAAGTTGTTGCGGTTAAAAAGCTCGGTAGTTGGTATCATGTGTMTCGGACGGTCGGTTCGCC
NCNCNCGGTGTTCAACTGACCGGTCCGGACGTCCTGCCGGTGGGACCCGGTTCGCGCCGGGCCCCGT

Command:

seqkit grep -r -n -i -p  18s test2.fa

output:

 seqkit grep -r -n -i -p  18s test2.fa 
>GQ118277.1 Teloganella sp. BYU IGCEP161 18S ribosomal RNA gene, partial sequence
GCTGTCTCAGTGCAAGCCTAATTAAAGTGAAACCGCAAATGGCTCATTAAATCAGTTATG
GTTCATTAGATGATACATTATTTTACTTGGATAACTGTGGTAATTCTAGAGCTAATACAT
GCAAATTGAGTTCCCATCGGTGACGGTAGGAACGCTTTTATTAGATCAAAACCAATACGT
TTGCTTCGGCATACGATTTCAATGGTGATTCTGAATAACTTTTTGATGATCGTACGGTCC
TTGTATCGACGACAAATCTTTCAAATGTCTGCCTTATCAACTGTCGATGGTAGGCTCTGC
GCCTACCATGGTTGTAACGGGTAACGGGGAATCAGGGTTCGATTCCGGAGAGGGAGCCTG
AGAAACGGCTACCACATCCAAGGAAGGCAGCAGGCGCGCAAATTACCCACTCCCGGCACG
GGGAGGTAGTGACGAAAAATAACGATACGGGACTCATCCGAGGCCCCGTAATCGGAATGA
GMACACTTTAAATMYTTTAACRAGTAYCYAWTGGAGGGCAAGTCTGGTGCCAGCAGCCGC
GGTAATTCCAGCTCCATTGGCGTATATTAAAGTTGTTGCGGTTAAAAAGCTCGGTAGTTG
GTATCATGTGTMTCGGACGGTCGGTTCGCCNCNCNCGGTGTTCAACTGACCGGTCCGGAC
GTCCTGCCGGTGGGACCCGGTTCGCGCCGGGCCCCGT
ADD COMMENT
0
Entering edit mode

seqkit can be found here.

ADD REPLY

Login before adding your answer.

Traffic: 2573 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6