How to extract fasta headers that contains specific text and it's sequence
3
0
Entering edit mode
3.2 years ago
Optimist ▴ 180

Greetings!!!

I have barrnap output of 100 Pseudomonas aeruginosa genomes.

The output looks like this (sequences have been trimmed to avoid huge lines in biostars)

>16S_rRNA::Pseudomonas_aeruginosa_PAOC_Seq_1:6516148-6517679(-)
TGAAGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGGATGAAGGGAGCTTGCTCCTGGATTCAGCGGCGGACGGGTGAG
>23S_rRNA::Pseudomonas_aeruginosa_PAOC_Seq_1:6512786-6515674(-)
TCAAGTGAAGAAGCGCATACGGTGGATGCCTTGGCAGTCAGAGGCGATGAAAGACGTGGTAGCCTGCGAAAAGCT
>5S_rRNA::Pseudomonas_aeruginosa_PAOC_Seq_1:6512529-6512639(-)
TGACGATCATAGAGCGTTGGAACCACCTGATCCCTTCCCGAACTCAGAAGTGA

I want to extract only 16s rRNA headers and sequences from all the outputs.

Result output should look like this

>16S_rRNA::Pseudomonas_aeruginosa_PAOC_Seq_1:6516148-6517679(-) 
TGAAGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGGATGAAGGGAGCTTGCTCCTGGATTCAGCGGCGGACGGGTGAG

How can I get this output

Thank you all

barrnap 16srRNA fasta • 2.4k views
ADD COMMENT
2
Entering edit mode

What have you tried so far?

did you had a look at for instance seqtk or SeqKit ?

ADD REPLY
0
Entering edit mode

filter_fasta.py has worked.

ADD REPLY
0
Entering edit mode

with awk and flattend fasta:

awk '/^>16S/{print;getline;print}' seq.fa
ADD REPLY
0
Entering edit mode

Hi optimist,

can you share your script? how did you used barrnap on 100 P.aeruginosa genome ?

Thank you so much!

ADD REPLY
2
Entering edit mode
3.2 years ago
Optimist ▴ 180

I have found a solution to this.

seqkit rmdup -n -i -j <threads> <infilename> > outfilename

Thank you all for the responses

Cheers Have a great time!!!

ADD COMMENT
0
Entering edit mode

This would remove dups by sequence ID, not by sequence. Try using seqkit rmdup -s -i -j <threads> <infilename> > outfilename

ADD REPLY
2
Entering edit mode
3.2 years ago
5heikki 11k

If there are no linebreaks in the sequences I would do it like this:

paste - - <file.fasta | awk 'BEGIN{FS="\t";OFS="\n"}{if($1~/16S/){print $1,$2}}'

If you really have duplicate headers, i.e. identical strings then you can (note if the sequences under identical headers differ you will lose that information here):

paste - - <file.fasta | awk 'BEGIN{OFS=FS="\t"}{if($1~/16S/){print $0}}' | sort -t $'\t'-uk1,1 | awk 'BEGIN{FS="\t";OFS="\n"}{print $1,$2}'
ADD COMMENT
1
Entering edit mode
3.2 years ago
Fatima ▴ 1000

If each sequence is one and only one line:

grep -A 1 "16S_rRNA:" filename

Should do the trick!

ADD COMMENT
0
Entering edit mode

This has worked too. Thank You

is there a way to remove duplicate headers along with the associated fasta sequence

Looks like most of the genomes have more than 1 16SrRNA sequences (as expected).

ADD REPLY
0
Entering edit mode

You can use cd-hit with cutoff value of 1 (-c 1), it will remove the redundant sequences, but will also remove them if their header is different (as long as sequences are identical). So, for example when you have three identical sequences, the output file will only contain one of them ( the representative).

ADD REPLY
0
Entering edit mode

Thank you

This might not help because most of the 16S rRNA seqs would be conserved (identical).

ADD REPLY
0
Entering edit mode

Are the headers exactly identical?

grep "16S_rRNA" filename > headers 

#This is not very efficient but should do the work
cat headers | sort | uniq | while read line ; do grep -A 1 "${line}" filename >> output ; done
ADD REPLY

Login before adding your answer.

Traffic: 2409 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6