String research in a fasta file
3
0
Entering edit mode
5.6 years ago
Greta • 0

Hi all,

I want to use grep command to search a string of 6 characters in every line of a fasta file. In particular I need to search these 6 characters in the first 30 characters of each line.

I report you an example: String to search: GTGTCA

Fasta File:

HWI-ST740:1:C2GCJACXX:1:1101:1279:1825 NGACGCTCTGACCTTGGGGCTGGTCGGGGATGCTGAGGAGACGGTGACCAGGGTTCCCTGGCCCCACANNNCCAAGCTTCCNNNNNNNNNNNNNNNNNNN HWI-ST740:1:C2GCJACXX:1:1101:1349:1847 NTTAGATGAGGGAAACATCTGCATCAAGTTGTTATCTGTGACAACAAGTGTTGTTCCACTGCCAAAGAGTTTCTTATAATAAAACAATCGGGGTGGCACNNNNNN

I want that the research is done only in the bold characters. So what I have to add in grep command to put the limit of 30 characters? Thank you

fasta grep • 2.6k views
ADD COMMENT
2
Entering edit mode

Did you tried any thing first?
"just suggestion- there much better and faster solutions "
what about

 grep -v "^HWI" < your_file|  cut -c-30 | grep -i "GTGTCA" > out.txt
ADD REPLY
0
Entering edit mode

Ok thank you very much! But If I want to recreate a fasta file with all the reads that contain the string, I have to add -B command, right? For example:

grep -B1 -i "GTGTCA" < my file | cut -c-30 > out.fa

is it right?

ADD REPLY
1
Entering edit mode

the -B1 option will print 1 line before each match. but you're not limiting grep to the first 30 characters as you were stating. that command with search in the whole file looking to entire lines, returning lines that matched plus previous ones, and then cut the first 30 characters of those results. doesn't make sense to me.

ADD REPLY
1
Entering edit mode

I'm reluctant to provide an answer, but cut -c1-30 will output characters 1-30 of each line in the input file.

ADD REPLY
1
Entering edit mode

yes you are right

  cut -c-30 or cut -c1-30

will do what you suggested

ADD REPLY
1
Entering edit mode

that's not exactly a fasta file. fasta format requires a ">" character at the begging of the each tag line.

ADD REPLY
1
Entering edit mode

The sequence header looks like it was formatted from fastq, but is missing the '>'.

ADD REPLY
3
Entering edit mode
5.6 years ago

Try the cross-platform and ultrafast FASTA/Q toolkit SeqKit:

Note that the sequences you provide is NOT FASTA format, > is missing

>HWI-ST740:1:C2GCJACXX:1:1101:1279:1825
NGACGCTCTGACCTTGGGGCTGGTCGGGGATGCTGAGGAGACGGTGACCAGGGTTCCCTGGCCCCACANNNCCAAGCTTCCNNNNNNNNNNNNNNNNNNN
>HWI-ST740:1:C2GCJACXX:1:1101:1349:1847
NTTAGATGAGGGAAACATCTGCATCAAGT

------ [update] -------

You can just do it with one command with SeqKit v0.3.6 now:

seqkit grep -s -R 1:30 -i -r -p TCTGACC -w 0 big.fa

See the long-option version for better understanding:

seqkit grep --by-seq --region 1:30 --ignore-case --use-regexp --pattern TCTGACC --line-width 0 big.fa

------ [previous solution] -------

Step 1. Extract the leading 30bp subsequences, search with sequence pattern and save the FASTA sequence IDs to file.

seqkit subseq -r 1:30 big.fa | seqkit grep -s -i -r -p TCTGACC | seqkit seq -n -i > ids.txt

Step 2. Search original FASTA file with the ID list.

seqkit grep -f ids.txt big.fa -w 0 > result.fa

Note: SeqKit support FASTQ format too, since HWI-ST740:1:C2GCJACXX:1:1101:1279:1825 is a FASTQ header line.

I can also add feature of searching sequences in specified sequence region to seqkit grep, so the job could be done in only one command.

ADD COMMENT
0
Entering edit mode

Ok thank you I will try! Yes I had a fastq file at the beginning but maybe something went wrong during the conversion in fasta!

ADD REPLY
0
Entering edit mode

It will be great if I could do everything in one line command!

ADD REPLY
1
Entering edit mode

Hi @Greta, I've added the feature now. You can just do it with one command with SeqKit v0.3.6 now:

seqkit grep -s -R 1:30 -i -r -p TCTGACC -w 0 big.fa

See the long-option version for better understanding:

seqkit grep --by-seq --region 1:30 --ignore-case --use-regexp --pattern TCTGACC --line-width 0 big.fa

Cheers!

ADD REPLY
2
Entering edit mode
5.6 years ago

this looks like grep homework, but I can't help providing a valid perl alternative:

perl -ne 'print "$prev$_" if substr($_, 0, 30) =~ /GTGTCA/; $prev = $_' input.fasta
ADD COMMENT
1
Entering edit mode
5.6 years ago
st.ph.n ★ 2.7k

Python snippet (provided this fasta has only one sequence line following each header)

 #!/usr/bin/env python

    with open("fa_file.fasta", 'r') as f: 
         with open('seq_found.fasta', 'w') as out:
              for line in f:
                   if line.startswith('>'):
                        head = line.strip()
                        seq = next(f).strip()[:31]
                        if 'GTGTCA' in seq:
                             print >> out, head, '\n', seq
ADD COMMENT
3
Entering edit mode

I would like to add a few comments to your solution. It's not bad but improvements are possible.

  1. Use Biopython SeqIO to parse fasta files and don't write your own parser if you don't have to. This also avoids assumptions about the structure of the fasta file
  2. You can open multiple files in one with statement using with open('file1') as f, open('file2') as f2:
  3. Why do you do line.strip() while you add a '\n' when writing to the output?
  4. Don't use the 'old' >> way of writing to a file, use (in this case) out.write(head + seq). As such the code is also portable to python3
ADD REPLY
0
Entering edit mode

Thanks WouterDeCoster, you had a similar comment on another post where I used the the same approach to open a fasta. I provided these examples so that the OP can see what's happening, and not just relying on Biopython to open the file.

  1. I agree assumptions would certainly be avoided with biopython
  2. Good point.
  3. I do line.strip() so the sequence can be sliced without the new line character.
  4. I"m still using 2.7, and haven't made the switch to 3. I guess I should start learning the syntax.
ADD REPLY
0
Entering edit mode

That's not a good motivation to use manual parsing over Biopython. Also, you are not slicing the head so the strip() doesn't have a function there.

And for most applications I also use python2.7 (without good reason). However, making code portable will save you a headache.

ADD REPLY
0
Entering edit mode

Do you mean every read in one line (header_space_sequence)?

ADD REPLY
0
Entering edit mode

No, what he means is every sequence on one line so

>header
sequence
>header2
sequence2

and not

>header
sequencepart1
sequencepart2
sequencepart3
>header2
sequence2part1
sequence2part2
sequence2part3

But as I wrote in my comment, by using Biopython SeqIO this assumption would be avoided

ADD REPLY
0
Entering edit mode

No he did not say that; the head variable already have "\n" new line at its end so when you write it in the output it will be header -> new line -> sequence

ADD REPLY
0
Entering edit mode

It wont because it's print >> out, header, '\n', seq and not :

print >> out, header
print >> out, '\n'
print >> out, seq
ADD REPLY

Login before adding your answer.

Traffic: 2475 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6