Question: Extracting N positions from FASTA file
3.6 years ago by
University of Oxford
samuel.lipworth30 wrote:

Hi, I am very new to this, so apologies if this is a simple question. I have been trying to figure this out for days to no avail, and my Python skills are not quite there yet!

I am trying to extract all N positions from novel bacterial sequences that have been aligned to a member of the same genus. I would like the start and end positions of all N motifs, e.g. the run of Ns in GTCAGNNNNNTGGT.

Is there an existing tool, or how could I go about creating this in Python?

Many thanks.

sequence • 2.0k views
modified 3.6 years ago by shenwei3565.2k • written 3.6 years ago by samuel.lipworth30

Try a regular expression: https://docs.python.org/2/library/re.html

written 3.6 years ago by ahmedakhokhar110
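
A minimal sketch of the regular-expression approach suggested above, using only the Python standard library. The helper names `read_fasta` and `n_runs` are hypothetical, and the demo sequence is the one from the question. Positions are reported as 1-based, inclusive start and end coordinates; adjust if you need 0-based coordinates.

```python
import re
from io import StringIO


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA-formatted text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            # Use the first word of the header as the record name.
            name, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def n_runs(sequence):
    """Return 1-based, inclusive (start, end) pairs for every run of N or n."""
    return [(m.start() + 1, m.end()) for m in re.finditer(r"[Nn]+", sequence)]


# Demo on the example from the question; for a real file, replace the
# StringIO object with open("your_file.fa") (hypothetical file name).
demo = StringIO(">seq\nGTCAGNNNNNTGGT\n")
for name, seq in read_fasta(demo):
    for start, end in n_runs(seq):
        print(name, start, end, sep="\t")
```

Each record is joined into a single string before searching, so runs of N that span line breaks in the file are reported as one interval. This holds a whole sequence in memory, which should be fine for bacterial genomes of a few million bases.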

Do you want to search for a specific pattern or for every location in which an 'N' is present?

written 3.6 years ago by WouterDeCoster44k

Every location at which an N is present. I have around 36 different strains and would like to produce a list of N locations for each FASTA file, then compare these lists to find the N locations unique to each strain.

written 3.6 years ago by samuel.lipworth30
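
The comparison described above could be sketched with Python sets. This assumes every strain has been aligned to the same reference, so that a given position number means the same thing in every strain. The strain names and toy sequences are made up for illustration, and `n_positions` and `unique_positions` are hypothetical helper names.

```python
import re


def n_positions(sequence):
    """Return the set of 1-based positions that hold an N (either case)."""
    return {m.start() + 1 for m in re.finditer(r"[Nn]", sequence)}


def unique_positions(positions_by_strain):
    """Map each strain to the N positions found in no other strain."""
    unique = {}
    for strain, positions in positions_by_strain.items():
        # Union of the N positions seen in every other strain.
        others = set()
        for other, other_positions in positions_by_strain.items():
            if other != strain:
                others |= other_positions
        unique[strain] = positions - others
    return unique


# Toy data standing in for the real aligned sequences (illustrative only).
strains = {
    "strainA": "ACGTNNACGT",
    "strainB": "ACGTNAACNT",
    "strainC": "NCGTNAACGT",
}
positions = {name: n_positions(seq) for name, seq in strains.items()}
for name, unique in unique_positions(positions).items():
    print(name, sorted(unique))
```

Position 5 is an N in all three toy strains, so it is unique to none of them; only positions 6, 9 and 1 are reported, for strainA, strainB and strainC respectively.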

So the output would be the chromosomal locations, right? Sounds like a job that can be done using Biopython. What have you tried?

written 3.6 years ago by WouterDeCoster44k

If the sequence is not long, you can do it without special software: open the FASTA file in WordPad.

written 3.6 years ago by nora40

That's not very helpful.

written 3.6 years ago by WouterDeCoster44k

Thanks, but the sequences are > 4 million bases.

written 3.6 years ago by samuel.lipworth30

Just use SeqKit. shenwei356 has even provided a detailed example below.

written 3.6 years ago by genomax86k
3.6 years ago by
shenwei3565.2k
China
shenwei3565.2k wrote:

Use SeqKit. SeqKit supports Windows/Linux/Mac OS X.

$ echo -en '>seq\nGTCAGNNNNNTGGT\n' | seqkit locate --ignore-case --only-positive-strand --pattern "N+" 
seqID   patternName     pattern strand  start   end     matched
seq     N+      N+      +       6       10      NNNNN

The same output with aligned columns (piped through column -t):

$ echo -en '>seq\nGTCAGNNNNNTGGT\n' | seqkit locate --ignore-case --only-positive-strand --pattern "N+" | column -t
seqID  patternName  pattern  strand  start  end  matched
seq    N+           N+       +       6      10   NNNNN

Or read from a file and save the result to a file (the output is tab-delimited text):

$ seqkit locate --ignore-case --only-positive-strand --pattern "N+" seqs.fa > result.xls
modified 3.6 years ago • written 3.6 years ago by shenwei3565.2k
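
If the SeqKit results then need to be processed in Python, the table could be loaded along the following lines. This is a sketch only: it assumes the tab-delimited output with the column names shown above (`seqID`, `start`, `end`), not the version piped through column -t. The function name `load_n_intervals` is hypothetical.

```python
import csv
from collections import defaultdict
from io import StringIO


def load_n_intervals(handle):
    """Group (start, end) intervals by sequence ID from tab-delimited text."""
    intervals = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        intervals[row["seqID"]].append((int(row["start"]), int(row["end"])))
    return dict(intervals)


# Demo using the output shown above; for a real run, pass
# open("result.xls", newline="") instead of the StringIO object.
sample = StringIO(
    "seqID\tpatternName\tpattern\tstrand\tstart\tend\tmatched\n"
    "seq\tN+\tN+\t+\t6\t10\tNNNNN\n"
)
print(load_n_intervals(sample))
```

Running this once per strain gives one dictionary of intervals per FASTA file, which can then be compared across strains.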