Bash command to search a fasta file for a sequence segment and to print the neighboring region of the match in every read
0
0
Entering edit mode
2.0 years ago
Human • 0

hello humans,

I'm trying to reconstruct a sequence based on a PoolSeq file of a population (in fasta format) and a conserved area. I want to search for matches with this sequence and then build up the neighboring area starting from that conserved sequence.

So I need a command to search for a match and print whatever comes next to it in that read. I need to print all reads, containing that sequence to BLAST them afterwards and find out if its on the gene of my interest. Bit by bit I want to reconstruct that gene (because it apparently differs from the NCBI version).

Can I do this with a Bash command and search a fasta file for a sequence segment and to print the neighboring region of the match in every read?

Input:

20~ bp Sequence

Output:

all reads containing that sequence segment with the neighboring region in that read

Bash Match Sequence print search • 995 views
ADD COMMENT
1
Entering edit mode

Dear Human,

reviewing your post history, you still seem to revolve around the same problem for a few weeks. I am sorry, we couldn't help you solve your problem despite the responses you have received so far. Your problem seems to be reconstructing a sex-specific gene from a pool of sequences (PoolSeq of unknown depth and technology specs, assuming the individuals are barcoded and their sex is known) from the genome or transcriptome of an unspecified species and sample. Your hypothesis is that the sex-specific gene or isoform in that unspecified species is different from the reference sequence and annotation and that the sex difference is detectable from your data. You are further assuming (based on speculation that would require validation first) that a 20bp conserved nucleotide sequence is sufficient and specific genome-wide for detecting the gene in question. Now you are looking for a method to support that hypothesis (even though it has limited if any support so far, at least none was presented).

Please allow me to be frank here: the reason why you haven't succeeded so far is in part because you keep asking the question in the wrong way. First, you have never disclosed all the details in a single post, and secondly and very importantly, you are limiting yourself and us by already proposing and trying to restrict tools and resources. So, try to forget for a moment about BLAST and a simple bash or perl script, because possibly the best solution is nothing like that. Once you are ready to accept that, we can work on the optimal solution to your problem which is possibly very different.

Best

Another human

ADD REPLY
0
Entering edit mode

This is probably not really a task for bash. It is theoretically possible, however it will be very clumsy code, even if you're looking for exact substring matches which it sounds like you aren't.

I would suggest that this is probably better achieved in biopython .

ADD REPLY
0
Entering edit mode

I will totally try that! What code approch would you suggest to fullfill this task. If you had to do it, how would you?

ADD REPLY
0
Entering edit mode

I'm not sure I yet fully understand what you're trying to achieve so don't think I can offer an approach, but it sounds like you need to do something like:

  1. Take query and subject sequence
  2. Do a local alignment (checkout the pairwise2 module within BioPython)
  3. Use the returned coordinates to extract via slicing notation the sequence surrounding it
ADD REPLY
1
Entering edit mode

I am not so sure, I think this whole alignment thought is an XY problem. I think op should rather look into standard pipelines for assembly and variant calling first before trying to build something non-standard.

After assembly, contigs can be mapped to the reference and reads to contigs to check for eventual differences.

After all, what was proposed by op sounds like a a simple seed based assembly approach, but there is not enough evidence that the 20bp sequence is sufficient nor specific enough to assemble the suspected region of variation. On the other hand the 20bp sequence seems to be important because op keeps bringing it up, so maybe there is prior evidence, e.g. from published variants, but then this needs to be explained.

I think the crux is to try to stick to established methods and protocols first, in particular if you are a beginner.

ADD REPLY

Login before adding your answer.

Traffic: 1363 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6