Question: Extract aligned parts of sequence to reference to new file
0
gravatar for claudia.paetzold
13 months ago by
claudia.paetzold0 wrote:

I've already looked here and in other forums, but couldn't find the answer to my question.

I want to design baits for a target enrichment Sequencing approach and have the output of a MarkerMiner search for orthologous loci from four different genomes with A. thaliana as a Reference as. These output alignments are separate Fasta-Files for each A. thaliana annotated gene with the sequences from my datasets aligned to it. I have already run a script to filter out those loci supported to be orthologous by at least two of my four input datasets.

However, now, I'm stumped. My alignments are gappy, since the input data is mostly RNAseq whereas the Reference contains the introns as well. So it looks like this :

AT01G1234567 
ATCGATCGATGCGCGCTAGCTGAATCGATCGGATCGCGGTAGCTGGAGCTAGSTCGGATCGC 

MyData1
--CGATGCGCGC-----------CGGATCGCGG---------------CGGATCGC

MyData2
--CGCTGCGCGC------------GGATAGCGG---------------CGGATCCC

To effectively design baits I now need to extract all the aligned parts from the file, so that I will end up with separate files; or separate alignments within a new file; for the parts that are aligned between MyData and the Reference sequence with all the gappy parts excluded. There are about 1300 of these fasta files, so doing it manually is no option.

I have a bit of programming experience in python and with Linux command line tools, however I am completely lost on how to go about this. I would appreciate a hint, on what kind of tools are out there I could use or what kind of algorithm I need to come up with.

I thought about a position guided approach; i.e. defining the position on the reference sequence, where the aligned parts of myData starts and ends and then extracting that. But I dont know how to iterate over the alignment and determining if it is aligned or not. Any help is greatly appreciated.

Thank you.

Cheers

extract alignment python • 432 views
ADD COMMENTlink modified 12 months ago by Juan Manuel Berros60 • written 13 months ago by claudia.paetzold0
0
gravatar for Juan Manuel Berros
12 months ago by
Buenos Aires, Argentina
Juan Manuel Berros60 wrote:

Hey. I'm not sure I understand what kind of output you need. Using the text you provided, a simple grep command can keep the nucleotides and remove the gaps ("-") --though I'm not sure this is what you need:

$ cat /tmp/test

AT01G1234567 
ATCGATCGATGCGCGCTAGCTGAATCGATCGGATCGCGGTAGCTGGAGCTAGSTCGGATCGC 

MyData1
--CGATGCGCGC-----------CGGATCGCGG---------------CGGATCGC

MyData2
--CGCTGCGCGC------------GGATAGCGG---------------CGGATCCC

$ grep -o '[^-]*' /tmp/test

AT01G1234567 
ATCGATCGATGCGCGCTAGCTGAATCGATCGGATCGCGGTAGCTGGAGCTAGSTCGGATCGC 
MyData1
CGATGCGCGC
CGGATCGCGG
CGGATCGC
MyData2
CGCTGCGCGC
GGATAGCGG
CGGATCCC
ADD COMMENTlink written 12 months ago by Juan Manuel Berros60
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1425 users visited in the last hour