Question

Extract aligned parts of sequence to reference to new file

0

Entering edit mode

7.5 years ago

claudia.paetzold • 0

I've already looked here and in other forums, but couldn't find the answer to my question.

I want to design baits for a target enrichment Sequencing approach and have the output of a MarkerMiner search for orthologous loci from four different genomes with A. thaliana as a Reference as. These output alignments are separate Fasta-Files for each A. thaliana annotated gene with the sequences from my datasets aligned to it. I have already run a script to filter out those loci supported to be orthologous by at least two of my four input datasets.

However, now, I'm stumped. My alignments are gappy, since the input data is mostly RNAseq whereas the Reference contains the introns as well. So it looks like this :

AT01G1234567 
ATCGATCGATGCGCGCTAGCTGAATCGATCGGATCGCGGTAGCTGGAGCTAGSTCGGATCGC 

MyData1
--CGATGCGCGC-----------CGGATCGCGG---------------CGGATCGC

MyData2
--CGCTGCGCGC------------GGATAGCGG---------------CGGATCCC

To effectively design baits I now need to extract all the aligned parts from the file, so that I will end up with separate files; or separate alignments within a new file; for the parts that are aligned between MyData and the Reference sequence with all the gappy parts excluded. There are about 1300 of these fasta files, so doing it manually is no option.

I have a bit of programming experience in python and with Linux command line tools, however I am completely lost on how to go about this. I would appreciate a hint, on what kind of tools are out there I could use or what kind of algorithm I need to come up with.

I thought about a position guided approach; i.e. defining the position on the reference sequence, where the aligned parts of myData starts and ends and then extracting that. But I dont know how to iterate over the alignment and determining if it is aligned or not. Any help is greatly appreciated.

Thank you.

Cheers

alignment extract python • 1.7k views

ADD COMMENT • link updated 7.4 years ago by Juan Manuel Berros ▴ 120 • written 7.5 years ago by claudia.paetzold • 0

score 0 · Answer 1 · 2016-11-23

Hey. I'm not sure I understand what kind of output you need. Using the text you provided, a simple grep command can keep the nucleotides and remove the gaps ("-") --though I'm not sure this is what you need:

$ cat /tmp/test

AT01G1234567 
ATCGATCGATGCGCGCTAGCTGAATCGATCGGATCGCGGTAGCTGGAGCTAGSTCGGATCGC 

MyData1
--CGATGCGCGC-----------CGGATCGCGG---------------CGGATCGC

MyData2
--CGCTGCGCGC------------GGATAGCGG---------------CGGATCCC

$ grep -o '[^-]*' /tmp/test

AT01G1234567 
ATCGATCGATGCGCGCTAGCTGAATCGATCGGATCGCGGTAGCTGGAGCTAGSTCGGATCGC 
MyData1
CGATGCGCGC
CGGATCGCGG
CGGATCGC
MyData2
CGCTGCGCGC
GGATAGCGG
CGGATCCC