Trimming an alignment around a given sequence
2
0
Entering edit mode
4 months ago
QLFblaireau ▴ 30

Hello, I have a simple question but actually I have troubles finding how to solve it. I have many alignment files. And I want for each of them get rid in the alignment of overhang bases. The twist is that I want this done relatively to a specific sequence, which is always the first of the alignment and is always named "Scaffoldxxxx" (where x are numbers and others). Here is a pic

what needs to be cropped is in the red squares

As you see, I want to trim everything that is upstream and downstream of the start and end of the first sequence. That's easy in a sequence editor such as Jalview. But as I have thousands of alignments, I need to automate it. Surprisingly I don't even know where to start.

Many thanks for any help or insight.

As requested, here is a sample alignment :

>Scaffold_2:57492774-57492872
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
-------------------------------cttgagctggggtctggccatggggtaaa
gaagcagcagcagagacagaccaatgccaatgaggattccatactgcacacagtcacaag
catgggtta---------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
-----------
>_R_TRINITY_DN28760_c0_g2_i7
tgtttgtgtgtcttgatttacaaaaatgatgcacagtaaatgttgataatttttacgact
gctgaggagatacaaggaacatggtaattgtgtaatgaagacaatgccagcttactaaat
gtattactttctgctgtgtgacaatgatacttacacgggtgcggcataaaactatctcaa
ctcctttccgttccccttccaacaccatgcttcgtataccatgtagactggagaagtcga
ggcccgatatgtggaccatgtccaggatgacggccctggggacgtcctcctccagggcca
ggcccatgaccttgtcctccaggtactctatggctggaaagaacaggccctggtcaggct
ggatcagcaccagccccggctgctccttcaccttgagttgtggccgggccatggggtaga
gcaggagcaacaaggacagcccgataccgatcaggatcccatactcaatgcccacacata
gtgaacccaaaaaggtggccacatgcacaaacagatcccatttattggtgcgccacaaaa
cggggatgattttgtagtcgaccatctgcatgacggccatgatgatgaccgcggccagcg
ctgacttggggatgtagtaacagtagggcaccaggaaggccagtactaacaggatcaggg
accctgtgaaaagaccattcgccggtgttcttacaccgctctgtgagttgacagcagttc
tggaaaaactgccggtgacaggataggaatgaacaaaggaactgagaatgttggcagtac
ctatagctatcaactcttgtgtaggatcaatcttatagttattcacacgagctgcaacat
aaacaaaagtgtcatgaatttcatatcagcgacaaaaacttttacctaataaataaagtt
ttaaaaaggag

I would need to trim this alignment so that its length is the length of the first sequence. I don't want in the second sequence what is upstream and downstream of the bases that match the first sequence.

pairwise script crop alignment • 415 views
ADD COMMENT
0
Entering edit mode

it would help better to understand the issue if you post the data instead of images.

ADD REPLY
0
Entering edit mode

I have updated accordingly.

ADD REPLY
0
Entering edit mode

first sequence is not in the second sequence.

ADD REPLY
3
Entering edit mode
3 months ago

If you knew how much to trim you can do it with any data trimmer, like cutadapt

figuring out how many to strip at start or end is a little trickier to do in a single command-line solution but it can be done with some scripting:

import sys

stream = sys.stdin

# Skip first sequence name.
line = next(stream)

seq = ''

line = next(sys.stdin)

# Collect the first sequence.
while not line.startswith(">"):
    seq += line.strip()
    line = next(stream)

lcount = len(seq) - len(seq.lstrip("-"))
rcount = len(seq) - len(seq.rstrip("-"))

print (lcount, rcount)

run it with:

cat foo.fa | python parse.py 

it prints:

391 422

now use cutadapt like so:

 cutadapt -u 391 -u -422  foo.fa > out.fa

produces:

>Scaffold_2:57492774-57492872
cttgagctggggtctggccatggggtaaagaagcagcagcagagacagaccaatgccaatgaggattccatactgcacacagtcacaagcatgggtta
>_R_TRINITY_DN28760_c0_g2_i7
cttgagttgtggccgggccatggggtagagcaggagcaacaaggacagcccgataccgatcaggatcccatactcaatgcccacacatagtgaaccca
ADD COMMENT
3
Entering edit mode
3 months ago

It's a two step procedure. Thanks Istvan Albert for making me under stand the query.

$ seqkit head -n 1 test.fa | seqkit locate -Pdip '(N+)' 

seqID   patternName pattern strand  start   end matched
Scaffold_2:57492774-57492872    (N+)    (N+)    +   392 489 cttgagctggggtctggccatggggtaaagaagcagcagcagagacagaccaatgccaatgaggattccatactgcacacagtcacaagcatgggtta

Note down the coordinates: 392-489 here

$ seqkit -w 0 subseq -r 392:489 test.fa

>Scaffold_2:57492774-57492872
cttgagctggggtctggccatggggtaaagaagcagcagcagagacagaccaatgccaatgaggattccatactgcacacagtcacaagcatgggtta
>_R_TRINITY_DN28760_c0_g2_i7
cttgagttgtggccgggccatggggtagagcaggagcaacaaggacagcccgataccgatcaggatcccatactcaatgcccacacatagtgaaccca

Download seqkit from here

ADD COMMENT
0
Entering edit mode

thanks a lot, I was thinking seqkit might have the solution but failed to find it. Thanks a lot.

ADD REPLY

Login before adding your answer.

Traffic: 1215 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6