Trimming an alignment around a given sequence
2
0
Entering edit mode
4 months ago
QLFblaireau ▴ 30

Hello, I have a simple question but actually I have troubles finding how to solve it. I have many alignment files. And I want for each of them get rid in the alignment of overhang bases. The twist is that I want this done relatively to a specific sequence, which is always the first of the alignment and is always named "Scaffoldxxxx" (where x are numbers and others). Here is a pic

As you see, I want to trim everything that is upstream and downstream of the start and end of the first sequence. That's easy in a sequence editor such as Jalview. But as I have thousands of alignments, I need to automate it. Surprisingly I don't even know where to start.

Many thanks for any help or insight.

As requested, here is a sample alignment :

>Scaffold_2:57492774-57492872
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
-------------------------------cttgagctggggtctggccatggggtaaa
gaagcagcagcagagacagaccaatgccaatgaggattccatactgcacacagtcacaag
catgggtta---------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
-----------
>_R_TRINITY_DN28760_c0_g2_i7
tgtttgtgtgtcttgatttacaaaaatgatgcacagtaaatgttgataatttttacgact
gctgaggagatacaaggaacatggtaattgtgtaatgaagacaatgccagcttactaaat
gtattactttctgctgtgtgacaatgatacttacacgggtgcggcataaaactatctcaa
ctcctttccgttccccttccaacaccatgcttcgtataccatgtagactggagaagtcga
ggcccgatatgtggaccatgtccaggatgacggccctggggacgtcctcctccagggcca
ggcccatgaccttgtcctccaggtactctatggctggaaagaacaggccctggtcaggct
ggatcagcaccagccccggctgctccttcaccttgagttgtggccgggccatggggtaga
gcaggagcaacaaggacagcccgataccgatcaggatcccatactcaatgcccacacata
gtgaacccaaaaaggtggccacatgcacaaacagatcccatttattggtgcgccacaaaa
cggggatgattttgtagtcgaccatctgcatgacggccatgatgatgaccgcggccagcg
ctgacttggggatgtagtaacagtagggcaccaggaaggccagtactaacaggatcaggg
accctgtgaaaagaccattcgccggtgttcttacaccgctctgtgagttgacagcagttc
tggaaaaactgccggtgacaggataggaatgaacaaaggaactgagaatgttggcagtac
ctatagctatcaactcttgtgtaggatcaatcttatagttattcacacgagctgcaacat
aaacaaaagtgtcatgaatttcatatcagcgacaaaaacttttacctaataaataaagtt
ttaaaaaggag


I would need to trim this alignment so that its length is the length of the first sequence. I don't want in the second sequence what is upstream and downstream of the bases that match the first sequence.

pairwise script crop alignment • 415 views
0
Entering edit mode

it would help better to understand the issue if you post the data instead of images.

0
Entering edit mode

I have updated accordingly.

0
Entering edit mode

first sequence is not in the second sequence.

3
Entering edit mode
3 months ago

If you knew how much to trim you can do it with any data trimmer, like cutadapt

figuring out how many to strip at start or end is a little trickier to do in a single command-line solution but it can be done with some scripting:

import sys

stream = sys.stdin

# Skip first sequence name.
line = next(stream)

seq = ''

line = next(sys.stdin)

# Collect the first sequence.
while not line.startswith(">"):
seq += line.strip()
line = next(stream)

lcount = len(seq) - len(seq.lstrip("-"))
rcount = len(seq) - len(seq.rstrip("-"))

print (lcount, rcount)


run it with:

cat foo.fa | python parse.py


it prints:

391 422


now use cutadapt like so:

 cutadapt -u 391 -u -422  foo.fa > out.fa


produces:

>Scaffold_2:57492774-57492872
cttgagctggggtctggccatggggtaaagaagcagcagcagagacagaccaatgccaatgaggattccatactgcacacagtcacaagcatgggtta
>_R_TRINITY_DN28760_c0_g2_i7
cttgagttgtggccgggccatggggtagagcaggagcaacaaggacagcccgataccgatcaggatcccatactcaatgcccacacatagtgaaccca

3
Entering edit mode
3 months ago

It's a two step procedure. Thanks Istvan Albert for making me under stand the query.

$seqkit head -n 1 test.fa | seqkit locate -Pdip '(N+)' seqID patternName pattern strand start end matched Scaffold_2:57492774-57492872 (N+) (N+) + 392 489 cttgagctggggtctggccatggggtaaagaagcagcagcagagacagaccaatgccaatgaggattccatactgcacacagtcacaagcatgggtta  Note down the coordinates: 392-489 here $ seqkit -w 0 subseq -r 392:489 test.fa

>Scaffold_2:57492774-57492872
cttgagctggggtctggccatggggtaaagaagcagcagcagagacagaccaatgccaatgaggattccatactgcacacagtcacaagcatgggtta
>_R_TRINITY_DN28760_c0_g2_i7
cttgagttgtggccgggccatggggtagagcaggagcaacaaggacagcccgataccgatcaggatcccatactcaatgcccacacatagtgaaccca


0
Entering edit mode

thanks a lot, I was thinking seqkit might have the solution but failed to find it. Thanks a lot.