Question

Multiple alignment: find and reverse all complement strands

0

Entering edit mode

5.8 years ago

tomaz-biostars • 0

Hi,

I have 100.000 16S rRNA sequences (about 1500 bp) in one fasta file. About half are written as forward, and the rest are written in the reverse strand. What is the fastest way to find all the complements and reverse them, so that the single fasta file will have all the sequences written as forward.

Thank you for all the suggestions. Tomaž

alignment reverse complement • 2.0k views

ADD COMMENT • link 5.8 years ago by tomaz-biostars • 0

1

Entering edit mode

Blast them all against a reference sequence that is in 5'-3', use

-outfmt '6 qacc sstart send' | awk 'BEGIN{FS="\t"}{if($2>$3){print $1}}' > all_3_5.txt

ADD REPLY • link 5.8 years ago by 5heikki 11k

0

Entering edit mode

Hello Tomaž,

are there any information in the header of each sequence whether this is a reverse or forward sequence? Is each sequence in a single line or is it a multi-line fasta? Are the forward and reverse sequences paires or are they completly independend?

It would help if you could show as an extract of your file.

fin swimmer

ADD REPLY • link 5.8 years ago by finswimmer 16k

0

Entering edit mode

Hi, The header does not give any information. Each sequence is independent and these are not reverse pairs. The fasta is in in single line. Here is the extract:

>a0bda06a-5edc-410d-8997-c6df46553218 runid=a2d5899809987b8927384e3d9a3a2f5fbebdb425 read=813 ch=186 start_time=2018-06-03T20:45:25Z barcode=barcode12
GGGTTTAACCGGATAATAAGCCTACCGTGACCAGGTAAGAAGAAGCCTTGCTGCAGATCGGTTGGTTTACCTTCGTGGCGTCCCGCTCAGCAGGATTGAACTACCGACTTCAG...
>26739b86-ea20-416d-af6a-5612770ac1c4 runid=a2d5899809987b8927384e3d9a3a2f5fbebdb425 read=8776 ch=12 start_time=2018-06-03T21:08:13Z barcode=barcode12
TTCGTTCCAGTTCGTTTGGGTGTTTTAGCAGGTTAGCCGCTACCGTGACCAGGTAGAAAGAAGCCCAAGAATCAGGAAGAGTTTGATCCTGGCTCAGGATGAACGCTGGCGG...
>51ab6aa3-a17e-45f6-88dc-5f38498ce729 runid=a2d5899809987b8927384e3d9a3a2f5fbebdb425 read=9438 ch=331 start_time=2018-06-03T21:08:10Z barcode=barcode12
TTGGGTGTTTAGCCAGTTCGCCACCGTGACGGTAGAAGCAGAATCGGACGGTTACCTTGTTGGGCTTCCCCAGTCATTGGTCTTGCCTTAGACGGCTGGCTCCTTGCGAG...

ADD REPLY • link updated 5.8 years ago by finswimmer 16k • written 5.8 years ago by tomaz-biostars • 0

0

Entering edit mode

Hi,

thanks for answering my questions. It is better to format file contents by using the code button (the one 101 010). I did it for you now.

So, if there are no informations in the header about the strand and the reads are all independent: How do you know that have of them are reverse? I guess some kind of mapping is needed. But this leads us to the most important question: What are you trying do to?

fin swimmer

ADD REPLY • link 5.8 years ago by finswimmer 16k

0

Entering edit mode

Blasting against 16S database shows me I have sequences in both directions. I want to reverse complement half of them so that all of them are in the same forward direction, which I need for downstream processing.

ADD REPLY • link 5.8 years ago by tomaz-biostars • 0

0

Entering edit mode

well, if you have the blast result already, then select the IDs for which end < start (== the complement hits), extract those from your orginal fasta and then pass those through a rev-comp tool (EMBOSS? linux/bash? ....)

ADD REPLY • link 5.8 years ago by lieven.sterck 15k

0

Entering edit mode

Can I also advise to re-name your post title?

It's a bit misleading and you might miss out on people who could contribute if the title would be more accurate

ADD REPLY • link 5.8 years ago by lieven.sterck 15k