Question: Replace _ in sequences with variant allele in fasta header
1
gravatar for affaid_tyj
3 months ago by
affaid_tyj10
affaid_tyj10 wrote:

Hello everyone,

I got a few flanking sequences around a bunch of SNPs. Since they were export from biomart, the variant allele was like a _:

>2|28804|28804|rs544393009|dbSNP|C/T
  TGCAT_TCAGC
>2|29358|29358|rs943169482|dbSNP|G/GAAAA
  CCCAA_AAGCC
>2|28396|28396|rs781239227|dbSNP|T/C
  AATGC_CTTGG

Can some experts help me to replace the _ in each fasta sequence with its header description? That is each fasta input will be turned in 2 or 3 flanking sequence:

>2|28804|28804|rs544393009|dbSNP|C
  TGCATCTCAGC
>2|28804|28804|rs544393009|dbSNP|T
  TGCATTTCAGC

And that is what I am desperately seeking for!!!

Thanks in advance!

genome • 326 views
ADD COMMENTlink modified 3 months ago • written 3 months ago by affaid_tyj10
1

Please reformat the sequences using 101010 button. It confused others!!!

Are these the original sequences?

>2|28804|28804|rs544393009|dbSNP|C/T
TGCAT_TCAGC
>2|29358|29358|rs943169482|dbSNP|G/GAAAA
CCCAA_AAGCC
>2|28396|28396|rs781239227|dbSNP|T/C
AATGC_CTTGG

Unformated FASTA will be shown as bellow:

2|28804|28804|rs544393009|dbSNP|C/T TGCAT_TCAGC 2|29358|29358|rs943169482|dbSNP|G/GAAAA CCCAA_AAGCC 2|28396|28396|rs781239227|dbSNP|T/C AATGC_CTTGG

Or, I guess you find they were mixed in one line, so you manually press ENTER between sequences, so it looks like this:

2|28804|28804|rs544393009|dbSNP|C/T TGCAT_TCAGC

2|29358|29358|rs943169482|dbSNP|G/GAAAA CCCAA_AAGCC

2|28396|28396|rs781239227|dbSNP|T/C AATGC_CTTGG

ADD REPLYlink modified 3 months ago • written 3 months ago by shenwei3563.4k

Thank you shenwei. I already edit my questions. Sorry about my careless.

ADD REPLYlink written 3 months ago by affaid_tyj10

Thanks Wei Shen; Thanks affaid_tyj. I will add a new answer to my current answer shortly.

ADD REPLYlink written 3 months ago by Kevin Blighe9.3k
2
gravatar for Kevin Blighe
3 months ago by
Kevin Blighe9.3k
London
Kevin Blighe9.3k wrote:

This was somewhat difficult and posed a challenge. Originally I started with just sed and awk but it got very complex with them.

Nevertheless, if the format of your input file is VERY consistent, the following should work (assuming your input is in snps.txt):

paste <(cut -f1 -d_ snps.txt  | sed 's/\/[ATGC]* / /g') <(sed 's/[0-9XY]*|[0-9]*|[0-9]*|rs[0-9]*|dbSNP|//g' snps.txt | sed 's/\/[ATGC]* [ATGC]*_[ATGC]*//g') <(cut -f2 -d_ snps.txt) -d ""

2|28804|28804|rs544393009|dbSNP|C TGCATCTCAGC
2|29358|29358|rs943169482|dbSNP|G CCCAAGAAGCC
2|28396|28396|rs781239227|dbSNP|T AATGCTCTTGG

...or, to break it down a bit:

sed 's/[0-9XY]*|[0-9]*|[0-9]*|rs[0-9]*|dbSNP|//g' snps.txt | sed 's/\/[ATGC]* [ATGC]*_[ATGC]*//g' > RefAlleles.list
paste <(cut -f1 -d_ snps.txt  | sed 's/\/[ATGC]* / /g') <(cat RefAlleles.list) <(cut -f2 -d_ snps.txt) -d ""
ADD COMMENTlink modified 3 months ago • written 3 months ago by Kevin Blighe9.3k

How is the code going to give for the right side of the / ? Since for some indels, just like sequence 2, the 2 alleles are like G vs GAAA .

ADD REPLYlink written 3 months ago by affaid_tyj10

Apologies, I have just considered the left side of the forward slash ('/'). Maybe this is a better answer:

Assuming your data is in snps.txt:

2|28804|28804|rs544393009|dbSNP|C/T TGCAT_TCAGC
2|29358|29358|rs943169482|dbSNP|G/GAAAA CCCAA_AAGCC
2|28396|28396|rs781239227|dbSNP|T/C AATGC_CTTGG


paste <(cut -f1 -d_ snps.txt | sed 's/\/[ATGC]* / /g') <(sed 's/[0-9XY]*|[0-9]*|[0-9]*|rs[0-9]*|dbSNP|//g' snps.txt | sed 's/\/[ATGC]* [ATGC]*_[ATGC]*//g') <(cut -f2 -d_ snps.txt) -d "" | sed 's/^/>/g' | sed 's/ /\n/g'
>2|28804|28804|rs544393009|dbSNP|C
TGCATCTCAGC
>2|29358|29358|rs943169482|dbSNP|G
CCCAAGAAGCC
>2|28396|28396|rs781239227|dbSNP|T
AATGCTCTTGG


paste <(cut -f1 -d"_" snps.txt | sed 's/[ATGC]*\///g') <(sed 's/[0-9XY]*|[0-9]*|[0-9]*|rs[0-9]*|dbSNP|[ATGC]*\///g' snps.txt | sed 's/ [ATGC]*_[ATGC]*//g') <(cut -f2 -d_ snps.txt) -d "" | sed 's/^/>/g' | sed 's/ /\n/g'
>2|28804|28804|rs544393009|dbSNP|T
TGCATTTCAGC
>2|29358|29358|rs943169482|dbSNP|GAAAA
CCCAAGAAAAAAGCC
>2|28396|28396|rs781239227|dbSNP|C
AATGCCCTTGG
ADD REPLYlink modified 3 months ago • written 3 months ago by Kevin Blighe9.3k

If the input is definitely FASTA, like this:

>2|28804|28804|rs544393009|dbSNP|C/T
  TGCAT_TCAGC
>2|29358|29358|rs943169482|dbSNP|G/GAAAA
  CCCAA_AAGCC
>2|28396|28396|rs781239227|dbSNP|T/C
  AATGC_CTTGG

Here is a fix to my original code (note that it assumes that that the sequences are short and fit on a single line:

First I crete a temporary file where I get the sequences and headers on 1 line again:

paste -s -d' \n' snps.fasta > snps.fasta.tmp

Then:

paste <(cut -f1 -d_ snps.fasta.tmp | sed 's/\/[ATGC]* / /g') <(sed 's/>[0-9XY]*|[0-9]*|[0-9]*|rs[0-9]*|dbSNP|//g' snps.fasta.tmp | sed 's/\/[ATGC]* [ATGC]*_[ATGC]*//g') <(cut -f2 -d_ snps.fasta.tmp) -d "" | sed 's/^//g' | sed 's/ /\n/g' > snps.allele1.fasta

cat snps.allele1.fasta

>2|28804|28804|rs544393009|dbSNP|C
TGCATCTCAGC
>2|29358|29358|rs943169482|dbSNP|G
CCCAAGAAGCC
>2|28396|28396|rs781239227|dbSNP|T
AATGCTCTTGG

paste <(cut -f1 -d"_" snps.fasta.tmp | sed 's/[ATGC]*\///g') <(sed 's/>[0-9XY]*|[0-9]*|[0-9]*|rs[0-9]*|dbSNP|[ATGC]*\///g' snps.fasta.tmp | sed 's/ [ATGC]*_[ATGC]*//g') <(cut -f2 -d_ snps.fasta.tmp) -d "" | sed 's/ /\n/g' > snps.allele2.fasta

cat snps.allele2.fasta

>2|28804|28804|rs544393009|dbSNP|T
TGCATTTCAGC
>2|29358|29358|rs943169482|dbSNP|GAAAA
CCCAAGAAAAAAGCC
>2|28396|28396|rs781239227|dbSNP|C
AATGCCCTTGG

Wei Sheng may have an easier solution with his seqkit program.

ADD REPLYlink modified 3 months ago • written 3 months ago by Kevin Blighe9.3k
2
gravatar for shenwei356
3 months ago by
shenwei3563.4k
China
shenwei3563.4k wrote:

That's not that hard by matching the SNPs and flanking sequences using a regular expression.

To make it easy to process, let's linearize the FASTA sequences, i.e. converting to tabular format. For example:

$ seqkit fx2tab seqs.fa 
2|28804|28804|rs544393009|dbSNP|C/T     TGCAT_TCAGC

Hereby, we can follow the original answer with little modification. Note that the 1 or 2 before allels are used to keeping order, which will be removed in the end.

# FASTA -> tabular
#     replace 'C/T\tTGCAT_TCAGC' with  '1C\tTGCATCTCAGC'
#     tabular -> FASTA
$ seqkit fx2tab seqs.fa | \
    perl -pe 's/(\w+)\/(\w+)\t(\w+)_(\w+)/1$1\t$3$1$4/' | \
    seqkit tab2fx > seqs.allel1.fa

# FASTA -> tabular
#     replace 'C/T\tTGCAT_TCAGC' with  '2T TGCATTTCAGC'
#     tabular -> FASTA
$ seqkit fx2tab seqs.fa | \
    perl -pe 's/(\w+)\/(\w+)\t(\w+)_(\w+)/2$2\t$3$2$4/' | \
    seqkit tab2fx > seqs.allel2.fa

Then sort records in the two files by sequence ID in alphabetical order

# sort by ID
# remove the 1/2 before alleles
$ seqkit sort seqs.allel*.fa |\
    perl -pe 's/\|\d(\w+)$/\|$1/'
>2|28396|28396|rs781239227|dbSNP|T
AATGCTCTTGG
>2|28396|28396|rs781239227|dbSNP|C
AATGCCCTTGG
>2|28804|28804|rs544393009|dbSNP|C
TGCATCTCAGC
>2|28804|28804|rs544393009|dbSNP|T
TGCATTTCAGC
>2|29358|29358|rs943169482|dbSNP|G
CCCAAGAAGCC
>2|29358|29358|rs943169482|dbSNP|GAAAA
CCCAAGAAAAAAGCC

Original answer:

That's not that hard by matching the SNPs and flanking sequences using a regular expression.

# replace 'C/T TGCAT_TCAGC' with  'C TGCATCTCAGC'
$ perl -pe 's/(\w+)\/(\w+) (\w+)_(\w+)/$1 $3$1$4/' seqs.fa > seqs.allel1.fa 
# replace 'C/T TGCAT_TCAGC' with  'T TGCATTTCAGC'
$ perl -pe 's/(\w+)\/(\w+) (\w+)_(\w+)/$2 $3$2$4/' seqs.fa > seqs.allel2.fa

If the order between two allels is not important, just sort them by sequence ID in alphabetical order:

$ seqkit sort seqs.allel*.fa
>2|28804|28804|rs544393009|dbSNP|C TGCATCTCAGC
ACTGN
>2|28804|28804|rs544393009|dbSNP|T TGCATTTCAGC
ACTGN
>2|29358|29358|rs943169482|dbSNP|G CCCAAGAAGCC
actgn
>2|29358|29358|rs943169482|dbSNP|GAAAA CCCAAGAAAAAAGCC
actgn

If the order is important, it can be done with a trick.

# replace 'C/T TGCAT_TCAGC' with  '1C TGCATCTCAGC'
$ perl -pe 's/(\w+)\/(\w+) (\w+)_(\w+)/1$1 $3$1$4/' seqs.fa > seqs.allel1.fa 
# replace 'C/T TGCAT_TCAGC' with  '2T TGCATTTCAGC'
$ perl -pe 's/(\w+)\/(\w+) (\w+)_(\w+)/2$2 $3$2$4/' seqs.fa > seqs.allel2.fa

# remove the 1/2 before allels after sort
$ seqkit sort seqs.allel*.fa | perl -pe 's/\|\d(\w+) /\|$1 /'
>2|28804|28804|rs544393009|dbSNP|C TGCATCTCAGC
ACTGN
>2|28804|28804|rs544393009|dbSNP|T TGCATTTCAGC
ACTGN
>2|29358|29358|rs943169482|dbSNP|G CCCAAGAAGCC
actgn
>2|29358|29358|rs943169482|dbSNP|GAAAA CCCAAGAAAAAAGCC
actgn
ADD COMMENTlink modified 3 months ago • written 3 months ago by shenwei3563.4k

It is so wise to take my sequence as fasta format and considered the order between each flanking sequences. Thank you so much for your help!!!

ADD REPLYlink written 3 months ago by affaid_tyj10

If it helps, please upvote and accept this answer.

ADD REPLYlink written 3 months ago by shenwei3563.4k

I don't believe that your result is what was requested.

ADD REPLYlink written 3 months ago by Kevin Blighe9.3k
1

Thanks for pointing out, I just notice that OP did not rightly format the code. Answer is updated.

ADD REPLYlink modified 3 months ago • written 3 months ago by shenwei3563.4k

Great to know you Wei Shen!

ADD REPLYlink written 3 months ago by Kevin Blighe9.3k

Hi Shenwei,

What if I want to replace _ with some other codes like - %? Can I still use w+ to represent them?

ADD REPLYlink written 3 months ago by affaid_tyj10
2

no, use .. you can "learn regular expression in 30 minutes".

by the way, use single back quote to quote inline code or special symbols.

ADD REPLYlink written 12 weeks ago by shenwei3563.4k

Hi Mr Shenwei,

I just want to say that seqkit is totally a miracle.

So powerful!

Thank you so much!

ADD REPLYlink written 3 months ago by affaid_tyj10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1374 users visited in the last hour