Question

Pick the longest splice variant of a contig when more than one splice variant exists

9.0 years ago

anahita ▴ 10

input file:

>comp145.0
AGAATATGTTCATGTGATCCACT
>comp36865.0
AGAATATGTTCATGTGATCCACTGATACACATCTCAAAAGTTTGACATTTTTTTCTTGTT
>comp36865.1
TTTTCTCCTTCCTCCCGTGTCGCGACACTCGCA

output:

>comp145.0
AGAATATGTTCATGTGATCCACT
>comp36865.0
AGAATATGTTCATGTGATCCACTGATACACATCTCAAAAGTTTGACATTTTTTTCTTGTT

If a contig has two or more splice variants, only the longest one should be written to the output file. I need a solution that runs on Linux. At the moment I am using the code below, which gives me the wrong output file:

#!/bin/bash
# Tabs are written as \t or $'\t' so that no literal tab character has to survive copy/paste.
cat sample_contig_tur_seqID_and_seq.fasta |
awk '/^>/ {if(N>0) printf("\n"); printf("%s\t",$0);N++;next;} {printf("%s",$0);} END {if(N>0) printf("\n");}' | # linearize fasta
tr "." "\t" | # split the version off the header
awk -F '\t' '{printf("%s\t%d\n",$0,length($3));}' | # append sequence length
sort -t $'\t' -k1,1 -k4,4nr | # sort on name, then by decreasing length
sort -t $'\t' -k1,1 -u -s | # keep the first (longest) record per name; stable sort keeps previous order
sed $'s/\t/./' | # restore name
cut -f 1,2 | # cut name, sequence
tr "\t" "\n" | # back to fasta
fold -w 60 # pretty fasta

RNA-Seq • 1.1k views

Hi! You could try running your code one step at a time on a toy example to see which line causes the issue. And what do you mean by "wrong output file"?

PS: I ran your code on your example and the output is fine for me. I just had to be careful when copy/pasting all those tab characters, probably because I'm on a Mac.
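
A quick way to check whether the tabs survived copy/paste is to count tab-separated fields on a toy record. The sketch below is only an illustration: `toy_record` and `count_fields` are hypothetical helper names, and the record mimics what one line looks like after the linearize and `tr` steps (name, version, sequence).

```shell
#!/bin/bash
# Hypothetical toy record: name, version and sequence separated by tabs.
toy_record() {
    printf '>comp145\t0\tAGAATATGTTCATGTGATCCACT\n'
}

# Count tab-separated fields per line. '\t' is an escape that awk
# interprets itself, so no literal tab has to be typed or pasted.
count_fields() {
    awk -F '\t' '{ print NF }'
}

toy_record | count_fields                 # tabs intact
toy_record | tr '\t' ' ' | count_fields   # tabs replaced by spaces
```

With intact tabs the record splits into three fields; once the tabs have turned into spaces, the whole line is a single field, so `length($3)` and the `-k4,4nr` sort key no longer see the data they expect.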

9.0 years ago by Carlo Yague 9.0k

score 1 · Answer 1 · 2016-07-05

9.0 years ago

anahita ▴ 10

Hi Carlo, thanks! I fixed the tab character in the sed command and now the output is produced and looks fine. Thank you so much!
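
For anyone hitting the same problem, one way to avoid tab delimiters altogether is to do the selection in a single awk pass. This is a minimal sketch, not the script from the question: `keep_longest` is a hypothetical helper name, it assumes the variant suffix is everything after the last "." in the header, and it keeps contigs in input order instead of sorting them by name.

```shell
#!/bin/bash
# keep_longest (hypothetical name): reads FASTA on stdin and prints, for each
# contig, the longest variant. Ties keep the variant that appears first.
keep_longest() {
    awk '
        # Store the record read so far if it is the longest for its contig.
        function save_record() {
            if (hdr == "") return
            key = hdr
            sub(/\.[^.]*$/, "", key)              # drop the ".N" variant suffix
            if (!(key in best_len)) {
                order[++n] = key                  # remember first-seen order
                best_len[key] = -1
            }
            if (length(seq) > best_len[key]) {
                best_len[key] = length(seq)
                best_hdr[key] = hdr
                best_seq[key] = seq
            }
        }
        /^>/ { save_record(); hdr = $0; seq = ""; next }
        { seq = seq $0 }                          # join wrapped sequence lines
        END {
            save_record()
            for (i = 1; i <= n; i++)
                print best_hdr[order[i]] "\n" best_seq[order[i]]
        }
    '
}
```

It could be used as `keep_longest < sample_contig_tur_seqID_and_seq.fasta | fold -w 60` to get the same 60-column layout as the original pipeline.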
