Pick the longest contig splice variant when more than one splice variant exists
7.8 years ago
anahita ▴ 10

input file:

>comp145.0
AGAATATGTTCATGTGATCCACT
>comp36865.0
AGAATATGTTCATGTGATCCACTGATACACATCTCAAAAGTTTGACATTTTTTTCTTGTT
>comp36865.1
TTTTCTCCTTCCTCCCGTGTCGCGACACTCGCA

output:

>comp145.0
AGAATATGTTCATGTGATCCACT
>comp36865.0
AGAATATGTTCATGTGATCCACTGATACACATCTCAAAAGTTTGACATTTTTTTCTTGTT

If two splice variants exist, the longer one should be chosen and written to the output file. I need a solution that runs on Linux. At the moment I am using the following code, which gives me the wrong output file:

#!/bin/bash
# Field separators are written as \t or $'\t' so no literal tab has to survive copy/paste
cat sample_contig_tur_seqID_and_seq.fasta |
awk '/^>/ {if(N>0) printf("\n"); printf("%s\t",$0);N++;next;} {printf("%s",$0);} END {if(N>0) printf("\n");}' | #linearize fasta
tr "." "\t" | #extract version from header
awk -F '\t' '{printf("%s\t%d\n",$0,length($3));}' | #extract length
sort -t $'\t' -k1,1 -k4,4nr | #sort on name, inverse length
sort -t $'\t' -k1,1 -u -s | #sort on name, unique, stable sort (keep previous order)
sed $'s/\t/./' | #restore name
cut -f 1,2 | #cut name, sequence
tr "\t" "\n" | #back to fasta
fold -w 60 #pretty fasta
RNA-Seq • 1.0k views

Hi! You could try running your code one step at a time on a toy example to see which line causes the issue. And what do you mean by "wrong output file"?

PS: I ran your code on your example and the output is fine for me. I just had to be careful when copy/pasting all those tab characters, probably because I'm on a Mac.
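
Since the literal tabs are the fragile part, here is a minimal sketch of the same selection done in a single awk pass, with no tab-separated intermediate format. It is not the pipeline from the question: the function name `longest_variant` is made up for illustration, the contig name is assumed to be the header text before the last ".", ties keep the variant seen first, and contigs are printed in the order they first appear rather than sorted.

```shell
#!/bin/bash
# Hypothetical helper: print the longest sequence for each contig.
# Assumption: ">comp36865.1" belongs to contig ">comp36865" (text before the last ".").
longest_variant() {
  awk '
    # Compare the record just read against the best one stored for its contig.
    function keep_best(   key) {
      if (hdr == "") return                       # nothing read yet
      key = hdr
      sub(/\.[^.]*$/, "", key)                    # strip the variant suffix
      if (!(key in best_len)) order[++n] = key    # remember first-seen order
      if (!(key in best_len) || length(seq) > best_len[key]) {
        best_len[key] = length(seq)
        best_hdr[key] = hdr
        best_seq[key] = seq
      }
    }
    /^>/ { keep_best(); hdr = $0; seq = ""; next }  # new record starts
         { seq = seq $0 }                           # join wrapped sequence lines
    END  {
      keep_best()                                   # do not forget the last record
      for (i = 1; i <= n; i++)
        print best_hdr[order[i]] "\n" best_seq[order[i]]
    }
  ' "$@"
}
```

After sourcing the function, it could be called as `longest_variant sample_contig_tur_seqID_and_seq.fasta`. Each sequence is written on one line; pipe the result through `fold -w 60` if wrapped output is wanted.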

7.8 years ago
anahita ▴ 10

Hi Carlo, thanks. I changed the tab in the sed command, and now the output is produced and looks fine! Thank you so much.
