7.8 years ago
anahita
input file:
>comp145.0
AGAATATGTTCATGTGATCCACT
>comp36865.0
AGAATATGTTCATGTGATCCACTGATACACATCTCAAAAGTTTGACATTTTTTTCTTGTT
>comp36865.1
TTTTCTCCTTCCTCCCGTGTCGCGACACTCGCA
desired output:
>comp145.0
AGAATATGTTCATGTGATCCACT
>comp36865.0
AGAATATGTTCATGTGATCCACTGATACACATCTCAAAAGTTTGACATTTTTTTCTTGTT
If two splice variants exist, the longer one should be chosen and written to the output file. I need a solution that runs on Linux. At the moment I am using this code, which gives me the wrong output file:
#!/bin/bash
cat sample_contig_tur_seqID_and_seq.fasta |
awk '/^>/ {if(N>0) printf("\n"); printf("%s\t",$0);N++;next;} {printf("%s",$0);} END {if(N>0) printf("\n");}' | #linearize fasta
tr "." "\t" | #extract version from header
awk -F ' ' '{printf("%s\t%d\n",$0,length($3));}' | #extract length
sort -t ' ' -k1,1 -k4,4nr | #sort on name, inverse length
sort -t ' ' -k1,1 -u -s | #sort on name, unique, stable sort (keep previous order)
sed 's/ /./' | #restore name
cut -f 1,2 | #cut name, sequence
tr "\t" "\n" | #back to fasta
fold -w 60 #pretty fasta
Hi! You could try running your code one step at a time on a toy example to see which step causes the issue. Also, what do you mean by "wrong output file"?
PS: I ran your code on your example and the output is fine for me. I just had to be careful when copying and pasting all those tab characters, probably because I'm on a Mac.
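If the literal tabs in the `sort -t` and `sed` steps turn into spaces when the script is pasted, those steps no longer split the fields, which may be what is happening here. As a rough alternative, here is a sketch that avoids literal tabs by doing everything in one awk pass instead of the sort-based pipeline. The function name `longest_variant` is made up for illustration, and it assumes every header has the form `>ID.N` as in your example.

```shell
#!/bin/bash
# Sketch: keep the longest sequence for each contig ID, using one awk pass
# and no literal tab characters. longest_variant is a hypothetical name.
# Assumes headers look like ">comp36865.0" (ID, dot, variant number).
longest_variant() {
  awk '
    # Store the record just read if it is the longest seen so far for its ID.
    function keep_longest() {
      if (hdr == "") return
      if (!(key in len)) order[++n] = key          # remember first-seen order
      if (!(key in len) || length(seq) > len[key]) {
        len[key] = length(seq); best_hdr[key] = hdr; best_seq[key] = seq
      }
    }
    /^>/ {
      keep_longest()
      hdr = $0; key = $0; seq = ""
      sub(/\.[^.]*$/, "", key)                     # drop the ".N" variant suffix
      next
    }
    { seq = seq $0 }                               # join wrapped sequence lines
    END {
      keep_longest()
      for (i = 1; i <= n; i++) { print best_hdr[order[i]]; print best_seq[order[i]] }
    }
  ' "$1"
}
```

You could then run something like `longest_variant sample_contig_tur_seqID_and_seq.fasta | fold -w 60` to get wrapped FASTA again. When two variants have the same length, this sketch keeps the first one it sees.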