problem with filtering "Sequence unavailable"
5.1 years ago
ashkan ▴ 130

I have a file like this small example:

>ENSG00000004142|ENST00000003607|POLDIP2|||2118
Sequence unavailable
>ENSG00000003056|ENST00000000412|M6PR|9099001;9102084|9099001;9102551|2756
CCAGGTTGTTTGCCTCTGGTCGGAAAGGGAAACTACCCCTGCTTCCACTCTGACAGCAGA


But I have too many "Sequence unavailable" entries. I want to get rid of those transcripts, so that the result looks like this:

>ENSG00000003056|ENST00000000412|M6PR|9099001;9102084|9099001;9102551|2756
CCAGGTTGTTTGCCTCTGGTCGGAAAGGGAAACTACCCCTGCTTCCACTCTGACAGCAGA


I tried to filter out those records in bash with:

grep -v "$(grep -B 1 "Sequence unavailable" file.txt)" file.txt


but it gave this error:

Argument list too long


How can I filter them out in bash or Python?

sequence • 1.4k views

How about the following? It should work as long as the first record is "Sequence unavailable"; you can be creative otherwise:

grep -A 2 "Sequence" your.fa | grep -v "\-\-" | sed -n '/Sequence/!p' > new.fa
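
Since the grep pipeline depends on how the records happen to be laid out, reading the file one record at a time may be more robust. The Python sketch below is one possible way to do that. It assumes each record is a header line starting with `>` followed by one or more sequence lines, and that an unavailable record contains the exact line `Sequence unavailable`. The function names `drop_unavailable` and `filter_file` are made up for this illustration. It should also avoid the "Argument list too long" error, because nothing is passed through a command-line argument.

```python
def _keep(record, marker):
    """Return the record's lines unless one of its body lines is the marker."""
    # record[0] is the header; everything after it is sequence text.
    if record and marker not in (line.strip() for line in record[1:]):
        return record
    return []


def drop_unavailable(lines, marker="Sequence unavailable"):
    """Yield FASTA lines (without newlines), skipping records flagged by `marker`."""
    record = []  # lines of the record currently being read, header first
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record: emit it if it is usable.
            yield from _keep(record, marker)
            record = [line]
        elif line.strip() and record:
            # Non-blank line belonging to the current record.
            record.append(line)
    # Emit the last record, which has no following header to close it.
    yield from _keep(record, marker)


def filter_file(src_path, dst_path):
    """Copy src_path to dst_path, leaving out the unavailable records."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in drop_unavailable(src):
            dst.write(line + "\n")
```

For example, `filter_file("file.txt", "new.fa")` would write the remaining records to `new.fa` (both file names are placeholders taken from this thread). Blank lines are dropped, and multi-line sequences are kept as they are.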


It would be nice to give feedback on the solution genomax2 proposed. In addition, you have other questions that you left "open/unsolved" after people tried to help you. That's not respectful.

I pledged to help you on your previous thread, but my questions remain unanswered, although it's clear that you have been active multiple times on biostars since my comment. You shouldn't take our help for granted.
