I have hundreds of FASTA files, each containing hundreds of sequences. The median sequence length differs between files. I would like to remove, from each file, the sequences shorter than that file's median (or 75th-percentile) length. So far, the scripts and tools I've come across, such as USEARCH, can only trim sequences to a user-defined length. I'm looking for any useful way to do this, including sed and awk. Any thoughts?
Using BBTools, you can remove sequences by length like this:
reformat.sh in=file.fa out=filtered.fa minlen=1000
You can get the distribution of lengths using the same program:
reformat.sh in=file.fa lhist=lhist.txt
...which will give you the number of sequences at each length; you'd then need to process the resulting file (2-column, tab-delimited) to determine the X-th percentile. You can also get the L50 and N50 from stats.sh, which might be easier to parse, and readlength.sh may have some useful features as well; it's slightly different. I think lhist.txt contains the mean, median, and mode in the header, which would also be easy to parse.
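As a sketch of that two-step approach: once you have the length histogram, a short awk pass can walk the cumulative counts to find the percentile cutoff, which you then hand to reformat.sh as minlen. The sample lhist.txt below is fabricated for illustration (I'm assuming the real file is sorted ascending by length, with a `#`-prefixed header):

```shell
# Fabricated stand-in for the histogram reformat.sh would write with lhist=
cat > lhist.txt <<'EOF'
#Length	Count
100	2
200	2
300	4
400	2
EOF

# Walk the histogram until the cumulative count reaches 75% of all sequences;
# the length at that point is the 75th-percentile cutoff.
cutoff=$(awk '!/^#/ && NF {len[++n]=$1; cnt[n]=$2; total+=$2}
  END {target=0.75*total; run=0
       for(i=1;i<=n;i++){run+=cnt[i]; if(run>=target){print len[i]; exit}}}' lhist.txt)
echo "$cutoff"   # 300 for this sample

# Then filter with that cutoff:
# reformat.sh in=file.fa out=filtered.fa minlen=$cutoff
```

Wrap that in a loop over your files and each one gets filtered against its own distribution.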
I don't think you'll find a tool that does exactly this, though, because it's a fairly unusual request. Why do you want to do it?