Question

Extract subset sequences from fasta file

0

8.8 years ago

zhwjmch ▴ 170

I would like to extract a subset of sequences from a FASTA file based on the following two criteria: 1) sequence length, 2) loc=2R and loc=2L; both pieces of information are given in the ID lines. Thanks in advance :)

Example: two sequences are listed below. I would like to extract all sequences with length greater than 300 AND (loc=2R OR loc=2L).

>FBtr0342981 type=untranslated_region; loc=2R:join(22136968..22137026,22138000..22138251,22162905..22162919);MD5=594e04c2c432646c24f2e6e22febd982; length=326; parent=FBgn0000008; release=r6.06; species=Dmel;
GTTCAATCTTTGTTTTCGTAGCGCGGC...GAAGTGAAGTAACCCATAAAACTAAC
>FBtr0083388 type=untranslated_region; loc=3R:complement(16828004..16828105);MD5=76d226b07edc8a389bce4902b3847b7a; length=102; parent=FBgn0000014; release=r6.06; species=Dmel;
GTTCAATCTTTGTTTTCGTAGCGCGGC...GAAGTGAAGTAACCCATAAAACTAAC

sequence • 4.2k views

updated 17 months ago by Ram 43k • written 8.8 years ago by zhwjmch ▴ 170

1

iterate through the lines in the file
if the line starts with > check if length > 300
- if it is, check if loc=2R or loc=2L and print until the next sequence header >
- if not, do nothing
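
The steps above can be sketched in Python roughly as follows. This is a minimal sketch, not a tested tool: it assumes every header carries `loc=<arm>:` and `length=<digits>` fields laid out as in the sample, and the function names `keep_header` and `filter_fasta` are made up for illustration.

```python
import re

# Assumed header layout (taken from the sample): "loc=2R:join(...)" and "length=326;"
LOC_RE = re.compile(r"\bloc=(2L|2R):")
LEN_RE = re.compile(r"\blength=(\d+)")


def keep_header(header, min_length=300):
    """Return True if the header has loc=2L or loc=2R and length > min_length."""
    length = LEN_RE.search(header)
    return bool(
        LOC_RE.search(header)
        and length
        and int(length.group(1)) > min_length
    )


def filter_fasta(lines, min_length=300):
    """Yield the header and sequence lines of every record that passes keep_header."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            # A new record starts here: decide once, then reuse the decision
            # for all sequence lines until the next header.
            keep = keep_header(line, min_length)
        if keep:
            yield line
```

To use it, open the FASTA file and write the yielded lines out, for example `sys.stdout.writelines(filter_fasta(open("file.fa")))`. Note that it trusts the `length=` value in the header rather than measuring the sequence itself.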

updated 4.5 years ago by Ram 43k • written 8.8 years ago by steven ▴ 70

Ram · Answer 1 · 2015-07-15

1

8.8 years ago

Brian Bushnell 20k

The BBMap package has a tool that will accomplish the first part:

reformat.sh in=file.fa out=filtered.fa minlen=300

There's another tool that would usually accomplish the second part, but unfortunately the = symbol is reserved. But judging by the headers, this would still work:

filterbyname.sh in=filtered.fa out=filtered2.fa substring=name names=2R,2L include=t

That will give you sequences whose headers contain 2R or 2L, which I think is what you want, rather than 2R and 2L.
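
Whether plain substring matching is safe depends on the headers, so it may be worth checking before relying on it. The sketch below lists headers where a substring hit for 2R or 2L disagrees with the actual `loc=` field. It is only an approximation: the comparison here is case-sensitive and does not reproduce filterbyname.sh's own matching rules, and the function name `substring_mismatches` is made up for illustration.

```python
import re

# Assumption: the chromosome arm is whatever follows "loc=" up to ":", ";" or whitespace.
LOC_RE = re.compile(r"\bloc=([^:;\s]+)")


def substring_mismatches(headers, names=("2R", "2L")):
    """Return headers where a plain substring hit disagrees with the loc= field."""
    mismatches = []
    for header in headers:
        # What a naive substring filter would decide.
        substring_hit = any(name in header for name in names)
        # What the loc= field actually says.
        loc = LOC_RE.search(header)
        loc_hit = bool(loc) and loc.group(1) in names
        if substring_hit != loc_hit:
            mismatches.append(header)
    return mismatches
```

An empty result on your own header lines suggests that the substring filter and a strict `loc=` check select the same records.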

updated 4.5 years ago by Ram 43k • written 8.8 years ago by Brian Bushnell 20k

0

Sure, 2R or 2L :) Thanks for recommending the BBMap package.

updated 17 months ago by Ram 43k • written 8.8 years ago by zhwjmch ▴ 170

Ram · Answer 2 · 2015-07-16

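This may work beyond your sample, as long as the headers keep the same layout:

#!/usr/bin/awk -f
# Print FASTA records whose header has loc=2L or loc=2R and length > 300.
# Assumes headers contain "loc=2R:..." and "length=<digits>" as in the sample.

/^>/ {
    # Header line: decide whether to keep the record that follows.
    keep = 0
    if ($0 ~ /loc=2[LR]:/ && match($0, /length=[0-9]+/)) {
        # "length=" is 7 characters long; the digits come right after it.
        len = substr($0, RSTART + 7, RLENGTH - 7) + 0
        keep = (len > 300)
    }
}

# Print the header and all sequence lines of a kept record.
keep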

Ram · Answer 3 · 2015-10-26

0

8.5 years ago

Matt Shirley 10k

If your sequence lengths match the description in the defline, you can do:

$ pip install pyfaidx
$ faidx --size-range 300,1000000 --regex 'loc=2[RL]' file.fa

updated 17 months ago by Ram 43k • written 8.5 years ago by Matt Shirley 10k