Question: Selecting defined length fasta sequence and excluding them from a dataset
0
gravatar for anis.geb.cu
12 months ago by
anis.geb.cu10
anis.geb.cu10 wrote:

I have a dataset of 5640 protein sequences in fasta format. I want to select and make a new dataset with the proteins having more than 100 amino acids. How can I do it using cmd command or Rstudio? What should be the script to do this? I badly need the answer!! Please help me.....

alignment sequence R • 668 views
ADD COMMENTlink modified 12 months ago by cpad011211k • written 12 months ago by anis.geb.cu10

What format is your data in already? Have you created a dataframe?

Please provide more information - ideally a minimal representative dataset (the first 10 lines of a dataframe with head() would do for instance).

ADD REPLYlink written 12 months ago by jrj.healey12k

I haven't created a separate dataframe. The data are in fasta format in a fasta file. For example, below are the first 10 sequences of this dataset........

>gi|229269494|gb|ACQ51130.1| transition state regulatory protein, AbrB (plasmid) [Bacillus anthracis str. A0248] MRNCRKNPIEIFVEDEKIILQKSKSYDACTITADISEKIIPLANRQIVLSSDGIELLIKEIQQHLVK

>gi|229269492|gb|ACQ51128.1| hypothetical protein BAA_A0045 (plasmid) [Bacillus anthracis str. A0248] MPSKKCGEDKLDDEGPVIVRYIKDSILDAEEYRYMEFKEVKGDHSIRSILSVVDQYVVAYLNERKKQFPG TIIWGITDKERKVVGTKLIHKERDELRRSIVERLDQIQPPIPPSSYKIDIKEVYDSSLKHIEDTYIVEVI VDTVTGNDLFATAKNEVYIKTDGGKKKLNHIQIQQEILLRNQHKRFNLK

>gi|229269491|gb|ACQ51127.1| pXO1-135 (plasmid) [Bacillus anthracis str. A0248] MKAERIFEGTSADGIWNKAIFKVKEGSYYYISENSTIEITDNEYLNVEVLEEVHQGEDHNLFAVKGDERD MLQQLEMDGFNFEDVDWGDLNLLDNEIMSSTDAIEEWGIESSTLRKRIKDFPKGSIRKIGTTYAVTRFGM RCVFGSNRK

>gi|229269490|gb|ACQ51126.1| UDP-glucose 6-dehydrogenase (plasmid) [Bacillus anthracis str. A0248] MNISIVGTGYVGLVTGVCLSEVGHNVTCIDIDEEKVKKMKLGYSPIFEPCLEELMKNNIIKGRLHFTTNY VDGADGAEIFYIAVGTPQKEDGSADLSFIKQAAINIARTIKNDVIIVVKSTVPVGTNIYIKNLILKNLNY DVKVDIISNPEFLREGSAVNDTFYGDRIVIGSENKTSANVMEEVYKPFGTPIFKTDIQSAEMIKYASNTF LATKISFINGIANLCEMVGADVEKVAQGMGQDKRIGSQFLNAGIGYGGSCFPKDTHALVKVSESLQHKFH LLESVIKLNKNQQTVLIEKIKKRFGSIVGKKIALLGLSFKPNTDDLREAPSIPIARKLVEEGAQVIAYDP VAIKNAREVLPKEVHYVYSTMEALTEADIALVLTEWDEVVDSLLLKASQLMKEPVIFDGRNCFELNEAKN YDVEYHSIGRPSVLREVKGEITA

>gi|229269489|gb|ACQ51125.1| transposase X (plasmid) [Bacillus anthracis str. A0248] MIIIDFLEERNPILKTRELLTDIQRTFFYKIPSQMDERELIRYYTLSDEELQIVNQQRGDHNRLGFAIQI SYPRFPGRPLLAKEKIPQFLVNFLAKQIGVASWEVQNYARTRDTTRREHVNKIRNLYNLRTFTLREYREL ARWLLPLAMKTENGYLLVEALIQEMRKRKIILPAVYAIEHLAWSVRERARKRIFKYLTKGLSSYQYEQLD TLLYTHEEKKSSLLSWLRQPPGVIALKNFHEILDRLEFIQSLDLPLDNGKEIHQNRLIQMAREGSRYSIQ HLSRFNERKRYATIMAFLIHIYAFLTDQGLDMFAKLMGRMFNRGENKHNKLFQKDGKSINEKVRLYVEIG KALIEAKELETDPFEAVQSIISWEKFIHSVHEAESLARPIEFDYLDLLDNHYGHFRKMAPRLLNRYVFKA SSTSQTVLQALELLKEANQSGKRKIPDNVPTDFIKSKWMKYVYQDDKINRHYYEFSVFSELLNTLRSGDM WVKGSKQYKDFEEYLLPPHTWQHMKQTQQIPLEFTEDVEKYLNKRFEQLDIELSKVNCLIANQDLQGVTI YGDKIQVASLKKMVPEDVEEITRLVYDLLPRIKLTDLLVEVDTWTQFSKHFVHLHTQKAPKEKSILFAAI LSDGINLGLSKMADACPNISYSQLAWIADWYIRDESYTKALGDIANFHHKQPFSSYWGEGTTSSSDGQFF RAGGTSSPLAQVNGKYGHDPGLSFYTHISDQYHPFYVQVISSSEEAPHIIDGLLYHETDLQIEEHYTDAA GFVDHIFGMCHMLGFRFSPRIKTIDNHKIYTLSVPTIYDHINFMVGGTIQVKKIRENWDEILRLVSSVRQ GTVTASLILKKLSSYPRQNSLCIALREIGRIERSLYMLKWIQDPEHRRKVQVELNKGESKNSLARAVFFN QLGEIRDRSYEDQLHRASGLQLIISAIVTWNSIYISRAIETLRNNGIHIPEEYIQHISPLGWEHIALTGD YIWNLNQELNFENLRPLRKNRIKQK

>gi|229269487|gb|ACQ51123.1| hypothetical protein BAA_A0023 (plasmid) [Bacillus anthracis str. A0248] MKRNKICTVLFSVALMFSVVFGAFSFPTGASARVMDENSGTYNAPPKSSDTIKFSVTPGYGHLKVFIKNR GQSNLYVSLKHVASGKVYFENKEIRVGDPALEWKSNKEGFPQGVKAGDFELSFSSGGKAAYLDWAYKSAD IIWP

>gi|229269485|gb|ACQ51121.1| conserved hypothetical protein (plasmid) [Bacillus anthracis str. A0248] MDENKRNMLLSFIISILFIFTSLLPFSNNEYVYVISKIGAAAGVINIVMMSVFLYFQSKKTGHLLN

>gi|229269484|gb|ACQ51120.1| spore germination protein GerXA (plasmid) [Bacillus anthracis str. A0248] MKRTVEVNESILRVWFEGCKDVKIMNRKWCADTTTTTILLVYCQHVIDHTKLKQAIAPEMCNDLLQSSFK DSNLLASNSQFSVTTLELENSNENVSRMLFEGKLLIIFQEYKRGYTIDIAKLPTRSIEQSNTEMTIRGSR DGFVEELSTNIGLIRKRLKTSSLSYDEFIIGERTQTKVGLLYLKDVASQETISQVQFKLKEINIDGVVSS AQIEEFITGDQFSLFPLIEYTGRPDYAVNCLLHGRFILLVDGSPTATIAPVSFPFFVNTAEDQNYFYLFG SFVRLLSLFGIAISIFLPGFWVALVTYHPDQIPYTLLATLSLSREGIPFPAPLEGMIMITLFELLRQAGL RIPAAFGQTLSVVGGLIIGQAAISSGFVSPSMVVMIAISVVSTFTLVNQSFTGTLSILRYGVFLMSSFLG IVGFICSILLIVIHVANLRSFGLPFLAPYSPPVFSSMLPSTFRIPFTRMKKRPKELHTYDNTRQRTNNDE NK

>gi|229269482|gb|ACQ51118.1| DNA topoisomerase 1 (plasmid) [Bacillus anthracis str. A0248] MGKTLFIAEKPKVANEIMKSPRFRHSQKYIGSKPYYGYYENDHYIVSWCRGHLLELKNPEEMDPKYKLFQ LEHLPLIFQPSYKVIQENAEQLQILVKLLQRPDVDHAVNICDADREGELIYREVYEYAGVNKKQSRVYKS SFEAAELEAALNRLESASKYDGLAYSAKARQYLDYLLGMNITRGCTTKLAQNKFLLSSGRVQMCLLHEIR QRELAIENYREQSYYHLQLITDLGLKPVMKTEDQVLNPSPLKSLGENLKDQYLTVEDFKEGTRKQNPKLL YNLTDLYKDAHAQLQINAETAKKHIQNLYEEGFITYPRSSSRHLPTEQVDRVKGVMQALAKSRYSLLVQS VDIDAIDIKHKTFDDDLVSSHFAIIPTTKQYQEEGRPEIEKQLYSLVVKRFVGNFMRPAVYLVRDVSLID AMGNTYQIKESVLREKGFLEVFQEEVKEESVETFKVPILQKGQELQIYDFELQESKTKKPALHTESSILT FMETAGRKIDDEHLKELMKGKRIGTVATEAAFIPVLHEKNFIDIEKGKIITTPIGRAFIEQFPVQQIKDP LYTAEMEGMIHRIEKNEMSYENFIAQTNAFVQKITQEIIRIPDTVSYNLIETWKKQIEVCQCPCGNGIIL KGEFTAAIRLKTDLSICFEFPTTDDRTVGKCPLCQSRVIIGKTNYLCEQYKRGCDFIVSGMILEKRITAS QIKKLLEKNMTDTVKGFVSKKTGKSFDAKLTYDSTQKRVTFIYEKKK

>gi|229269481|gb|ACQ51117.1| pXO1-43 (plasmid) [Bacillus anthracis str. A0248] MKLRQATHAVFTVGVACLFLYGERWKTIMNLQFASTTNGGVLSAREAFTNPELVLKHMTEEGATFSVSLQ DIQIPGKLNYYWYDVIIETATNKKIEVNIAVLSQEEYNQLCERDSIDLYLKQTEDAAKFFLEKSKGLFQP EFNNRFTFHHKAQLYL

I just want to use a command to select the protein sequences having 100 or less than hundred amino acid and exclude them from the dataset and get a new dataset containing remaining proteins having >100 amino acid.

ADD REPLYlink modified 12 months ago by genomax67k • written 12 months ago by anis.geb.cu10

For an R solution, see https://stackoverflow.com/questions/8640377/remove-all-rows-where-length-of-string-is-more-than-n for a hint.

Your data is in a horrible, horrible format at the moment however. You've no easy way of demarcating the headers from the sequences as they're separated by a single whitespace and both your headers and sequences also have spaces in.

ADD REPLYlink written 12 months ago by jrj.healey12k
0
gravatar for Hugo
12 months ago by
Hugo150
Universidade de Vigo, Ourense (Spain)
Hugo150 wrote:

You can use SEDA (http://www.sing-group.org/seda/), which has an option to filter sequences by length (http://www.sing-group.org/seda/manual/operations.html#filtering) as well as other types of filtering.

Regards.

ADD COMMENTlink written 12 months ago by Hugo150
1

Thank you very much hlfzeus!! I'm trying this....

ADD REPLYlink written 12 months ago by anis.geb.cu10
0
gravatar for Alex Reynolds
12 months ago by
Alex Reynolds28k
Seattle, WA USA
Alex Reynolds28k wrote:

Here's an awk-based approach, if your FASTA records are single-lined (header on one line, sequence on the next line, and so on):

$ awk '!/^>/ { next } { getline seq } length(seq) > 100 { print $0 "\n" seq }' input.fa > output.fa

If multi-lined (header on one line, sequence split across two or more lines), then the input.fa would need preprocessing to make it single-lined. There are other answers on Biostars that explain how to do this.

Via: http://itrylinux.com/use-awk-to-filter-fasta-file-by-minimum-sequence-length/

ADD COMMENTlink modified 12 months ago • written 12 months ago by Alex Reynolds28k

Explanation in that page is incorrect . Code can be shortened further, some thing like this:

$ awk '/^>/ { getline se } length(se) >100 { print $0 "\n" se }' test.fa

Example fasta: (copy/pasted from the page in the link):

 $ cat test.fa 
>sequence_ID_1
atcgatcgggatcaatgacttcattggagaccgaga
>sequence_ID_2
gatccatggacgtttaacgcgatgacatactaggatcagat

Ouptut from the code in linked page:

 $ awk '!/^>/ { next } { getline seq } length(seq) <37 { print $0 "\n" seq }' test.fa
>sequence_ID_1
atcgatcgggatcaatgacttcattggagaccgaga

output from the in this post:

$ awk '/^>/ { getline se } length(se) < 37 { print $0 "\n" se }' test.fa
>sequence_ID_1
atcgatcgggatcaatgacttcattggagaccgaga

Length of the sequences:

$ awk '!/^>/ { print length($NF) }' test.fa
36
41
ADD REPLYlink modified 12 months ago • written 12 months ago by cpad011211k

Is it incorrect?

$ awk '!/^>/ { next } { getline seq } length(seq) > 35 { print $0 "\n" seq "\n" length(seq) }' /tmp/in.fa
>sequence_ID_1
atcgatcgggatcaatgacttcattggagaccgaga
36
>sequence_ID_2
gatccatggacgtttaacgcgatgacatactaggatcagat
41

Seems like the filter works:

$ awk '!/^>/ { next } { getline seq } length(seq) > 36 { print $0 "\n" seq "\n" length(seq) }' /tmp/in.fa
>sequence_ID_2
gatccatggacgtttaacgcgatgacatactaggatcagat
41

Am I missing something?

ADD REPLYlink written 12 months ago by Alex Reynolds28k

Code is correct and uses double negation. Explanation in the page is incorrect.

!/^>/ { next } { getline seq }
  • !/^>/ - get lines that do not start with > (i.e sequences in fasta).
  • { next } - get next lines of lines that do not start with > (i.e sequence headers)
  • getline - get next lines next to the lines that do not start with > (i.e sequences in fasta- back to step1)

Instead it may have been :

/^>/ { getline seq }
  • /^>/ - get lines with > (i.e headers)
  • get line - get lines next to lines start with > (i.e sequences)

Explanation in the page for !/^>/ {next} is that: (copy/pasted)

If a line (i.e. record) begins with a “>”, go to the next line (record).
ADD REPLYlink modified 12 months ago • written 12 months ago by cpad011211k
0
gravatar for cpad0112
12 months ago by
cpad011211k
India
cpad011211k wrote:

For fitlering sequences with minimum 100 aa/nt length i.e > 100,

$ seqkit seq -m 100 test.fa

For filtering sequences with maximum of 1000 aa/nt length<=1000

 $ seqkit seq -M 1000 test.fa

For filtering sequences between maxiumum size 200 and minimum size of 100:

 $ seqkit seq -m 100 -M 200 test.fa

Download seqkit from here

ADD COMMENTlink modified 12 months ago • written 12 months ago by cpad011211k

I couldn't open seqkit in cmd, it couldn't read $ sign, how can i run seqkit in cmd? please show sequentially.....

ADD REPLYlink written 12 months ago by anis.geb.cu10

Don't include the $ character. That character is there just to remind you that this is a command-line task.

ADD REPLYlink written 12 months ago by Alex Reynolds28k

I have named my file as test.fa and placed it in C\Users\User directory and yet I find this......

C:\Users\User>seqkit seq -m 100 test.fa [ERRO] fastx: open test.fa: The system cannot find the file specified. C:\Users\User>

What can I do now?

ADD REPLYlink written 12 months ago by anis.geb.cu10

Install and use Cygwin or VirtualBox to install Linux on your Windows host, and use Linux to do your filtering.

ADD REPLYlink written 12 months ago by Alex Reynolds28k

what is OS version you are on and where is seqkit located (folder)? If you are on windows 10, you can use linux along with windows 10. (https://docs.microsoft.com/en-us/windows/wsl/install-win10). It is a long procedure. However it makes life easier. Once you install linux, try installing conda in linux. Conda/bioconda has most of the bioinformatics software and easy to install them.

ADD REPLYlink modified 12 months ago • written 12 months ago by cpad011211k

You will also need to install it...

ADD REPLYlink written 12 months ago by jrj.healey12k

For windows 7 and 10:

  1. seqkit installations are described here: https://bioinf.shenwei.me/seqkit/download/.
  2. Windows versions are named .tar.gz. Use a tool like IZarc to extract seqkit.exe (If you are on 64 bit OS, download 64 bit executable)
  3. Move seqkit.exe to C:\WINDOWS\system32 folder (as mentioned in the installation manual)
  4. Now go to the folder wherever the fasta file is present. Let us say Desktop.
  5. Hold shift button and right click on the same time in open area (white space or on freespace on desktop). Not on any file or directory.
  6. A right click menu appears with option: " Open powershell window here" (for windows 10) and "Open Command Window Here" for windows 7.. Click on it.
  7. A powershell/Command window appears and type seqkit.exe version. It should show seqkit version some thing like seqkit v0.8.0..
  8. If it doesn't show, then installation didn't go well. Try to read the installation manual for windows.
  9. Once it shows seqkit version, type dir in the terminal. It should show all the files. To list .fa files, type dir *.fa. It should show only .fa only. This should list your fasta file here.
  10. Now type in seqkit.exe seq -m 100 test.fa -o output.fa (replace test.fa with your file namem output.fa with output file name you want).
  11. Type again "dir". It should list new file.

For ubuntu:

  1. seqkit installations are described here: https://bioinf.shenwei.me/seqkit/download/.
    1. After installing seqkit, type seqkit version in the terminal.
  2. This would print the version of seqkit present on your system. If there are any errors, seqkit is either not in user path or there is installation problem. Refer to point 1.
  3. Once you have seqkit installed, navigate to the directory where your fasta file is located in the terminal. For eg. it is located on desktop and you are on linux/*nix/OS X, type "cd ~/Desktop".
  4. In ther terminal, type ls *.fa or ls.fasta or ls.fna depending on the the extension of fasta file. Seqkit can handle gzipped fasta file as well.
  5. You should see fasta file of interest here. Otherwise, you are in wrong directory. Refer to point 3.
  6. Once you see file of interest in your folder and you want to get all the sequences that are more than 100 amino acid length, type seqkit seq -m 100 test.fa -o output.fa (replace test.fa with your file namem output.fa with output file name you want).
  7. Repeat step 5. You should see output file in your current folder.
ADD REPLYlink modified 12 months ago • written 12 months ago by cpad011211k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 674 users visited in the last hour