Question: fasta file extraction
26 days ago, amitpande74 wrote:

Hi, I have a FASTA file formatted like this, produced from paired-end data:

test_MAPQ.fasta->chr10:141146-141296
test_MAPQ.fasta:atgctcccattccaaatgagagtaattggctaaaacaaaggggctacaggtcccatacaagtccaaaacccaacagggcagtcattaaatcttTTCTAAttttaatttttattttatttgaagttctggggtacatgttcaggatgtata
test_MAPQ.fasta->chr10:142926-143076
--
test_MAPQ.fasta->chr10:146793-146943
test_MAPQ.fasta:gccAAATCATTACTTTTGAAGAAATAGTTAACAATGATTATTTCTTTTTGAATGACaataaattttattaataagttaaacatatttatatgtaatgtaaattttttGTATcgggtgcagtggttcatgcccgttatcctagcactttgg
test_MAPQ.fasta->chr10:146870-147020
test_MAPQ.fasta:aaacatatttatatgtaatgtaaattttttGTATcgggtgcagtggttcatgcccgttatcctagcactttgggaggccaaggtgttaatattgcttgagcaggggagtttgagaccagcctgggaaacatggtgaaacctcatatctac

I want output that includes only the blocks in which both members of the pair are present:

test_MAPQ.fasta->chr10:146793-146943
test_MAPQ.fasta:gccAAATCATTACTTTTGAAGAAATAGTTAACAATGATTATTTCTTTTTGAATGACaataaattttattaataagttaaacatatttatatgtaatgtaaattttttGTATcgggtgcagtggttcatgcccgttatcctagcactttgg
test_MAPQ.fasta->chr10:146870-147020
test_MAPQ.fasta:aaacatatttatatgtaatgtaaattttttGTATcgggtgcagtggttcatgcccgttatcctagcactttgggaggccaaggtgttaatattgcttgagcaggggagtttgagaccagcctgggaaacatggtgaaacctcatatctac

Everything else should be ignored (such as the first block above, which ends in dashes). Is there a tool that can filter the results this way? Kindly help.

awk sed fasta • 116 views
modified 26 days ago by cpad0112 • written 26 days ago by amitpande74

What is the basis of pairing in the OP's example? Is it the number of lines before each -- separator? amitpande74.

$ awk -v RS="--" -v ORS="--" 'NF>3 {print}' test.fa

test_MAPQ.fasta->chr10:146793-146943
test_MAPQ.fasta:gccAAATCATTACTTTTGAAGAAATAGTTAACAATGATTATTTCTTTTTGAATGACaataaattttattaataagttaaacatatttatatgtaatgtaaattttttGTATcgggtgcagtggttcatgcccgttatcctagcactttgg
test_MAPQ.fasta->chr10:146870-147020
test_MAPQ.fasta:aaacatatttatatgtaatgtaaattttttGTATcgggtgcagtggttcatgcccgttatcctagcactttgggaggccaaggtgttaatattgcttgagcaggggagtttgagaccagcctgggaaacatggtgaaacctcatatctac

You can also use awk 'BEGIN{RS="--"} NF>3 {print}' test.fa, but that prints empty lines before and after the sequences.
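
Both commands above set RS to the two-character string "--". gawk and mawk treat a multi-character RS as a regular expression, but POSIX leaves that case unspecified, so other awk implementations may behave differently. As a rough sketch of the same idea that reads the file line by line instead, the function below buffers each block and prints it only if it holds more than three non-empty lines, mirroring the NF>3 test. The names filter_pairs and emit_block are made up for illustration, and the sketch assumes each separator is a line containing exactly "--".

```shell
# Hypothetical helper: keep only "--"-separated blocks with more than 3 lines.
# Assumes the separator is a line consisting of exactly "--".
filter_pairs() {
  awk '
    # Print the buffered block if it is complete, then reset the buffer.
    function emit_block(    i) {
      if (n > 3)
        for (i = 1; i <= n; i++) print buf[i]
      n = 0
    }
    /^--$/ { emit_block(); next }   # separator line: decide on the block so far
    NF     { buf[++n] = $0 }        # store non-empty lines only
    END    { emit_block() }         # the last block has no trailing separator
  ' "$@"
}
```

It could be run as filter_pairs test.fa > pairs_only.fa (the output file name is arbitrary). Because the separator lines and blank lines are dropped, the result contains only the header and sequence lines of the complete blocks.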

modified 26 days ago • written 26 days ago by cpad0112