Question: How to remove the tail of the headers in a multi-FASTA file with sed or another tool
tremblayemilie9 (Canada) wrote, 4.9 years ago:

Hi!

I have a multi-FASTA file with read headers such as:

>ITS1F_A_B10_R_2014_04_24_15_26_33_user_SN2-26_Run_2_for_its_oom_and_phyg_run2withbarcode.fastq_VG6RM_00181_00132
CCTGCGGAAGGATCATTAATGAAAATGTGTTGCCGGGGCCCATAATCCCGGCACTAACCTTCTTATCCATAACACCTGTGCACTGTTGGATGCTTGCATCCACTTTTATACTAAACAATTTGTAACAAATGTAGTCTTATTATAATTAATAAAACTTTTAACAACGGATCTCTTGGCTCTCGCATCGATGAAGAACGCAGC
>ITS1F_A_B10_R_2014_04_24_15_26_33_user_SN2-26-Run_2_for_its_oom_and_phyg_run2withbarcode.fastq_VG6RM_00171_00907
CCTGCGGAAGGATCATTACCGAGTTAGGGTCCTCTGGGGCCGAACCTCCCAACCCTGTGTCTATTGTTACCTTTTAGTTGCTTCGGCGGGCCGGCCGTCCTGACCAACTGGTCTCGCCGGCCGCCGGTCGTGGGTCTCCACGA

 

Now I would like to remove the tail part of my headers, which holds the sequence ID. The tail is different for each read, and I do not know how to handle that. I thought of something like this:

sed s'/^.fastq/s/[^ ]* //'g

but it does not apply for some reason.

I would like to get something like this:

>ITS1F_A_B10_R_2014_04_24_15_26_33_user_SN2-26_Run_2_for_its_oom_and_phyg_run2withbarcode.fastq
CCTGCGGAAGGATCATTAATGAAAATGTGTTGCCGGGGCCCATAATCCCGGCACTAACCTTCTTATCCATAACACCTGTGCACTGTTGGATGCTTGCATCCACTTTTATACTAAACAATTTGTAACAAATGTAGTCTTATTATAATTAATAAAACTTTTAACAACGGATCTCTTGGCTCTCGCATCGATGAAGAACGCAGC
>ITS1F_A_B10_R_2014_04_24_15_26_33_user_SN2-26-Run_2_for_its_oom_and_phyg_run2withbarcode.fastq
CCTGCGGAAGGATCATTACCGAGTTAGGGTCCTCTGGGGCCGAACCTCCCAACCCTGTGTCTATTGTTACCTTTTAGTTGCTTCGGCGGGCCGGCCGTCCTGACCAACTGGTCTCGCCGGCCGCCGGTCGTGGGTCTCCACGA

 

 

 

 

dariober (WCIP | Glasgow | UK) wrote, 4.9 years ago:

What about:

sed 's/fastq_.*/fastq/' myseq.fa

This assumes the string "fastq_" occurs only once, right before the tail of the sequence name. Everything from that "_" to the end of the line is stripped.
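
A slightly more defensive variant, offered as a sketch rather than a definitive fix: it restricts the substitution to header lines (those starting with ">") and escapes the dot so that only a literal ".fastq_" matches. The function name trim_fastq_tail and the sample header are made up for illustration. Note also that the attempt in the question anchors its pattern with "^", so it can only match at the very start of a line, which is one reason it does not apply to these headers.

```shell
# trim_fastq_tail: hypothetical helper name, not part of any existing tool.
# Reads FASTA from the given files (or from stdin) and writes to stdout.
trim_fastq_tail() {
  # /^>/        : only act on header lines
  # \.fastq_.*$ : a literal ".fastq_" and everything after it on that line
  sed '/^>/ s/\.fastq_.*$/.fastq/' "$@"
}

# Made-up two-line example; the header becomes ">sample_run2withbarcode.fastq"
printf '>sample_run2withbarcode.fastq_VG6RM_00181_00132\nCCTGCGGAAGG\n' | trim_fastq_tail
```

The result goes to standard output, so redirect it to a new file (for example trim_fastq_tail myseq.fa > myseq.trimmed.fa) instead of overwriting the input.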

tremblayemilie9 (Canada) wrote, 4.9 years ago:

Hi again,

I also have to remove that sequence number from another file, but in that case it sits in the middle of the header:

>barcodelabel= #ITS2_A_B10_R_2014_02_19_15_00_39_user_SN2-19-FUNGI_OOMYCETE-EMVSAMPLES_et_2014-02-19_RUN1_Fungi_oomycete_Run1_Ana140224.fastq_72JCK_00944_01804;size=52893;
CAGAACCAAGAGATCCGTTGTTGAAAGTTGTAACTATTATGTTTTTTCAGACGCTGATTGCAACTGCAAAGGGTTTGAAT
GTTGTCCAATCGGCGGGCGGACCCGCCGAGGAAACGAAGGTACTCAAAAGACATGGGTAAGAGGTAGCAGACCGAAGTCT
ACAAACTCTAGGTAATGATCCTTCCGCAGGTTCACCTACGGAAACCTTGTTACGACTTTTACTTCCTCTAAATGACCAAG
>barcodelabel= #ITS1F_A_B21_R_2014_02_19_15_00_39_user_SN2-19-FUNGI_OOMYCETE-EMVSAMPLES_et_2014-02-19_RUN1_Fungi_oomycete_Run1_Ana140224.fastq_72JCK_03245_02705;size=33771;
AAGTCGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTCATAATAAGTGTTTTATGGCACTTTTTAAATCCAT
ATCCACCTTGTGTGCAATGTCAGTCGGTCTTCTTTATGGAGATCGGCCAAACATCAACCTAATTTTTAACTCTTTGTCTG
AAAAATATTATGAATAAAATAATTCAAAATACAACTTTCAACAACGGATCTCTTGGCTCTCGCATCGATGAAGAACGCAG
C

 

So I want to keep the size=52893 part but remove the 72JCK_00944_01804 part.

 

 


You might wanna start working on regular expressions more. These come best when you practice a bit. As long as you don't overwrite the file, nothing should go wrong in experimentation. 
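
One way to experiment safely, as a minimal sketch: print only the header lines and pipe them through the sed expression you are testing, so the input file is never written to. The helper name preview_headers is hypothetical.

```shell
# preview_headers: hypothetical helper for trying out a sed expression.
# It prints the header lines as they would look after the substitution
# and never modifies the input file.
preview_headers() {
  expr=$1   # sed expression to try
  file=$2   # FASTA file to read
  grep '^>' "$file" | sed "$expr"
}

# Usage (myseq.fa is the file name used in the answer above):
#   preview_headers 's/fastq_.*/fastq/' myseq.fa | head
```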

In this case, you want to match everything that starts after "fastq_" and ends before the next ";".

Should be easy enough to do that from the answer in your other question on the forum.
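
For reference, one possible way to write that pattern, as a sketch under a few assumptions: the read ID always follows ".fastq_", it never contains ";", and the "_" separator should be removed along with it. The function name strip_read_id and the shortened header are made up for illustration.

```shell
# strip_read_id: hypothetical helper name, not part of any existing tool.
# Reads FASTA from the given files (or from stdin) and writes to stdout.
strip_read_id() {
  # \.fastq_ : a literal ".fastq_"
  # [^;]*    : any run of characters that are not ";" (the read ID)
  # ;        : the ";" that ends the ID, put back by the replacement
  sed '/^>/ s/\.fastq_[^;]*;/.fastq;/' "$@"
}

# Made-up example; the header becomes
# ">barcodelabel= #sample_Ana140224.fastq;size=52893;"
printf '>barcodelabel= #sample_Ana140224.fastq_72JCK_00944_01804;size=52893;\nCAGAACC\n' | strip_read_id
```

Using [^;]* instead of .* keeps the match from running past the first ";", so the size= field is left alone.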

Reply written 4.9 years ago by RamRS