Question: Passing a file to sed to remove from file A all lines that match file B, plus the line following each match
sdbaney wrote:

Hi, I have a FASTA file that contains sequences from a de novo assembly. We have identified rRNA sequences that we are now trying to remove from that FASTA file. I can remove lines with sed but have to do it per line in a script and I have about 500 sequences that I need to remove. Is there a way that I can write this to take the matching sequences from file B (the rRNA sequences) and remove them and the line following (the actual sequence in the FASTA file) from file A? I have tried grep and comm but grep gives me a byte error and comm didn't make any difference to my files.

Any guidance would be greatly appreciated.

sed grep comm • 237 views
modified 10 weeks ago • written 10 weeks ago by sdbaney

A safer way, using seqkit:

seqkit grep -v -f <fileb> input.fasta

written 10 weeks ago by cpad0112

I get it to run successfully and it prints the result to the screen but the result doesn't have anything from fileb removed from it. It's exactly the same as the input. Here is my code:

$ seqkit grep -v -f ~/Desktop/kettinAlignment-wwST-strict-3\ of\ 10hits.fasta ~/Desktop/kettinAlignment-wwST-strict-10hits.fasta
modified 10 weeks ago • written 10 weeks ago by sdbaney

Noam Teyssier wrote:

You can try an fgrep approach, which works fairly well:

fgrep -v -f file_b.fa file_a.fa  > filtered.fa
written 10 weeks ago by Noam Teyssier

It keeps returning "illegal byte sequence"

written 10 weeks ago by sdbaney

This may help: https://stackoverflow.com/a/19770395/8767800

written 10 weeks ago by Noam Teyssier

finswimmer wrote:
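
A note on the "illegal byte sequence" error: with BSD/macOS tools it typically appears when the input contains bytes that are not valid in the current locale's encoding, and one commonly suggested workaround is to run the command under the C locale. The sketch below wraps the line-based filter from this answer that way. The function name `filter_lines` is hypothetical, `grep -F` stands in for `fgrep`, and the `-x` flag (whole-line matches only) is an addition that was not in the original command.

```shell
# Hypothetical helper: filter_lines PATTERNS_FILE INPUT_FILE
# Prints the lines of INPUT_FILE that are not exactly equal to any line
# of PATTERNS_FILE.
filter_lines() {
    # LC_ALL=C asks grep to treat the input as raw bytes rather than decoding
    # it in the current locale's encoding (assumed workaround for the error).
    # -F: fixed strings, -x: whole-line matches only (an addition),
    # -v: invert the match, -f: read patterns from a file.
    LC_ALL=C grep -F -x -v -f "$1" "$2"
}
```

It could be called as `filter_lines file_b.fa file_a.fa > filtered.fa`. Because the filtering is line-based, a sequence is only removed completely if it is wrapped onto lines identically in both files.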

Use seqkit grep:

$ cat input.fasta | seqkit grep -v -f list > new.fa

fin swimmer

written 10 weeks ago by finswimmer

Hi! So far this one has gotten me the furthest. I get an output file, but it is the same as my input file: it hasn't removed any of the sequences that I list in the second file. Here is my code; do you notice anything incorrect?

cat ~/Desktop/kettinAlignment-wwST-strict-10hits.fasta | seqkit grep -v -f ~/Desktop/kettinAlignment-wwST-strict-3\ of\ 10hits.fasta > out3.fasta
written 10 weeks ago by sdbaney

Hello,

The file following -f has to contain just the IDs of the sequences you would like to remove. If you only have a FASTA file, you can extract these IDs with:

$ grep "^>" ~/Desktop/kettinAlignment-wwST-strict-3\ of\ 10hits.fasta|sed 's/^>//' > fastaids.txt

fin swimmer

written 10 weeks ago by finswimmer
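
As a variation on the extraction step above, here is a minimal sketch that keeps only the first whitespace-delimited word of each header, which is what an ID list normally needs when headers also carry a description. The function name `fasta_ids` is hypothetical, and the assumption that the ID is the first word of the header is mine, not something stated in the thread.

```shell
# Hypothetical helper: fasta_ids FASTA_FILE
# Prints one sequence ID per line, without the leading ">".
fasta_ids() {
    awk '
        /^>/ {
            sub(/\r$/, "")   # drop a trailing carriage return, if any
            sub(/^>/, "")    # drop the FASTA header marker
            print $1         # first word of the header, assumed to be the ID
        }
    ' "$1"
}
```

For example, `fasta_ids rrna.fasta > fastaids.txt` (file names here are placeholders) would produce a list that can then be passed to `seqkit grep -v -f fastaids.txt input.fasta`, as in the answer above.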

Oh okay! I can easily just have the IDs. Should it be a .txt file?

I ran the following and still got back a FASTA file with all of the IDs; none were taken out.

$ cat ~/Desktop/kettinAlignment-wwST-strict-10hits.fasta | seqkit grep -v -f ~/Desktop/remove.txt > new3.fasta

The text file has the IDs listed. I tried it once including the > and once without, to see if that was what was throwing it off, but I still get the same result.

modified 10 weeks ago • written 10 weeks ago by sdbaney

Please show the first few lines of kettinAlignment-wwST-strict-10hits.fasta and remove.txt.

written 10 weeks ago by finswimmer
$ cat ~/Desktop/kettinAlignment-wwST-strict-10hits.fasta 
>TRINITY_DN25900_c0_g2
CCGTTCTTTTGTACTTGTTATAATCTTTGAAGAAATCTGAGTTTGTTCATCCAGTGAGTG
AACAAGCTAAGATTTCTTCAAAGATTATAACAAGTACAAAAGAACGTAAAGAGGTTGTGT
CTGAAGAAACTAAGATTCAAATTGGAAAATATTAGTTTTTGCTTACTAGAAAAATGAATA
AATGTATGAACATTCATTTACAGTTTCAACAATGATGGTTATGCAGAAAGATTGGATAGT
TGGTAGTCTTTATGATCATGTGTTATCTATTGCCATTGTTCATCTCAAAATATTGATGAA
ATGCATCCAGGCCACTCCCCACTATTCATAGCATGTTTCCCTATTTCCTTCCCTATCTGT
GGAACCATATAAAAAGATAGTTCCACAATCAGAAGAAGTACACCTGAAATTAGCCAGTAC
ATCTGTTGTTCCTACAAAAGAAACTACAGTTGTTATTAGTGAAGAACACAAACCTGAAGA
GAAAGTATCAGTTGTTGTAGCAGAGTCACAAGTTGTGTCTGAAGAAAAGTGTTTGAAGAA
GTTCAATTTGAATATACAGCTGTTGCAACAAATGAATGTGGAAAAGTTACAACTTCAGCA
TACATCACAATTCTAGATCAAAGATGTTCCTTCACAAATGAAAATTAATATTGAATCTAA
ACAAGATTTCTCCAGAAAAAGCAATTGAACTTAAAAAGACAGAGAAAGTAGTTAAAAGAA
modified 10 weeks ago by finswimmer • written 10 weeks ago by sdbaney
$ cat ~/Desktop/remove.txt 
{\rtf1\ansi\ansicpg1252\cocoartf1561\cocoasubrtf600
{\fonttbl\f0\fswiss\fcharset0 Helvetica;}
{\colortbl;\red255\green255\blue255;}
{\*\expandedcolortbl;;}
\margl1440\margr1440\vieww10800\viewh8400\viewkind0
\pard\tx720\tx1440\tx2160\tx2880\tx3600\tx4320\tx5040\tx5760\tx6480\tx7200\tx7920\tx8640\pardirnatural\partightenfactor0

\f0\fs24 \cf0 >TRINITY_DN25900_c0_g2\

Hmm, this is weird. When I open the txt file, it just lists:

>TRINITY_DN25900_c0_g2
>TRINITY_DN24782_c0_g2

I then created a plain-text file from the command line with nano, but when I run the command I still get my input back, with no sequences removed.

Here is the command

$ cat ~/Desktop/kettinAlignment-wwST-strict-10hits.fasta | seqkit grep -v -f ~/remove.txt > newfasta1.fasta

Here is the text file:

$ cat remove.txt 
>TRINITY_DN25900_c0_g2
>TRINITY_DN24782_c0_g1
modified 10 weeks ago by finswimmer • written 10 weeks ago by sdbaney
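
The `{\rtf1` header in the first `cat ~/Desktop/remove.txt` output above indicates that the file was saved as rich text (RTF) rather than plain text, so none of its lines was a bare ID. The following sketch checks an ID list for that and for two other problems that would stop IDs from matching. The function name `check_id_list` is hypothetical, and the three checks are an illustrative selection, not an exhaustive one.

```shell
# Hypothetical helper: check_id_list IDS_FILE
# Prints a message for each problem found; returns 1 if any was found.
check_id_list() {
    file=$1
    status=0
    # RTF documents begin with the literal characters {\rtf
    if [ "$(head -c 5 "$file")" = '{\rtf' ]; then
        echo "$file: looks like RTF, not plain text"
        status=1
    fi
    # seqkit grep -f expects bare IDs, without the FASTA ">" marker
    if grep -q '^>' "$file"; then
        echo "$file: some lines start with '>'"
        status=1
    fi
    # Carriage returns (Windows line endings) become part of each ID
    if grep -q "$(printf '\r')" "$file"; then
        echo "$file: contains carriage returns"
        status=1
    fi
    return "$status"
}
```

Running `check_id_list ~/remove.txt` before the seqkit command would flag the leading `>` characters in the nano-created file shown above.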

Remove the > at the start of each line in remove.txt.

written 10 weeks ago by finswimmer

Just use seqkit seq -n -i seqs.fa to retrieve IDs.

written 10 weeks ago by shenwei356

sdbaney wrote:
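
For comparison, the ID-based removal discussed in this thread can also be sketched in plain awk, without seqkit. This is a swapped-in technique, not how seqkit works internally. The function name `drop_fasta_records` is hypothetical, and the sketch assumes that a record's ID is the first word of its header line. It removes each matching header together with all sequence lines up to the next header, so it also handles sequences wrapped over several lines.

```shell
# Hypothetical helper: drop_fasta_records IDS_FILE FASTA_FILE
# Prints FASTA_FILE without the records whose ID appears in IDS_FILE.
drop_fasta_records() {
    awk '
        # First file: collect the IDs to drop. A leading ">" and a trailing
        # carriage return are tolerated; empty lines are ignored.
        FILENAME == ARGV[1] {
            sub(/\r$/, ""); sub(/^>/, "")
            if ($1 != "") drop[$1] = 1
            next
        }
        # Header line: decide whether the record that starts here is kept.
        /^>/ {
            id = substr($1, 2)
            sub(/\r$/, "", id)
            keep = !(id in drop)
        }
        # Print headers and sequence lines of kept records only.
        keep
    ' "$1" "$2"
}
```

A call such as `drop_fasta_records remove.txt input.fasta > filtered.fasta` (placeholder file names) writes the remaining records to a new file and leaves the input untouched.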

I wanted to give an update: I was able to accomplish this by following this post:

faSomeRecords

modified 10 weeks ago • written 10 weeks ago by sdbaney