remove sequences with 100% identity from a file having different ids
2
0
Entering edit mode
7.2 years ago

I have a folder with 12 files. Each file is a tab-separated txt file in the format below (one id, count, and sequence per line):

ta_iwgsc_2dl_v1_880448_4767 62385   auagcaucauccauccuaccc
ta_iwgsc_5dl_v1_4475147_17525   62385   auagcaucauccauccuaccc
ta_iwgsc_5ds_v1_2769792_21617   51267   ugaagcugccagcaugaucug
ta_iwgsc_2dl_v1_9826058_5702    16290   uuccaaagggaucgcauugau
ta_iwgsc_4dl_v3_14471626_15454  11824   auagcaucauccauccuacca
ta_iwgsc_4dl_v3_14415829_14746  11824   auagcaucauccauccuacca
ta_iwgsc_3ds_v1_2039022_12082   4161    gcucacccucucucugucagc

Each file has different ids for the same sequences. For every sequence, I need to extract all lines sharing that sequence into a new file, for each file in that folder. For example:

Folder name: common, containing 12 txt files in the format above.

Example file name: CC1. The result should be a new file with this format:

ta_iwgsc_2dl_v1_880448_4767 62385   auagcaucauccauccuaccc
ta_iwgsc_5dl_v1_4475147_17525   62385   auagcaucauccauccuaccc

file 2:

ta_iwgsc_4dl_v3_14471626_15454  11824   auagcaucauccauccuacca
ta_iwgsc_4dl_v3_14415829_14746  11824   auagcaucauccauccuacca

I am fine with Perl, R, or Python scripts; tools for extraction are also fine.
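A minimal Python sketch of one approach (the folder path, the `*.txt` pattern, and the output directory name are assumptions): read every tab-separated file in the folder, group lines by the third (sequence) column, and write each group that contains more than one line to a file named after the sequence.

```python
import glob
import os
from collections import defaultdict

def group_by_sequence(folder):
    """Group lines from every .txt file in `folder` by their third
    (sequence) column; return {sequence: [line, ...]}."""
    groups = defaultdict(list)
    for path in glob.glob(os.path.join(folder, "*.txt")):
        with open(path) as fh:
            for line in fh:
                fields = line.rstrip("\n").split("\t")
                if len(fields) >= 3:
                    groups[fields[2]].append(line.rstrip("\n"))
    return groups

def write_common(folder, outdir):
    """Write one <sequence>.txt per sequence that occurs on
    more than one line across the folder's files."""
    os.makedirs(outdir, exist_ok=True)
    for seq, lines in group_by_sequence(folder).items():
        if len(lines) > 1:  # keep only sequences seen more than once
            with open(os.path.join(outdir, seq + ".txt"), "w") as out:
                out.write("\n".join(lines) + "\n")
```

Run it as e.g. `write_common("common", "common_out")`; sequences that occur only once produce no output file.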

sequence • 1.1k views
ADD COMMENT
0
Entering edit mode

I modified your post for readability with Markdown code mark-up, using the 101010 button.

ADD REPLY
1
Entering edit mode
7.2 years ago
cut -f 3 *.tsv | sort | uniq > uniq_seqs.txt

while read -r seq; do
    grep -w "$seq" *.tsv > "$seq.txt"
done < uniq_seqs.txt

You can use csvtk uniq (usage)

for f in *.tsv; do
    csvtk uniq -H -t -f 3 "$f" > "$f.uniq"
done

ADD COMMENT
0
Entering edit mode

Damn it, for so many problems there is a neat solution using your tools :-D

ADD REPLY
0
Entering edit mode

Believe me, you'll like it

ADD REPLY
0
Entering edit mode
7.2 years ago

Let's break this down:

  1. You need to find the sequences which are duplicates

You could use the following code:

cut -f3 yourfile.txt | sort | uniq -d > duplicatesequences.txt

  2. You need to filter the original file, generating new files in a loop. Each generated filename equals the sequence itself.

For which you could use the following loop:

for line in $(cat duplicatesequences.txt)
do
    grep -w "$line" yourfile.txt > "${line}.txt"
done
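The same two steps can be sketched in Python (`yourfile.txt` is the same placeholder as in the shell version, and the optional output directory is an assumption): first count sequences to find the duplicated ones, then write each duplicated group to its own file.

```python
import os
from collections import Counter, defaultdict

def duplicate_sequences(path):
    """Return the set of sequences (third column) occurring more than
    once -- the Python analogue of `cut -f3 | sort | uniq -d`."""
    counts = Counter()
    with open(path) as fh:
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 3:
                counts[fields[2]] += 1
    return {seq for seq, n in counts.items() if n > 1}

def extract_duplicates(path, outdir="."):
    """Write every line whose sequence is duplicated to
    <outdir>/<sequence>.txt, mirroring the `grep -w` loop."""
    dups = duplicate_sequences(path)
    grouped = defaultdict(list)
    with open(path) as fh:
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 3 and fields[2] in dups:
                grouped[fields[2]].append(line.rstrip("\n"))
    for seq, lines in grouped.items():
        with open(os.path.join(outdir, seq + ".txt"), "w") as out:
            out.write("\n".join(lines) + "\n")
```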
ADD COMMENT
0
Entering edit mode

Thank you, I will try this.

ADD REPLY
0
Entering edit mode

Will this work on a folder containing file1, file2, file3, etc., where seq1 from file1 is matched against all sequences in file1 and also against all sequences in file2, file3, and so on, and then output a final file named sequence_name.txt?

ADD REPLY
0
Entering edit mode

I don't understand this additional question, please rephrase.

ADD REPLY
0
Entering edit mode

To be honest, I don't get what you described.

ADD REPLY
0
Entering edit mode

Simple: you have a folder with some files, all in the same format (shown above). Take sequence_1 from file_1, say "ugacugacugac", and look for that sequence in every file in the folder. Then extract all matching lines into a new file named "ugacugacugac.txt". Do this for every sequence in every file.
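A hedged sketch of exactly this cross-file variant (folder path, file pattern, and output naming are assumptions): collect every line from every file, keyed by sequence, and write each group to `<sequence>.txt` with the source filename prepended, much like `grep`'s `file:match` output.

```python
import glob
import os
from collections import defaultdict

def collect_across_files(folder, pattern="*.txt"):
    """Map each sequence to [(source_filename, line), ...] gathered
    from every file in `folder` matching `pattern`."""
    hits = defaultdict(list)
    for path in sorted(glob.glob(os.path.join(folder, pattern))):
        name = os.path.basename(path)
        with open(path) as fh:
            for line in fh:
                fields = line.rstrip("\n").split("\t")
                if len(fields) >= 3:
                    hits[fields[2]].append((name, line.rstrip("\n")))
    return hits

def write_per_sequence(folder, outdir):
    """Write one <sequence>.txt per sequence, each line prefixed
    with the file it came from."""
    os.makedirs(outdir, exist_ok=True)
    for seq, entries in collect_across_files(folder).items():
        with open(os.path.join(outdir, seq + ".txt"), "w") as out:
            for name, line in entries:
                out.write(name + "\t" + line + "\n")
```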

ADD REPLY
0
Entering edit mode

See my updated answer.

ADD REPLY
