remove sequences with 100% identity from a file having different ids
2
0
Entering edit mode
7.2 years ago

I have a folder with 12 files. Each file is a tab-separated txt file in the format below (one id, count, and sequence per line):

ta_iwgsc_2dl_v1_880448_4767 62385   auagcaucauccauccuaccc
ta_iwgsc_5dl_v1_4475147_17525   62385   auagcaucauccauccuaccc
ta_iwgsc_5ds_v1_2769792_21617   51267   ugaagcugccagcaugaucug
ta_iwgsc_2dl_v1_9826058_5702    16290   uuccaaagggaucgcauugau
ta_iwgsc_4dl_v3_14471626_15454  11824   auagcaucauccauccuacca
ta_iwgsc_4dl_v3_14415829_14746  11824   auagcaucauccauccuacca
ta_iwgsc_3ds_v1_2039022_12082   4161    gcucacccucucucugucagc

Each file has different ids for the same sequences. For every sequence, I need to extract all lines sharing that sequence into a new file, for each file in that folder. For example:

Folder name: common, containing 12 txt files in the format above.

Example file name: CC1. The result should be a new file with this format:

ta_iwgsc_2dl_v1_880448_4767 62385   auagcaucauccauccuaccc
ta_iwgsc_5dl_v1_4475147_17525   62385   auagcaucauccauccuaccc

file 2:

ta_iwgsc_4dl_v3_14471626_15454  11824   auagcaucauccauccuacca
ta_iwgsc_4dl_v3_14415829_14746  11824   auagcaucauccauccuacca

I am fine with Perl, R, or Python scripts; tools for extraction are also fine.
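A minimal Python sketch of one approach (the folder path, the `*.txt` pattern, and the output directory name are assumptions): read every tab-separated file in the folder, group lines by the third (sequence) column, and write each group that contains more than one line to a file named after the sequence.

```python
import glob
import os
from collections import defaultdict

def group_by_sequence(folder):
    """Group lines from every .txt file in `folder` by their third
    (sequence) column; return {sequence: [line, ...]}."""
    groups = defaultdict(list)
    for path in glob.glob(os.path.join(folder, "*.txt")):
        with open(path) as fh:
            for line in fh:
                fields = line.rstrip("\n").split("\t")
                if len(fields) >= 3:
                    groups[fields[2]].append(line.rstrip("\n"))
    return groups

def write_common(folder, outdir):
    """Write one <sequence>.txt per sequence that occurs on
    more than one line across the folder's files."""
    os.makedirs(outdir, exist_ok=True)
    for seq, lines in group_by_sequence(folder).items():
        if len(lines) > 1:  # keep only sequences seen more than once
            with open(os.path.join(outdir, seq + ".txt"), "w") as out:
                out.write("\n".join(lines) + "\n")
```

Run it as e.g. `write_common("common", "common_out")`; sequences that occur only once produce no output file.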

sequence • 1.1k views
ADD COMMENT
0
Entering edit mode

I modified your post for readability with Markdown code mark-up, using the 101010 button.

ADD REPLY
1
Entering edit mode
7.2 years ago
cut -f 3 *.tsv | sort | uniq > uniq_seqs.txt

while read -r seq; do
    grep -w "$seq" *.tsv > "$seq.txt"
done < uniq_seqs.txt

You can use csvtk uniq (usage)

for f in *.tsv; do
    csvtk uniq -H -t -f 3 "$f" > "$f.uniq"
done

ADD COMMENT
0
Entering edit mode

Damn it, for so many problems there is a neat solution using your tools :-D

ADD REPLY
0
Entering edit mode

Believe me, you'll like it

ADD REPLY
0
Entering edit mode
7.2 years ago

Let's break this down:

  1. You need to find the sequences which are duplicates

You could use the following code:

cut -f3 yourfile.txt | sort | uniq -d > duplicatesequences.txt

  2. You need to filter the original file, generating new files in a loop. Each generated filename equals the sequence itself.

For which you could use the following loop:

for line in $(cat duplicatesequences.txt)
do
    grep -w "$line" yourfile.txt > "${line}.txt"
done
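The same two steps can be sketched in Python (`yourfile.txt` is the same placeholder as in the shell version, and the optional output directory is an assumption): first count sequences to find the duplicated ones, then write each duplicated group to its own file.

```python
import os
from collections import Counter, defaultdict

def duplicate_sequences(path):
    """Return the set of sequences (third column) occurring more than
    once -- the Python analogue of `cut -f3 | sort | uniq -d`."""
    counts = Counter()
    with open(path) as fh:
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 3:
                counts[fields[2]] += 1
    return {seq for seq, n in counts.items() if n > 1}

def extract_duplicates(path, outdir="."):
    """Write every line whose sequence is duplicated to
    <outdir>/<sequence>.txt, mirroring the `grep -w` loop."""
    dups = duplicate_sequences(path)
    grouped = defaultdict(list)
    with open(path) as fh:
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 3 and fields[2] in dups:
                grouped[fields[2]].append(line.rstrip("\n"))
    for seq, lines in grouped.items():
        with open(os.path.join(outdir, seq + ".txt"), "w") as out:
            out.write("\n".join(lines) + "\n")
```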
ADD COMMENT
0
Entering edit mode

Thank you, I will try this.

ADD REPLY
0
Entering edit mode

Will this work on a folder containing file1, file2, file3, etc., where seq1 from file1 is matched against all sequences in file1 and also against all sequences in file2, file3, and so on, and then output a final file named sequence_name.txt?

ADD REPLY
0
Entering edit mode

I don't understand this additional question, please rephrase.

ADD REPLY
0
Entering edit mode

To be honest, I don't get what you described.

ADD REPLY
0
Entering edit mode

Simple: you have a folder with some files, all in the same format (shown above). Take sequence_1 from file_1, say "ugacugacugac", and look for that sequence in every file in the folder. Then extract all matching lines into a new file named "ugacugacugac.txt". Do this for every sequence in every file.
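A hedged sketch of exactly this cross-file variant (folder path, file pattern, and output naming are assumptions): collect every line from every file, keyed by sequence, and write each group to `<sequence>.txt` with the source filename prepended, much like `grep`'s `file:match` output.

```python
import glob
import os
from collections import defaultdict

def collect_across_files(folder, pattern="*.txt"):
    """Map each sequence to [(source_filename, line), ...] gathered
    from every file in `folder` matching `pattern`."""
    hits = defaultdict(list)
    for path in sorted(glob.glob(os.path.join(folder, pattern))):
        name = os.path.basename(path)
        with open(path) as fh:
            for line in fh:
                fields = line.rstrip("\n").split("\t")
                if len(fields) >= 3:
                    hits[fields[2]].append((name, line.rstrip("\n")))
    return hits

def write_per_sequence(folder, outdir):
    """Write one <sequence>.txt per sequence, each line prefixed
    with the file it came from."""
    os.makedirs(outdir, exist_ok=True)
    for seq, entries in collect_across_files(folder).items():
        with open(os.path.join(outdir, seq + ".txt"), "w") as out:
            for name, line in entries:
                out.write(name + "\t" + line + "\n")
```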

ADD REPLY
0
Entering edit mode

See my updated answer.

ADD REPLY
