File Without Redundancy Using Awk
10.9 years ago
thiagomafra ▴ 70

Hi people,

I have two FASTA files with redundant reads. I want a single file containing all reads without redundancy, using awk. Can someone help me?

awk reads • 3.2k views

Can I encourage users not to answer questions which fail the "what have you tried" test?


Seems the answer is no, I cannot :)

10.9 years ago
Matt LaFave ▴ 310

If the IDs of the sequences (the bit after the >) are the same for identical sequences, you could do something like this:

cat file1.fa file2.fa | awk '{if($1 ~ /^>/){name=$1}else{print name"\t"$1}}' | sort | uniq | awk '{print $1"\n"$2}'

If the IDs are not the same, and you're only interested in the sequences themselves, you could get those with sed (the 2~2 address is a GNU sed extension, and this assumes each sequence sits on a single line):

cat file1.fa file2.fa | sed -n '2~2p' | sort | uniq
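
Both commands above assume one sequence line per record. As a rough sketch for files where sequences wrap over several lines, the following awk-only version joins each record's sequence lines and keeps the first record seen for every distinct sequence. The function name dedup_fasta is hypothetical, and the choice to keep the first header for a duplicated sequence is an assumption, not something stated in the question.

```shell
# Hypothetical helper: print each distinct sequence once, with the header
# of the first record that carried it. Wrapped sequences are joined first.
dedup_fasta() {
  awk '
    # Print the current record unless its sequence was already seen.
    function emit() {
      if (!(seq in seen)) {
        seen[seq] = 1
        print name
        print seq
      }
    }
    # A header line closes the previous record and starts a new one.
    /^>/ { if (name != "") emit(); name = $0; seq = ""; next }
    # Any other line is part of the current sequence.
    { seq = seq $0 }
    # Flush the last record.
    END { if (name != "") emit() }
  ' "$@"
}

# Usage (file names taken from the answer above):
#   dedup_fasta file1.fa file2.fa > nonredundant.fa
```

This holds every distinct sequence in memory, so for very large read sets the sort-based pipelines may be the more practical choice.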

Hi Matt, your first command line could be simpler:

awk '{printf "%s%s", $0, (/^>/ ? "\t" : "\n")}' file1.fa file2.fa | sort -u | tr "\t" "\n"

You can use a conditional expression to shorten the awk command; sort has a -u option to remove duplicates. In your second example, you can use awk to select only the sequences:

awk '! /^>/' file1.fa file2.fa | sort -u
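
If the original order of the sequences matters, sort -u will not preserve it. A minimal sketch of an alternative is the common awk idiom of counting lines in an array and printing a line only the first time it appears. The function name unique_seqs and the array name seen are illustrative, and like the command above this assumes one sequence per line.

```shell
# Hypothetical helper: print non-header lines, each only on first occurrence,
# keeping input order (no sort needed).
unique_seqs() {
  # !/^>/         skips header lines
  # !seen[$0]++   is true only the first time a given line is encountered
  awk '!/^>/ && !seen[$0]++' "$@"
}

# Usage:
#   unique_seqs file1.fa file2.fa
```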


Good to know - thanks!

10.9 years ago
Rm 8.3k

Try CD-HIT or USEARCH. See the similar question: Generating a non-redundant gene set

10.9 years ago
ewre ▴ 250

Not sure if you are looking for tools to remove duplicate lines. If so, this can be done in Vim: in command mode, enter :sort and press Enter to run it, then enter :g/^\(.*\)$\n\1$/d

9.8 years ago
Prakki Rama ★ 2.7k

This is not awk, but can be used for the same purpose. See FASTQ/A Collapser.
