Removing reads based on a pattern in the sequence name
1
0
Entering edit mode
8.1 years ago
fufuyou ▴ 110

Hi, I have some reads like as:

>seq_5150639_x4
GGAACGAGATCGTCTGCAGTTGGC
>seq_5150619_x40
AACCGCCTGTAGAAATGCATGATT

X4 or X40 indicate how many reads same with the read. I want to remove lower than 10. How can I remove them? Thanks, Fuyou

RNA-Seq • 1.1k views
ADD COMMENT
0
Entering edit mode

Do you mean you want to remove reads whose ids containing x10, x9 ... ?

ADD REPLY
0
Entering edit mode

the anwser is awk, again.

please, validate your previous questions: sequence head change! ; Add a sequence name! ; ...

ADD REPLY
2
Entering edit mode
8.1 years ago

Assuming the FASTA input is single-line:

  1. Parse each FASTA record header with awk and get the last part of the record header, split by the underscore (_)
  2. Strip the first character from the last part and cast the rest to an integer
  3. If the integer is greater than or equal to the threshold (10), set a flag value to a true-like value and print the header
  4. Print the FASTA record sequence if the flag is set to a true-like value

For example:

$ awk '{ \
    if ($0 ~ /^>/) { \
      n = split($0, a, "_");  \
      r = int(substr(a[n], 2)); \
      f = 0; \
      if (r >= 10) { \
        f = 1; \
        print $0; \
      } \
    } \
    else if (f) { \
      print $0; \
    } \
  }' input.fa > output.fa
ADD COMMENT
0
Entering edit mode

Thanks, It works well. Fuyou

ADD REPLY

Login before adding your answer.

Traffic: 3003 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6