Question

Removing reads based on a pattern in the sequence name

8.1 years ago

fufuyou ▴ 110

Hi, I have some reads like these:

>seq_5150639_x4
GGAACGAGATCGTCTGCAGTTGGC
>seq_5150619_x40
AACCGCCTGTAGAAATGCATGATT

The x4 or x40 suffix indicates how many reads are identical to that read. I want to remove reads with a count lower than 10. How can I remove them? Thanks, Fuyou

RNA-Seq • 1.1k views

updated 8.1 years ago by Istvan Albert 100k • written 8.1 years ago by fufuyou ▴ 110

Do you mean you want to remove reads whose IDs contain x10, x9, ...?

8.1 years ago by venu 7.1k

The answer is awk, again.

Please validate your previous questions: "sequence head change!"; "Add a sequence name!"; ...

8.1 years ago by Pierre Lindenbaum 161k

score 2 · Answer 1 · 2016-04-06

Assuming the FASTA input is single-line:

1. Parse each FASTA record header with awk and take the last part of the header, split on the underscore (_).
2. Strip the first character (the x) from that last part and cast the rest to an integer.
3. If the integer is greater than or equal to the threshold (10), set a flag to a true value and print the header; otherwise clear the flag.
4. Print the record's sequence if the flag is set.

For example:

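$ awk '{
    if ($0 ~ /^>/) {
      # Header line: the read count follows "x" in the last "_"-separated field
      n = split($0, a, "_")
      r = int(substr(a[n], 2))
      # Reset the flag for every record, then set it only if the count is at least 10
      f = 0
      if (r >= 10) {
        f = 1
        print $0
      }
    }
    else if (f) {
      # Sequence line of a record that passed the threshold
      print $0
    }
  }' input.fa > output.fa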
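
As a quick check, here is a minimal sketch that runs the same filter on the two reads from the question. The file names `sample.fa` and `filtered.fa` and the variable names `min` and `keep` are assumptions made for illustration; the threshold is passed in with `-v` so it can be changed without editing the awk program.

```shell
# Build a small test file from the two reads in the question
# (sample.fa and filtered.fa are hypothetical file names).
printf '%s\n' \
  '>seq_5150639_x4' \
  'GGAACGAGATCGTCTGCAGTTGGC' \
  '>seq_5150619_x40' \
  'AACCGCCTGTAGAAATGCATGATT' > sample.fa

# Keep records whose count (the number after "x" in the last
# underscore-separated field of the header) is at least min.
awk -v min=10 '
  /^>/ {
    n = split($0, a, "_")
    keep = (int(substr(a[n], 2)) >= min + 0)
  }
  keep   # prints header and sequence lines while keep is true
' sample.fa > filtered.fa

cat filtered.fa
```

With this input, only the `seq_5150619_x40` record should remain in `filtered.fa`, since the other read has a count of 4.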