Question

Trim Poly-A Tail And Trailing Nucleotides

4

Entering edit mode

10.2 years ago

jon.brate ▴ 310

I have smallRNA seq data that were produced by poly-adenylating the 3' end prior to PCR amplification. For this reason many of the reads contain the poly-a stretches. But since the sequenced fragments were so short, sometimes the tail is sequenced through and there are nucleotides on the 3' end of the poly-a tail. Like this:

TACTGAATGGCAGTGATGATAAAAAAAAAAAAAAAAAAAATTCCGCC

I have used PRINSEQ to remove the poly-a tail, but it seems that it only remove tails at the very end of the sequence. Anyone know how i can remove tails including the following nucleotides like the read above?

Thanks,

Jon

trimming rna-seq • 10k views

ADD COMMENT • link updated 10.2 years ago by gammyknee ▴ 210 • written 10.2 years ago by jon.brate ▴ 310

score 4 · Answer 1 · 2014-02-17

4

Entering edit mode

10.2 years ago

Pierre Lindenbaum 161k

using awk ?:

gunzip -c fastq.gz | awk 'NR%4==2 {gsub(/AAAAA.*$/,"");} {print}'

edit: 2017:

 awk 'NR%4==2 {gsub(/AAAAA.*$/,"");N=length($0);} NR%4==0 {$0=substr($0,length($0)-N);} {print}'

ADD COMMENT • link 6.5 years ago by Pierre Lindenbaum 161k

1

Entering edit mode

I think this is a bit aggressive, now you can't have 5A's anywhere inside the reads.

ADD REPLY • link 10.2 years ago by Michael 54k

0

Entering edit mode

Thanks! It seems to work fine. I don't fully understand this, but reads like this is not trimmed: ACGAGTAGGGGAAAAAAAAAAC. Can you explain why? Btw, do you think 10 A's is too short for a poly-a stretch?

ADD REPLY • link 10.2 years ago by jon.brate ▴ 310

0

Entering edit mode

echo -e "1\nACGAGTAGGGGAAAAAAAAAAC\n3\n4" | awk 'NR%4==2 {gsub(/AAAAA.*$/,"");} {print}' returns ACGAGTAGGGG here

ADD REPLY • link 10.2 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

This line gives me the same result as yours. But not when I run the command on my file of reads. Maybe I should have mentioned that I have a fasta file. It could be something with the fasta headers? They are like this: >HWI-ST486:386:D1UMHACXX:3:1101:2632:2144 1:N:0:TGACCA

But the command works fine for the read I showed in the question...

ADD REPLY • link 10.2 years ago by jon.brate ▴ 310

0

Entering edit mode

Can you tell me how to delete the correspondig quality value of the trimmed poliA from the fastq?

THANKS

ADD REPLY • link 9.2 years ago by damiano.bassani ▴ 10

0

Entering edit mode

Thanks for suggesting this awk command. It works perfectly. However, only the reads are trimmed and not the quality scores. Because of this the trimmed fastq file cannot be used for tophat, for instance. Can you please suggest a command to trim quality scores to match the reads trimmed using your awk command? I appreciate your help.

ADD REPLY • link 6.5 years ago by mmitra ▴ 60

1

Entering edit mode

I've updated, can you please try the new command ?

ADD REPLY • link 6.5 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Thanks so much! It works!

ADD REPLY • link 6.5 years ago by mmitra ▴ 60

0

Entering edit mode

Thanks again for adding the new code that also trims the quality score as well. There is a slight issue with the code. It seems there is a mismatch between the length of read and quality score after trimming. The quality score has an extra character at the end. But, it could be fixed by modify the code slightly. Thanks.

ADD REPLY • link 6.4 years ago by mmitra ▴ 60

score 2 · Answer 2 · 2014-02-17

2

Entering edit mode

10.2 years ago

Michael 54k

I think I'd use the following perl regex:

s/^(.*A{10,})[^A].*$/$1/;

To trimm off anything following after a polyA, at least one non A required, with minimum polyA length 10, or what you think is correct. Example:

ATGCGCGTGTAAAAAAAAAAAGT      
  =>
ATGCGCGTGTAAAAAAAAAAA

A polyA tail has been run through should appear in full length, so there is no reason to mess with shorter tails immediately at 3' end.

Then run Prinseq on the "canonical" polyAs.

Downside (of all regexp solutions): cannot handle sequencing errors inside polyA.

ADD COMMENT • link 10.2 years ago by Michael 54k

1

Entering edit mode

Here's what your suggestion might look like:

$ echo -e ">1\nACGAAAAAAGTAGGGGAAAAAAAAAAC" |\
perl -pe 's/^(.*A{5,})[^A].*$/$1/;' |\
perl prinseq-lite.pl -fasta stdin -trim_tail_right 8 -min_len 3 -out_format 1 -out_good stdout -out_bad null
>1
ACGAAAAAAGTAGGGG

This allows for a small stretch of As within the read. Though it's not a general solution (e.g., I don't think it will parse multi-line records), it demonstrates the point. The sequence parsing could easily be modified, and the trimming commands could be specified as a specific length or a percentage to fine-tune the trimming.

ADD REPLY • link 10.2 years ago by SES 8.6k

score 1 · Answer 3 · 2014-02-17

What is the quality of that polyA tail and remaining sequence 5' to the polyA? We work with aDNA (like you has small inserts) and we find that when the target insert and adapter has been sequenced we get the same polyA and random sequence left behind. Im not 100% on the reasoning behind it, but its certainly something we see all the time

score 0 · Answer 4 · 2014-02-17

0

Entering edit mode

10.2 years ago

mikhail.shugay 3.5k

How about trimming up to N (e.g. 10) last nucleotides, if they're not A and then running software like PRINSEQ? You can adjust the parameter N to get a good percentage of poly-A-trimmed reads on output.

ADD COMMENT • link 10.2 years ago by mikhail.shugay 3.5k

0

Entering edit mode

Imo this is valid only if all reads contain a polyA sequence.

ADD REPLY • link 10.2 years ago by Michael 54k

0

Entering edit mode

Well aren't those reads with polyA supposed to be selected by library prep. If yes than it could be wise to remove all non-polyA-containing sequences. Of course a more deep look into data, protocol and study objectives is required to make such decision...

ADD REPLY • link 10.2 years ago by mikhail.shugay 3.5k

0

Entering edit mode

Maybe..., only op knows their protocols if anyone ;)

ADD REPLY • link 10.2 years ago by Michael 54k

0

Entering edit mode

All the sequenced fragments originally contained poly-a tail, but these are illumina 100 bp reads, and fragments longer than 100 bp will not be sequenced into the poly-a tail.

ADD REPLY • link 10.2 years ago by jon.brate ▴ 310