Trim Poly-A Tail And Trailing Nucleotides
4
4
Entering edit mode
7.6 years ago
jon.brate ▴ 250

I have smallRNA seq data that were produced by poly-adenylating the 3' end prior to PCR amplification. For this reason many of the reads contain the poly-a stretches. But since the sequenced fragments were so short, sometimes the tail is sequenced through and there are nucleotides on the 3' end of the poly-a tail. Like this:

TACTGAATGGCAGTGATGATAAAAAAAAAAAAAAAAAAAATTCCGCC

I have used PRINSEQ to remove the poly-a tail, but it seems that it only remove tails at the very end of the sequence. Anyone know how i can remove tails including the following nucleotides like the read above?

Thanks,

Jon

trimming rna-seq • 7.8k views
4
Entering edit mode
7.6 years ago

using awk ?:

gunzip -c fastq.gz | awk 'NR%4==2 {gsub(/AAAAA.*$/,"");} {print}'  edit: 2017:  awk 'NR%4==2 {gsub(/AAAAA.*$/,"");N=length($0);} NR%4==0 {$0=substr($0,length($0)-N);} {print}'

1
Entering edit mode

I think this is a bit aggressive, now you can't have 5A's anywhere inside the reads.

0
Entering edit mode

Thanks! It seems to work fine. I don't fully understand this, but reads like this is not trimmed: ACGAGTAGGGGAAAAAAAAAAC. Can you explain why? Btw, do you think 10 A's is too short for a poly-a stretch?

0
Entering edit mode

echo -e "1\nACGAGTAGGGGAAAAAAAAAAC\n3\n4" | awk 'NR%4==2 {gsub(/AAAAA.*$/,"");} {print}' returns ACGAGTAGGGG here ADD REPLY 0 Entering edit mode This line gives me the same result as yours. But not when I run the command on my file of reads. Maybe I should have mentioned that I have a fasta file. It could be something with the fasta headers? They are like this: >HWI-ST486:386:D1UMHACXX:3:1101:2632:2144 1:N:0:TGACCA But the command works fine for the read I showed in the question... ADD REPLY 0 Entering edit mode Can you tell me how to delete the correspondig quality value of the trimmed poliA from the fastq? THANKS ADD REPLY 0 Entering edit mode Thanks for suggesting this awk command. It works perfectly. However, only the reads are trimmed and not the quality scores. Because of this the trimmed fastq file cannot be used for tophat, for instance. Can you please suggest a command to trim quality scores to match the reads trimmed using your awk command? I appreciate your help. ADD REPLY 1 Entering edit mode I've updated, can you please try the new command ? ADD REPLY 0 Entering edit mode Thanks so much! It works! ADD REPLY 0 Entering edit mode Thanks again for adding the new code that also trims the quality score as well. There is a slight issue with the code. It seems there is a mismatch between the length of read and quality score after trimming. The quality score has an extra character at the end. But, it could be fixed by modify the code slightly. Thanks. ADD REPLY 2 Entering edit mode 7.6 years ago I think I'd use the following perl regex: s/^(.*A{10,})[^A].*$/$1/;  To trimm off anything following after a polyA, at least one non A required, with minimum polyA length 10, or what you think is correct. Example: ATGCGCGTGTAAAAAAAAAAAGT => ATGCGCGTGTAAAAAAAAAAA  A polyA tail has been run through should appear in full length, so there is no reason to mess with shorter tails immediately at 3' end. Then run Prinseq on the "canonical" polyAs. Downside (of all regexp solutions): cannot handle sequencing errors inside polyA. ADD COMMENT 1 Entering edit mode Here's what your suggestion might look like: $ echo -e ">1\nACGAAAAAAGTAGGGGAAAAAAAAAAC" |\
perl -pe 's/^(.*A{5,})[^A].*$/$1/;' |\
perl prinseq-lite.pl -fasta stdin -trim_tail_right 8 -min_len 3 -out_format 1 -out_good stdout -out_bad null
>1
ACGAAAAAAGTAGGGG


This allows for a small stretch of As within the read. Though it's not a general solution (e.g., I don't think it will parse multi-line records), it demonstrates the point. The sequence parsing could easily be modified, and the trimming commands could be specified as a specific length or a percentage to fine-tune the trimming.

1
Entering edit mode
7.6 years ago
gammyknee ▴ 210

What is the quality of that polyA tail and remaining sequence 5' to the polyA? We work with aDNA (like you has small inserts) and we find that when the target insert and adapter has been sequenced we get the same polyA and random sequence left behind. Im not 100% on the reasoning behind it, but its certainly something we see all the time

0
Entering edit mode
7.6 years ago

How about trimming up to N (e.g. 10) last nucleotides, if they're not A and then running software like PRINSEQ? You can adjust the parameter N to get a good percentage of poly-A-trimmed reads on output.

0
Entering edit mode

Imo this is valid only if all reads contain a polyA sequence.

0
Entering edit mode

Well aren't those reads with polyA supposed to be selected by library prep. If yes than it could be wise to remove all non-polyA-containing sequences. Of course a more deep look into data, protocol and study objectives is required to make such decision...

0
Entering edit mode

Maybe..., only op knows their protocols if anyone ;)

0
Entering edit mode

All the sequenced fragments originally contained poly-a tail, but these are illumina 100 bp reads, and fragments longer than 100 bp will not be sequenced into the poly-a tail.