Question: Trim Poly-A Tail And Trailing Nucleotides
2
gravatar for jon.brate
4.8 years ago by
jon.brate190
Norway
jon.brate190 wrote:

I have smallRNA seq data that were produced by poly-adenylating the 3' end prior to PCR amplification. For this reason many of the reads contain the poly-a stretches. But since the sequenced fragments were so short, sometimes the tail is sequenced through and there are nucleotides on the 3' end of the poly-a tail. Like this:

TACTGAATGGCAGTGATGATAAAAAAAAAAAAAAAAAAAATTCCGCC

I have used PRINSEQ to remove the poly-a tail, but it seems that it only remove tails at the very end of the sequence. Anyone know how i can remove tails including the following nucleotides like the read above?

Thanks,

Jon

trimming rna-seq • 5.4k views
ADD COMMENTlink modified 4.8 years ago by gammyknee170 • written 4.8 years ago by jon.brate190
2
gravatar for Pierre Lindenbaum
4.8 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum115k wrote:

using awk ?:

gunzip -c fastq.gz | awk 'NR%4==2 {gsub(/AAAAA.*$/,"");} {print}'

edit: 2017:

 awk 'NR%4==2 {gsub(/AAAAA.*$/,"");N=length($0);} NR%4==0 {$0=substr($0,length($0)-N);} {print}'
ADD COMMENTlink modified 13 months ago • written 4.8 years ago by Pierre Lindenbaum115k
1

I think this is a bit aggressive, now you can't have 5A's anywhere inside the reads.

ADD REPLYlink written 4.8 years ago by Michael Dondrup45k

Thanks! It seems to work fine. I don't fully understand this, but reads like this is not trimmed: ACGAGTAGGGGAAAAAAAAAAC. Can you explain why? Btw, do you think 10 A's is too short for a poly-a stretch?

ADD REPLYlink written 4.8 years ago by jon.brate190

echo -e "1\nACGAGTAGGGGAAAAAAAAAAC\n3\n4" | awk 'NR%4==2 {gsub(/AAAAA.*$/,"");} {print}' returns ACGAGTAGGGG here

ADD REPLYlink written 4.8 years ago by Pierre Lindenbaum115k

This line gives me the same result as yours. But not when I run the command on my file of reads. Maybe I should have mentioned that I have a fasta file. It could be something with the fasta headers? They are like this: >HWI-ST486:386:D1UMHACXX:3:1101:2632:2144 1:N:0:TGACCA

But the command works fine for the read I showed in the question...

ADD REPLYlink modified 4.8 years ago • written 4.8 years ago by jon.brate190

Can you tell me how to delete the correspondig quality value of the trimmed poliA from the fastq?

THANKS

ADD REPLYlink written 3.8 years ago by damiano.bassani10

Thanks for suggesting this awk command. It works perfectly. However, only the reads are trimmed and not the quality scores. Because of this the trimmed fastq file cannot be used for tophat, for instance. Can you please suggest a command to trim quality scores to match the reads trimmed using your awk command? I appreciate your help.

ADD REPLYlink written 13 months ago by mmitra30
1

I've updated, can you please try the new command ?

ADD REPLYlink written 13 months ago by Pierre Lindenbaum115k

Thanks so much! It works!

ADD REPLYlink written 13 months ago by mmitra30

Thanks again for adding the new code that also trims the quality score as well. There is a slight issue with the code. It seems there is a mismatch between the length of read and quality score after trimming. The quality score has an extra character at the end. But, it could be fixed by modify the code slightly. Thanks.

ADD REPLYlink modified 13 months ago • written 13 months ago by mmitra30
2
gravatar for Michael Dondrup
4.8 years ago by
Bergen, Norway
Michael Dondrup45k wrote:

I think I'd use the following perl regex:

s/^(.*A{10,})[^A].*$/$1/;

To trimm off anything following after a polyA, at least one non A required, with minimum polyA length 10, or what you think is correct. Example:

ATGCGCGTGTAAAAAAAAAAAGT      
  =>
ATGCGCGTGTAAAAAAAAAAA

A polyA tail has been run through should appear in full length, so there is no reason to mess with shorter tails immediately at 3' end.

Then run Prinseq on the "canonical" polyAs.

Downside (of all regexp solutions): cannot handle sequencing errors inside polyA.

ADD COMMENTlink modified 4.8 years ago • written 4.8 years ago by Michael Dondrup45k
1

Here's what your suggestion might look like:

$ echo -e ">1\nACGAAAAAAGTAGGGGAAAAAAAAAAC" |\
perl -pe 's/^(.*A{5,})[^A].*$/$1/;' |\
perl prinseq-lite.pl -fasta stdin -trim_tail_right 8 -min_len 3 -out_format 1 -out_good stdout -out_bad null
>1
ACGAAAAAAGTAGGGG

This allows for a small stretch of As within the read. Though it's not a general solution (e.g., I don't think it will parse multi-line records), it demonstrates the point. The sequence parsing could easily be modified, and the trimming commands could be specified as a specific length or a percentage to fine-tune the trimming.

ADD REPLYlink written 4.8 years ago by SES8.1k
1
gravatar for gammyknee
4.8 years ago by
gammyknee170
Australia
gammyknee170 wrote:

What is the quality of that polyA tail and remaining sequence 5' to the polyA? We work with aDNA (like you has small inserts) and we find that when the target insert and adapter has been sequenced we get the same polyA and random sequence left behind. Im not 100% on the reasoning behind it, but its certainly something we see all the time

ADD COMMENTlink written 4.8 years ago by gammyknee170
0
gravatar for mikhail.shugay
4.8 years ago by
mikhail.shugay3.3k
Czech Republic, Brno, CEITEC
mikhail.shugay3.3k wrote:

How about trimming up to N (e.g. 10) last nucleotides, if they're not A and then running software like PRINSEQ? You can adjust the parameter N to get a good percentage of poly-A-trimmed reads on output.

ADD COMMENTlink written 4.8 years ago by mikhail.shugay3.3k

Imo this is valid only if all reads contain a polyA sequence.

ADD REPLYlink written 4.8 years ago by Michael Dondrup45k

Well aren't those reads with polyA supposed to be selected by library prep. If yes than it could be wise to remove all non-polyA-containing sequences. Of course a more deep look into data, protocol and study objectives is required to make such decision...

ADD REPLYlink written 4.8 years ago by mikhail.shugay3.3k

Maybe..., only op knows their protocols if anyone ;)

ADD REPLYlink modified 4.8 years ago • written 4.8 years ago by Michael Dondrup45k

All the sequenced fragments originally contained poly-a tail, but these are illumina 100 bp reads, and fragments longer than 100 bp will not be sequenced into the poly-a tail.

ADD REPLYlink written 4.8 years ago by jon.brate190
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1166 users visited in the last hour