I'm trying to create a piRNA count table from samples enriched for small RNAs (~31 bases) using the piRBase reference. Despite all my reading I haven't found a simple way to do that.
At first I tried to quantify piRNAs with the same strategy as for miRNAs, aligning trimmed reads with Bowtie to piRBase (instead of miRBase), but I underestimated the multimapping problem: more than 99% of the aligned reads multimap the reference, even when requiring 0 mismatches (bowtie -v 0). Indeed, the Drosophila piRNA reference contains >41 million sequences.
Among all the perfect matches, I would be interested in prioritizing the reference sequences that have no additional bases, i.e. those identical to the read rather than longer sequences merely containing it.
For example, the read TGCTTGGACTACATATGGTTGAGGGTTGTA perfectly matches many piRNAs in piRBase:
The other references/databases have fewer piRNAs, but they still contain sequences that are subsequences of others, so switching databases doesn't solve the multimapping problem when using aligners like Bowtie. Most papers/pipelines I have seen map to the genome, which is even worse in terms of multimapping, and I'm not trying to discover new piRNAs. I ended up writing a custom script to quantify piRBase piRNAs (in my case >95% of the FASTQ sequences found a perfect match).
In case someone is interested, here is how I did it (requires at least 10 GB of RAM); a minimal sketch follows the steps below:
Parse the piRBase FASTA to create a map <sequence> -> <sequence name> (e.g. with a dict in Python or a named list in R)
Parse the sample FASTQ:
if you have UMIs to account for PCR bias, create a map <sequence> -> set of seen UMIs: the count will then be the number of unique UMIs
Look each read sequence up in the piRBase map and generate the piRNA count table.
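A minimal Python sketch of these steps, assuming plain 4-line FASTQ input; the file names are placeholders, and the dict over ~42 million piRBase records is what drives the 10 GB+ memory footprint:

import gzip
from collections import defaultdict

def read_fasta(path):
    # yield (name, sequence) records from a possibly gzipped FASTA file
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        name, chunks = None, []
        for line in fh:
            line = line.rstrip()
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                name, chunks = line[1:].split()[0], []
            else:
                chunks.append(line)
        if name is not None:
            yield name, "".join(chunks)

def read_fastq_seqs(path):
    # yield the sequence line of each record in a 4-line FASTQ file
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        for i, line in enumerate(fh):
            if i % 4 == 1:
                yield line.rstrip()

# 1. map each piRBase sequence to its piRNA name (the memory-hungry step);
#    if several records share a sequence, the last name wins
seq2name = {seq: name for name, seq in read_fasta("pirbase_dme.fa.gz")}

# 2. count reads that are exact, full-length piRBase sequences;
#    with UMIs, store a set of UMIs per name instead and report len(set)
counts = defaultdict(int)
for seq in read_fastq_seqs("sample_trimmed.fastq.gz"):
    name = seq2name.get(seq)
    if name is not None:
        counts[name] += 1

# 3. write the piRNA count table
with open("pirna_counts.tsv", "w") as out:
    for name, n in sorted(counts.items()):
        out.write(f"{name}\t{n}\n")

Because the lookup is an exact full-length match, a read is only ever assigned to references with no additional bases, which is the prioritization asked about in the question.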
## extract the piRNAs
# big.fa.gz is the piRBase reference (e.g. 42 million records)
# small.fa.gz is the query (the reduced set of sample reads)
$ seqkit grep -s -f <(seqkit seq -s small.fa.gz) big.fa.gz > out.fa
## custom script
# extract the read count for each sequence from small.fa.gz
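The custom script isn't shown above; a minimal Python version could look like this, reusing the read_fasta/read_fastq_seqs helpers from the earlier sketch. For simplicity it takes the counts directly from the original sample FASTQ (a placeholder name) rather than from small.fa.gz headers:

from collections import Counter

# read count per distinct sequence in the sample
seq_counts = Counter(read_fastq_seqs("sample.fastq.gz"))

# out.fa holds the piRBase records whose sequence matched a query read;
# join them back to the read counts to build the piRNA count table
with open("pirna_counts.tsv", "w") as out:
    for name, seq in read_fasta("out.fa"):
        n = seq_counts.get(seq, 0)
        if n:
            out.write(f"{name}\t{n}\n")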
piRBase contains too many sequences (~42 million records for dme, v2.0), and we always have 10+ million reads per small RNA sample; searching a 42-million-record reference with a 10+ million-read query is not a good idea.
So I tried to reduce the query size as much as possible, as shown below.
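For completeness, the query can be collapsed to unique sequences before running seqkit, e.g. with a few lines of Python reusing seq_counts from the snippet above (the readN_xCOUNT header scheme is just one common collapsed-FASTA convention, nothing piRBase requires):

# write each distinct sequence once, keeping its read count in the header
with open("small.fa", "w") as out:
    for i, (seq, n) in enumerate(seq_counts.most_common(), start=1):
        out.write(f">read{i}_x{n}\n{seq}\n")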