I'm trying to create a piRNA count table from samples enriched for small RNAs (~31 bases) using the piRBase reference. Despite all my reading I haven't found a simple way to do that.
At first I tried to quantify piRNAs with the same strategy as for miRNAs, aligning trimmed reads with Bowtie to piRBase (instead of miRBase), but I underestimated the multimapping problem: more than 99% of the aligned reads multimap the reference, even when requiring 0 mismatches (bowtie -v 0). Indeed, the Drosophila piRNA reference contains >41 million sequences.
Among all the perfect matches, I would be interested in prioritizing the reference sequences that have no additional bases, i.e. those identical to the read rather than longer sequences merely containing it.
For example, the read TGCTTGGACTACATATGGTTGAGGGTTGTA perfectly matches many piRNAs in piRBase:
The other references/databases have fewer piRNAs, but they still contain sequences that are subsequences of others, so switching databases doesn't solve the multimapping problem when using aligners like Bowtie. Most papers/pipelines I have seen map to the genome, which is even worse in terms of multimapping, and I'm not trying to discover new piRNAs. I ended up writing a custom script to quantify piRBase piRNAs (in my case >95% of the FASTQ sequences found a perfect match).
In case someone is interested, here is how I did it (requires at least 10 GB of RAM); a minimal sketch follows the steps below:
Parse the piRBase FASTA to create a map <sequence> -> <sequence name> (e.g. with a dict in Python or a named list in R)
Parse the sample FASTQ:
if you have UMIs to account for PCR bias, create a map <sequence> -> set of seen UMIs: the count will then be the number of unique UMIs
Look each read sequence up in the piRBase map and generate the piRNA count table.
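A minimal Python sketch of these steps, assuming plain 4-line FASTQ input; the file names are placeholders, and the dict over ~42 million piRBase records is what drives the 10 GB+ memory footprint:

import gzip
from collections import defaultdict

def read_fasta(path):
    # yield (name, sequence) records from a possibly gzipped FASTA file
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        name, chunks = None, []
        for line in fh:
            line = line.rstrip()
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                name, chunks = line[1:].split()[0], []
            else:
                chunks.append(line)
        if name is not None:
            yield name, "".join(chunks)

def read_fastq_seqs(path):
    # yield the sequence line of each record in a 4-line FASTQ file
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        for i, line in enumerate(fh):
            if i % 4 == 1:
                yield line.rstrip()

# 1. map each piRBase sequence to its piRNA name (the memory-hungry step);
#    if several records share a sequence, the last name wins
seq2name = {seq: name for name, seq in read_fasta("pirbase_dme.fa.gz")}

# 2. count reads that are exact, full-length piRBase sequences;
#    with UMIs, store a set of UMIs per name instead and report len(set)
counts = defaultdict(int)
for seq in read_fastq_seqs("sample_trimmed.fastq.gz"):
    name = seq2name.get(seq)
    if name is not None:
        counts[name] += 1

# 3. write the piRNA count table
with open("pirna_counts.tsv", "w") as out:
    for name, n in sorted(counts.items()):
        out.write(f"{name}\t{n}\n")

Because the lookup is an exact full-length match, a read is only ever assigned to references with no additional bases, which is the prioritization asked about in the question.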
## extract the piRNAs
# big.fa.gz is the piRBase reference (e.g. 42 million records)
# small.fa.gz is the query (the reduced set of sample reads)
$ seqkit grep -s -f <(seqkit seq -s small.fa.gz) big.fa.gz > out.fa
## custom script
# extract the read count for each sequence from small.fa.gz
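The custom script isn't shown above; a minimal Python version could look like this, reusing the read_fasta/read_fastq_seqs helpers from the earlier sketch. For simplicity it takes the counts directly from the original sample FASTQ (a placeholder name) rather than from small.fa.gz headers:

from collections import Counter

# read count per distinct sequence in the sample
seq_counts = Counter(read_fastq_seqs("sample.fastq.gz"))

# out.fa holds the piRBase records whose sequence matched a query read;
# join them back to the read counts to build the piRNA count table
with open("pirna_counts.tsv", "w") as out:
    for name, seq in read_fasta("out.fa"):
        n = seq_counts.get(seq, 0)
        if n:
            out.write(f"{name}\t{n}\n")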
piRBase contains too many sequences (~42 million records for dme, v2.0), and we always have 10+ million reads per small RNA sample; searching a 42-million-record reference with a 10+ million-read query is not a good idea.
So I tried to reduce the query size as much as possible, as shown below.
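For completeness, the query can be collapsed to unique sequences before running seqkit, e.g. with a few lines of Python reusing seq_counts from the snippet above (the readN_xCOUNT header scheme is just one common collapsed-FASTA convention, nothing piRBase requires):

# write each distinct sequence once, keeping its read count in the header
with open("small.fa", "w") as out:
    for i, (seq, n) in enumerate(seq_counts.most_common(), start=1):
        out.write(f">read{i}_x{n}\n{seq}\n")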