Hello,
I currently creating a cDNA library.
We might have to switch from a an in-fusion to an restriction approach, where we would cut the cDNA with SfiI. My supervisor said that the cutting site for SfiI is rare in human genes and was wondering if I could test how many genes are lost with SfiI cutting.
So I did the following:
I obtained Homo_sapiens.GRCh38.111.gff3
and Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa
from the ensemble ftp server. I created the motif for SfiI based on https://www.neb.com/en/products/r0123-sfii .
$ cat sfiimotif.meme
MEME version 4
ALPHABET= ACGT
strands: + -
Background letter frequencies (from uniform background):
A 0.25000 C 0.25000 G 0.25000 T 0.25000
MOTIF SfiI SfiI
letter-probability matrix: alength= 4 w= 13 nsites= 20 E= 0
0.000000 0.000000 1.000000 0.000000
0.000000 0.000000 1.000000 0.000000
0.000000 1.000000 0.000000 0.000000
0.000000 1.000000 0.000000 0.000000
0.250000 0.250000 0.250000 0.250000
0.250000 0.250000 0.250000 0.250000
0.250000 0.250000 0.250000 0.250000
0.250000 0.250000 0.250000 0.250000
0.250000 0.250000 0.250000 0.250000
0.000000 0.000000 1.000000 0.000000
0.000000 0.000000 1.000000 0.000000
0.000000 1.000000 0.000000 0.000000
0.000000 1.000000 0.000000 0.000000
grep "ID=gene" genome.gff3 | grep "biotype=protein_coding" > genome_genesonly.gff3
sed -i.old '1s;^;##seqid\tsource\ttype\tstart\tend\tscore\tstrand\tphase\tattributes\n;' genome_genesonly.gff3
python gff3_to_genes.py -i genome_genesonly.gff3 --output-geneidlist geneids_genome --output-genelist /tmp/genes_genome -q
agat_sp_filter_feature_from_keep_list.pl --gff ../../reference_dbs/human/Homo_sapiens.GRCh38.111.gff3 --keep_list geneids_genome -a gene_id -o genome_trafo.gff3
agat_sp_extract_sequences.pl -g genome_trafo.gff3 -t gene -f Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa --output genome.fa
sed "s/gene://" genome.fa > genome_fixedheader.fa
This way, I wanted to obtain the sequences of protein coding genes.
fimo --verbosity 2 --bgfile --nrdb-- --thresh 5.0E-5 --max-stored-scores 1000000000 --norc --oc outfimo sfiimotif.meme genome_fixedheader.fa
I get a lot of hits with that (~400000).
I also tried running with a proper --bgfile
instead of the --nrdb
, with a moderate difference.
I then obtained the number of genes with the SfiI cleavage site using
$ cat analyse_sites.py
import pandas as pd
sfi_results = pd.read_csv("outfimo/fimo.tsv", sep="\t")
print(len(sfi_results["sequence_name"].unique()))
$ python analyse_sites.py
$ 18874
Originally, I had ~20000 genes to begin with:
wc -l geneids_genome
20073 geneids_genome
So that doesn't really make sense as the result implies that 95% of the genes would be cut. But I did check some of the sites and they are actually in the sequence, so fimo seems to work properly. I just wonder whether my sequence is correct but for some examples it did fit the ensemble browser output.
I know that people have discussed that the output must be filtered due to a multiple testing problem. I tried, with --qv-thresh
and --thresh 0.1
. That doesn't give me even a single hit.
Also, I don't quite know why statistics are done on the result anyway. Are we accounting for uncertainties in sequencing or something? As from understanding, either a site matches a motif or it doesn't. I don't get why we score matches with p-values in this case.
I'm a bit at a loss, as neither result makes sense. From what my advisor told me, I expected <5% of the protein-coding genes to contain SfiI sites. Does it make sense to look at CDS/exons only? Is my approach sound (to find genes that will be cleaved) or are there any obvious flaws?
To add another wrinkle, NEB is listing this enzyme as one that requires binding at more than one site for optimal cleavage. So a simplistic estimation like the one you are attempting may not translate in practice. You may actually need to do this experiment and do some titrations.