I am trying to run the python package how_are_we_stranded_here on my university's supercomputer to use the check_strandedness function.
I was able to successfully install the program through the interface after using conda/mamba (since this works best with our university's interface) rather than pip. I don't think the issue is the installation or calling of the program, since I exported the path when submitting the job script and the error does not include issues related to that. The output and error file I am getting are:
Results stored in: stranded_test_190007-1_R1_001 converting gtf to bed Checking if fasta headers and bed file transcript_ids match... Can't find transcript ids from /results/RefGenome/genome.fa in stranded_test_190007-1_R1_001/Bos_taurus.ARS-UCD1.2.109.bed Trying to converting fasta header format to match transcript ids to the BED file...
Can't find any of the first 10 BED transcript_ids in fasta file... Check that these match
I first tried to run check_strandedness with the Bos Taurus genome gtf and Bos Taurus transcriptome (cDNA sequence) which also failed. I assumed it is because my input was a genome gtf while my reference was a transcriptome, so I switched both to the genome for the second run which also failed. There is no transcriptome (cDNA) gtf file that I could find on ensembl, and I am planning to align to the genome first. I have already aligned successfully, but want to check the strandedness to ensure the arguments I used for some of the programs are correct.
Has anyone else run into this issue and knows how to solve it? Do I have to use a transcriptome gtf and transcriptome cDNA fasta for this to work? If so, how would I obtain/make my own transcriptome gtf since one is not available?
Thank you so much in advance to anyone who can help with this!
Here is my script that I submitted as a SLURM job:
cd /RNAseqBlood2022/results/Check_strandedness/ export PATH=/conda/envs/hfrl/bin/:$PATH check_strandedness --gtf /RNAseqBlood2022/results/GTF/Bos_taurus.ARS-UCD1.2.109.gtf --transcripts /RNAseqBlood2022/results/RefGenome/genome.fa --reads_1 /RNAseqBlood2022/data/190007-1_R1_001.fastq.gz --reads_2 /RNAseqBlood2022/data/190007-1_R2_001.fastq.gz
Here are the two files you need from Ensembl:
Bos taurus GTF file: https://ftp.ensembl.org/pub/release-109/gtf/bos_taurus/Bos_taurus.ARS-UCD1.2.109.gtf.gz
Bos taurus cDNA file: https://ftp.ensembl.org/pub/release-109/fasta/bos_taurus/cdna/Bos_taurus.ARS-UCD1.2.cdna.all.fa.gz
Names in these two files match so use these.
GTF file will contain the name of this transcript
You can also try https://github.com/NBISweden/GUESSmyLT
Thank you so much, this is extremely helpful information! I will give these a try :)
Update for anyone trying to check library type computationally and having difficulties with some python based packages:
After attempting to get check_strandedness and GUESSmyLT program packages to work repeatedly with no luck, Salmon actually worked well for this purpose.
The way our university integrates python is interesting, so python packages downloaded need to be done so with conda/mamba and a few other additional steps. This is a headache, and I was unable to get these programs to work properly in my situation.
--libType Aoption (and any other options preferred), an estimate of the library type was in some of the output files.
The sequencing company used for our data only informed us of the library construction protocol, which if unmodified, would result in stranded data. However, they did not mention that some adjustments were made, and my data was actually unstranded (glad I checked!).