Question

Trouble with transcripts_id when using how_are_we_stranded_here python package

0

Entering edit mode

13 months ago

ExtentHonest56 ▴ 10

I am trying to run the python package how_are_we_stranded_here on my university's supercomputer to use the check_strandedness function.

https://github.com/signalbash/how_are_we_stranded_here#readme

I was able to successfully install the program through the interface after using conda/mamba (since this works best with our university's interface) rather than pip. I don't think the issue is the installation or calling of the program, since I exported the path when submitting the job script and the error does not include issues related to that. The output and error file I am getting are:

Output file:

Results stored in: stranded_test_190007-1_R1_001
converting gtf to bed
Checking if fasta headers and bed file transcript_ids match...
Can't find transcript ids from /results/RefGenome/genome.fa in stranded_test_190007-1_R1_001/Bos_taurus.ARS-UCD1.2.109.bed
Trying to converting fasta header format to match transcript ids to the BED file...

Error file:

Can't find any of the first 10 BED transcript_ids in fasta file... Check that these match

I first tried to run check_strandedness with the Bos Taurus genome gtf and Bos Taurus transcriptome (cDNA sequence) which also failed. I assumed it is because my input was a genome gtf while my reference was a transcriptome, so I switched both to the genome for the second run which also failed. There is no transcriptome (cDNA) gtf file that I could find on ensembl, and I am planning to align to the genome first. I have already aligned successfully, but want to check the strandedness to ensure the arguments I used for some of the programs are correct.

Has anyone else run into this issue and knows how to solve it? Do I have to use a transcriptome gtf and transcriptome cDNA fasta for this to work? If so, how would I obtain/make my own transcriptome gtf since one is not available?

Thank you so much in advance to anyone who can help with this!

Here is my script that I submitted as a SLURM job:

cd /RNAseqBlood2022/results/Check_strandedness/

export PATH=/conda/envs/hfrl/bin/:$PATH

check_strandedness --gtf /RNAseqBlood2022/results/GTF/Bos_taurus.ARS-UCD1.2.109.gtf --transcripts /RNAseqBlood2022/results/RefGenome/genome.fa --reads_1 /RNAseqBlood2022/data/190007-1_R1_001.fastq.gz --reads_2 /RNAseqBlood2022/data/190007-1_R2_001.fastq.gz

bash sequencing RNA-seq • 1.1k views

ADD COMMENT • link 13 months ago by ExtentHonest56 ▴ 10

1

Entering edit mode

Here are the two files you need from Ensembl:

Bos taurus GTF file: https://ftp.ensembl.org/pub/release-109/gtf/bos_taurus/Bos_taurus.ARS-UCD1.2.109.gtf.gz

Bos taurus cDNA file: https://ftp.ensembl.org/pub/release-109/fasta/bos_taurus/cdna/Bos_taurus.ARS-UCD1.2.cdna.all.fa.gz

Names in these two files match so use these.

>ENSBTAT00000085088.1 cdna primary_assembly:ARS-UCD1.2:10:25460980:25482775:-1 gene:ENSBTAG00000052108.1 gene_biotype:TR_V_gene transcript_biotype:TR_V_gene
AAGGAGCAAGTATTTCAGTCTCCCACTGTGGTCTCTTTGGAGGGAGCTGTGGCAGAAAT

GTF file will contain the name of this transcript

$ zgrep ENSBTAT00000085088  Bos_taurus.ARS-UCD1.2.109.gtf.gz 
10      ensembl transcript      25460980        25482775        .       -       .       gene_id "ENSBTAG00000052108"; gene_version "1"; transcript_id "ENSBTAT00000085088"; transcript_version "1"; gene_source "ensembl"; gene_biotype "TR_V_gene"; transcript_source "ensembl"; transcript_biotype "TR_V_gene"; tag "Ensembl_canonical";

ADD REPLY • link 13 months ago by GenoMax 141k

1

Entering edit mode

You can also try https://github.com/NBISweden/GUESSmyLT

ADD REPLY • link 13 months ago by GenoMax 141k

0

Entering edit mode

Thank you so much, this is extremely helpful information! I will give these a try :)

ADD REPLY • link 13 months ago by ExtentHonest56 ▴ 10

0

Entering edit mode

Update for anyone trying to check library type computationally and having difficulties with some python based packages:

After attempting to get check_strandedness and GUESSmyLT program packages to work repeatedly with no luck, Salmon actually worked well for this purpose.

The way our university integrates python is interesting, so python packages downloaded need to be done so with conda/mamba and a few other additional steps. This is a headache, and I was unable to get these programs to work properly in my situation.

Using:

salmon quant --libType A -i <index> -1 <read1file> -2 <read2file> -o <outputdirectory>

Specifically including --libType A option (and any other options preferred), an estimate of the library type was in some of the output files.

The sequencing company used for our data only informed us of the library construction protocol, which if unmodified, would result in stranded data. However, they did not mention that some adjustments were made, and my data was actually unstranded (glad I checked!).

ADD REPLY • link 13 months ago by ExtentHonest56 ▴ 10