Check Strandedness
0
0
Entering edit mode
11 months ago
Y • 0

I need to figure out the strandedness for the -s flag for regtools junctions extract used for Leafcutter. I get a peculiar error when using how_are_we_stranded_here.

Command Run:

check_strandedness --gtf path/to/Danio_rerio.GRCz11.110.chr.gtf --transcripts /path/to/Danio_rerio.GRCz11.dna_sm.primary_assembly.fa --reads_1 Sample_1_R1.fq.gz --reads_2 Sample_1_R2.fq.gz

Gives the output:

Results stored in: stranded_test_Sample_1_R1
converting gtf to bed
Checking if fasta headers and bed file transcript_ids match...
Can't find transcript ids from /path/to/Danio_rerio.GRCz11.dna_sm.primary_assembly.fa in stranded_test_Sample_1_R1/Danio_rerio.GRCz.chr.bed
Trying to converting fasta header format to match transcript ids to the BED file...
Can't find any of the first 10 BED transcript_ids in fasta file... Check that these match

Why would this occur when the.bed file clearly has rows?

Check_Strandedness how_are_we_stranded_here leafcutter bash • 1.8k views
ADD COMMENT
1
Entering edit mode

Error is about a mismatch in ID's that are in fasta headers and your BED (column 1) file.

ADD REPLY
1
Entering edit mode

It looks like you might be using a genomic fasta instead of a transciptome fasta?

ADD REPLY
0
Entering edit mode

What is the difference? How would I find a transcriptome fasta?

ADD REPLY
2
Entering edit mode

Genomic fasta contains entire genome sequence. Transcriptome file contains just expressed sequences. You will find the Ensembl version of the Danio rerio transcriptome file here: https://ftp.ensembl.org/pub/release-110/fasta/danio_rerio/cdna/Danio_rerio.GRCz11.cdna.all.fa.gz

This line is a space hog.

ADD REPLY
0
Entering edit mode

Is there something wrong with the check_strandedness or the line I fed in:

Results stored in: stranded_test_Control_Sample_1_R1
converting gtf to bed
using /path/to/Danio_Rerio_Transcriptome_Kallisto.idx as kallisto index
creating fastq files with first 200000 reads
quantifying with kallisto

[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 30
[index] number of targets: 60,000
[index] number of k-mers: 69,000,000
[index] number of equivalence classes: 200,000
Warning: 8000 transcripts were defined in GTF file, but not in the index
[quant] running in paired-end mode
[quant] will process pair 1: stranded_test_Control_Sample_1_R1/Control_Sample_1_R1.fq
                             Control_Sample_1_R2_sample.fq
[quant] finding pseudoalignments for the reads ... done
[quant] processed 200,001 reads, 100,000 reads pseudoaligned
[quant] estimated average fragment length: 199.111
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 777 rounds
[  bam] writing pseudoalignments to BAM format .. done
[  bam] sorting BAM files .. done
[  bam] indexing BAM file .. done

checking strandedness
Reading reference gene model stranded_test_Control_Sample_1_R1/Danio_rerio.GRCz11.110.chr.bed ... Done
Loading SAM/BAM file ...  Finished
Total 200000 usable reads were sampled
Traceback (most recent call last):
  File "/path/to/check_strandedness", line 8, in <module>
    sys.exit(main())
  File "/path/to/Python3.10/check_strandedness.py", line 185, in main
    result = pd.read_csv(test_folder + '/' + 'strandedness_check.txt', sep="\n", header=None)
  File "/path/to/pandas/io/parsers/readers.py", line 899, in read_csv
    kwds_defaults = _refine_defaults_read(
  File "/path/to/pandas/io/parsers/readers.py", line 1971, in _refine_defaults_read
    raise ValueError(
ValueError: Specified \n as separator or delimiter. This forces the python engine which does not accept a line terminator. Hence it is not allowed to use the line terminator as separator.

The command I used:

check_strandedness --gtf /path/to/Danio_rerio.GRCz11.110.chr.gtf -fa /path/to/Danio_rerio.GRCz11.cdna.all.fa -r1 Control_Sample_1_R1.fq.gz -r2 Control_Sample_1_R2.fq.gz -k /path/to/Danio_Rerio_Transcriptome_Kallisto.idx

The Callisto index was created using:

kallisto index -i Danio_Rerio_Transcriptome_Kallisto.idx Danio_rerio.GRCz11.cdna.all.fa

What am I doing incorrectly?

ADD REPLY
0
Entering edit mode

Are you shuffling files between windows/mac and Unix. It sounds like the problem may be with file endings and can be fixed by dos2unix type utils.

ADD REPLY
0
Entering edit mode

No I am on a linux HPC. I unzipped the fasta using gunzip is that the issue?

ADD REPLY
0
Entering edit mode

I saw a reply on Biostars here that said that kb ref can be used to create a fasta but then when I need to kallisto index it using

kallisto index -i Danio_Rerio_Transcriptome_Kallisto.idx transcriptome_built_using_kb_python_Danio_rerio.fa

I get the following weird error

[build] loading fasta file transcriptome_built_using_kb_python_Danio_rerio.fa
[build] k-mer length: 31
[build] warning: clipped off poly-A tail (longer than 10)
        from 380 target sequences
[build] warning: replaced 13167 non-ACGUT characters in the input sequence
        with pseudorandom nucleotides
[build] counting k-mers ... Aborted (core dumped)
ADD REPLY
0
Entering edit mode

I don't use kallisto so can't say what could be wrong.

You may want to try salmon and let it find out the strandedness as an alternative. See: https://salmon.readthedocs.io/en/latest/salmon.html#what-s-this-libtype

ADD REPLY
0
Entering edit mode

Thank you it seems easier just to ask the company for what kit they used to know the strandedness at this point.

ADD REPLY
0
Entering edit mode

checkstrand tool (from BBMap suite) question

Another strand-checking tool if you still need one.

ADD REPLY

Login before adding your answer.

Traffic: 3098 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6