Question

Check Strandedness

0

Entering edit mode

8 months ago

Y • 0

I need to figure out the strandedness for the -s flag for regtools junctions extract used for Leafcutter. I get a peculiar error when using how_are_we_stranded_here.

Command Run:

check_strandedness --gtf path/to/Danio_rerio.GRCz11.110.chr.gtf --transcripts /path/to/Danio_rerio.GRCz11.dna_sm.primary_assembly.fa --reads_1 Sample_1_R1.fq.gz --reads_2 Sample_1_R2.fq.gz

Gives the output:

Results stored in: stranded_test_Sample_1_R1
converting gtf to bed
Checking if fasta headers and bed file transcript_ids match...
Can't find transcript ids from /path/to/Danio_rerio.GRCz11.dna_sm.primary_assembly.fa in stranded_test_Sample_1_R1/Danio_rerio.GRCz.chr.bed
Trying to converting fasta header format to match transcript ids to the BED file...
Can't find any of the first 10 BED transcript_ids in fasta file... Check that these match

Why would this occur when the.bed file clearly has rows?

Check_Strandedness how_are_we_stranded_here leafcutter bash • 1.6k views

ADD COMMENT • link updated 6 months ago by Brian Bushnell 20k • written 8 months ago by Y • 0

1

Entering edit mode

Error is about a mismatch in ID's that are in fasta headers and your BED (column 1) file.

ADD REPLY • link 8 months ago by GenoMax 141k

1

Entering edit mode

It looks like you might be using a genomic fasta instead of a transciptome fasta?

ADD REPLY • link 8 months ago by rfran010 ▴ 900

0

Entering edit mode

What is the difference? How would I find a transcriptome fasta?

ADD REPLY • link 8 months ago by Y • 0

2

Entering edit mode

Genomic fasta contains entire genome sequence. Transcriptome file contains just expressed sequences. You will find the Ensembl version of the Danio rerio transcriptome file here: https://ftp.ensembl.org/pub/release-110/fasta/danio_rerio/cdna/Danio_rerio.GRCz11.cdna.all.fa.gz

This line is a space hog.

ADD REPLY • link 8 months ago by GenoMax 141k

0

Entering edit mode

Is there something wrong with the check_strandedness or the line I fed in:

Results stored in: stranded_test_Control_Sample_1_R1
converting gtf to bed
using /path/to/Danio_Rerio_Transcriptome_Kallisto.idx as kallisto index
creating fastq files with first 200000 reads
quantifying with kallisto

[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 30
[index] number of targets: 60,000
[index] number of k-mers: 69,000,000
[index] number of equivalence classes: 200,000
Warning: 8000 transcripts were defined in GTF file, but not in the index
[quant] running in paired-end mode
[quant] will process pair 1: stranded_test_Control_Sample_1_R1/Control_Sample_1_R1.fq
                             Control_Sample_1_R2_sample.fq
[quant] finding pseudoalignments for the reads ... done
[quant] processed 200,001 reads, 100,000 reads pseudoaligned
[quant] estimated average fragment length: 199.111
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 777 rounds
[  bam] writing pseudoalignments to BAM format .. done
[  bam] sorting BAM files .. done
[  bam] indexing BAM file .. done

checking strandedness
Reading reference gene model stranded_test_Control_Sample_1_R1/Danio_rerio.GRCz11.110.chr.bed ... Done
Loading SAM/BAM file ...  Finished
Total 200000 usable reads were sampled
Traceback (most recent call last):
  File "/path/to/check_strandedness", line 8, in <module>
    sys.exit(main())
  File "/path/to/Python3.10/check_strandedness.py", line 185, in main
    result = pd.read_csv(test_folder + '/' + 'strandedness_check.txt', sep="\n", header=None)
  File "/path/to/pandas/io/parsers/readers.py", line 899, in read_csv
    kwds_defaults = _refine_defaults_read(
  File "/path/to/pandas/io/parsers/readers.py", line 1971, in _refine_defaults_read
    raise ValueError(
ValueError: Specified \n as separator or delimiter. This forces the python engine which does not accept a line terminator. Hence it is not allowed to use the line terminator as separator.

The command I used:

check_strandedness --gtf /path/to/Danio_rerio.GRCz11.110.chr.gtf -fa /path/to/Danio_rerio.GRCz11.cdna.all.fa -r1 Control_Sample_1_R1.fq.gz -r2 Control_Sample_1_R2.fq.gz -k /path/to/Danio_Rerio_Transcriptome_Kallisto.idx

The Callisto index was created using:

kallisto index -i Danio_Rerio_Transcriptome_Kallisto.idx Danio_rerio.GRCz11.cdna.all.fa

What am I doing incorrectly?

ADD REPLY • link 8 months ago by Y • 0

0

Entering edit mode

Are you shuffling files between windows/mac and Unix. It sounds like the problem may be with file endings and can be fixed by dos2unix type utils.

ADD REPLY • link 8 months ago by GenoMax 141k

0

Entering edit mode

No I am on a linux HPC. I unzipped the fasta using gunzip is that the issue?

ADD REPLY • link 8 months ago by Y • 0

0

Entering edit mode

I saw a reply on Biostars here that said that kb ref can be used to create a fasta but then when I need to kallisto index it using

kallisto index -i Danio_Rerio_Transcriptome_Kallisto.idx transcriptome_built_using_kb_python_Danio_rerio.fa

I get the following weird error

[build] loading fasta file transcriptome_built_using_kb_python_Danio_rerio.fa
[build] k-mer length: 31
[build] warning: clipped off poly-A tail (longer than 10)
        from 380 target sequences
[build] warning: replaced 13167 non-ACGUT characters in the input sequence
        with pseudorandom nucleotides
[build] counting k-mers ... Aborted (core dumped)

ADD REPLY • link 8 months ago by Y • 0

0

Entering edit mode

I don't use kallisto so can't say what could be wrong.

You may want to try salmon and let it find out the strandedness as an alternative. See: https://salmon.readthedocs.io/en/latest/salmon.html#what-s-this-libtype

ADD REPLY • link 8 months ago by GenoMax 141k

0

Entering edit mode

Thank you it seems easier just to ask the company for what kit they used to know the strandedness at this point.

ADD REPLY • link 8 months ago by Y • 0

0

Entering edit mode

checkstrand tool (from BBMap suite) question

Another strand-checking tool if you still need one.

ADD REPLY • link 6 months ago by Brian Bushnell 20k