Mapped reads not mapping to a real sequence?
0
0
Entering edit mode
14 months ago
txema.heredia ▴ 110

Hi,

I've been passed down some bam files from RNA-seq that I need to analyze. They were generated with pseudoalignments using Kallisto ( https://pachterlab.github.io/kallisto/about ) When I run htseq-count one specific bamfile crashes on a specific read:

Error occured when processing input (record #149548485 in file 103805-016-011.kallisto.pseudoalignments.bam):
Expected str, got NoneType   [Exception type: TypeError, raised in _HTSeq.pyx:60]

After delving deeper I realized that the next read (or that specific one if it is 0-based) has some very weird formatting, at least to me:

[149548484]   A00379:576:H7WK3DSX3:4:1101:5421:1016     77    *     0     0     *     *     0     0     GNTCTTTTAAAAAGAGATTAAACCGAAGGTGATTAAAAGACCTTGAAATCCATGACGCAGGGAGAATTGCGTCATTTAAAGCCTAGTTAACGCATTTACTAAACGCAGACGAAAATGGAAAGATTAATTGGGAGTGGTAGGATGAAACAAT     F#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
[149548485]   A00379:576:H7WK3DSX3:4:1101:5421:1016     141   *     0     0     *     *     0     0     CTTCCTACTTTTCAGGTTTAAATTTATCTTTTTTCTTCTAAAAGTATGTTTTTATCTTCTAATTTCCCTATCTTCTCTATTCTTTTCTTCGCCTTCCCGTACTTCTGTCTTCCAGTTTTACACTTCAAACTTCTATCTTCTCCAAATTGTT     FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,:FFFFFFFF:FFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFF,FFFFFFF,FFFFF:FFF:FF,FFF:FF,F,FFFFF,:FF
[149548486]   A00379:576:H7WK3DSX3:4:1101:13702:1016    67    *     0     0     151M  *     0     0     ANAAATCTAGGCTCCATCAACACTGAATTGCAAGATGTGCAGAGGATCATGGTGGCCAATATTGAAGAAGTGTTACAACGAGGAGAAGCACTCTCAGCATTGGATTCAAAGGCTAACAATTTGTCCAGTCTGTCCAAGAAATACCGCCAGG     F#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF:F:FFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF     ZW:f:0
[149548487]   A00379:576:H7WK3DSX3:4:1101:13702:1016    131   *     0     0     151M  *     0     0     TATTTTCAGGAAACTGAGCTCACAGAGATGTGTATTAGAATCCAAGTGGAACTTCTGCCTCTAAAGACCTTGCAAGAAAAGAGATGCCCTGAAAATGAAAGGTTGCACCTCATTTAATGAAGCTTAACCCTATGTAGAAAGTCTCTTTCGG     F:FFFFFFFFF:FFFFF,FFFFFFFFFFFFFFFFF:FFFFFFF:FFFFF:FFFFFFFFFFFFFFFFFFF:FFFFFFF:FFFFFFFFFFFFFF:FFFFFFFFFFFFFFFF,FFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF     ZW:f:0
[149548488]   A00379:576:H7WK3DSX3:4:1101:14913:1016    77    *     0     0     *     *     0     0     ANTATAACAAACCCTGAGAACCAAAATGAACGAAAATCTGTTCGCTTCATTCATTGCCCCCACAATCCTAGGCCTACCCGCCGCAGTACTGATCATTCTATTTCCCCCTCTATTGATCCCCACCTCCAAATATCTCATCAACAACCGACTA     F#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,::,F::,FF:F:FFFFF,FF:,,::FF:F,F:,,:FFFFFF:F:F::F:FFFFFF,
[149548489]   A00379:576:H7WK3DSX3:4:1101:14913:1016    141   *     0     0     *     *     0     0     GTTTATAGATAGTTGGGTGGTTGGTGTAAATGAGTGAGGCAGGAGTCCGAGGAGGAGGTTAGTTGTGGCAATAAAAATGATTAAGGATACTAGTATAAGAGATCAGGTTCGTCCTTTAGTGTTGTGTATGGTTATCATTTGTTTTGAGGTT     FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFF:F:FFFFFFFFFFFFFFFFFFF:FF:FFFFFFFFFFFFFFFFFFF::FFFF:F:FFFFFFFFFFFF::FFF,FFFFFFFFFFFFFFFFFFFF:FFFFFFFF,:F

The read pair at lines 149548486-149548487 are flagged as being mapped, but they aren't mapped to any known sequence (RNAME is *).

How can I prevent such error?

bam sam samtools kallisto • 1.7k views
ADD COMMENT
2
Entering edit mode

I cannot help with that error in particular (@dsull is involved in kallisto, he might help here) but more generally: kallisto BAM files have coordinates in transcriptome space afaik (as reads were quantified against a transcript fasta file) but HTseq uses a GTF for guidance, and that is probably in genome coordinate space. Are you aware of that? Don't you have the original kallisto quant output? It's a bit unusual to quantify the BAM files. A better way could be to revert BAM to fastq and just run kallisto again. It's quite fast and not memory-hungry.

ADD REPLY
0
Entering edit mode

Kallisto should do a smarter and more sophisticated job of counting genes than htseq-count, so this whole process seems backwards.

ADD REPLY
1
Entering edit mode

the point I think was that kallisto may be good at counting, but not so clear that is good at generating the correct BAM file. The pseudoBAM files are created from a kmer classification process, it is not so easy to create these correctly, especially when the various splicing, read strand, and clipping need to be accounted for.

I know that I have discovered several bugs in pseudoBAMs in the past.

Like ATpoint I would not recommend processing pseudoBAM files, these should be used for visualization. but if one needs actual counts they should be using the counting functionality rather than counting the BAM with HTSEQ

ADD REPLY
0
Entering edit mode

Thanks, I wasn't aware of that.

However, when checking the bam file header, I see it references to chromosomes 1-22 + X/Y/MT. And when checking the very last read at chromosome 1 it is mapped to a position very close the the chromosome end:

A00379:576:H7WK3DSX3:4:1640:15926:7686    355   1     249231051   255   151M  =     249231029   129   AAGAAAACAACATGCTTGTGTTCACTGTGGATGTTAAAGCCAACAAGCACCAGATCAGACAGGCTGTGAAGAAGCTCTATGACAGTGATGTGGCCAAGGTCACCACCCTGATTTGTCCTGATAAAGAGCAGATCGGAAGAGCACACGTCTG     FFFFFFFFFFFFF:FFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF     ZW:f:0.0164623

I do have, indeed, the abundance files from kallisto. However, they are quantifying transcripts and I was asked to analyze (at least to start with) differentially expressed genes through DESeq2.

These kallisto results, however, were produced using a very old annotation file:

$ cat 103805-016-011.kallisto.run_info.json

{
        "n_targets": 180253,
        "n_bootstraps": 0,
        "n_processed": 79902493,
        "n_pseudoaligned": 72126238,
        "n_unique": 23983580,
        "p_pseudoaligned": 90.3,
        "p_unique": 30.0,
        "kallisto_version": "0.44.0",
        "index_version": 10,
        "start_time": "Tue Mar  1 16:30:14 2022",
        "call": "kallisto quant --index=references/transcriptome/Homo_sapiens.GRCh37.cdna.all.kallisto.idx --output-dir=103805-016-011_kallisto_output --plaintext --fusion --pseudobam --genomebam --threads=4 --gtf=references/annotation/Homo_sapiens.GRCh37.87.chr.gtf.gz 103805-016-011_R1.fastq.gz 103805-016-011_R3.fastq.gz"
}

I also have, for each sample, another bam file generated through histat2. However, I never used histat2 and I cannot find neither the command used to generate it, nor the annotation file version used.

$ samtools view -H 103805-016-011_dupmarked.bam | grep "@PG"
@PG   ID:hisat2   PN:hisat2   VN:2.2.1    CL:"/mnt/data/rna-sequencing/intermediate/103805-016/hisat2/.snakemake/conda/127b78b4/bin/hisat2-align-s --wrapper basic-0 -x /mnt/data/references/genome/ucsc.hg19_nohap_masked -p 4 --rg-id RNA --rg LB:NEBNextUltraDirectionalRNA --rg PL:Illumina --rg PU:platform_unit --rg SM:103805-016-011 --read-lengths 151 -1 /tmp/1759625.inpipe1 -2 /tmp/1759625.inpipe2"
@PG   ID:MarkDuplicates VN:Version:4.1.9.0      CL:MarkDuplicates --BARCODE_TAG RX --INPUT 103805-016-011.bam --OUTPUT 103805-016-011_dupmarked.bam --METRICS_FILE 103805-016-011_duplicate_metrics.txt --MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP 50000 --MAX_FILE_HANDLES_FOR_READ_ENDS_MAP 8000 --SORTING_COLLECTION_SIZE_RATIO 0.25 --TAG_DUPLICATE_SET_MEMBERS false --REMOVE_SEQUENCING_DUPLICATES false --TAGGING_POLICY DontTag --CLEAR_DT true --DUPLEX_UMI false --ADD_PG_TAG_TO_READS true --REMOVE_DUPLICATES false --ASSUME_SORTED false --DUPLICATE_SCORING_STRATEGY SUM_OF_BASE_QUALITIES --PROGRAM_RECORD_ID MarkDuplicates --PROGRAM_GROUP_NAME MarkDuplicates --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --OPTICAL_DUPLICATE_PIXEL_DISTANCE 100 --MAX_OPTICAL_DUPLICATE_SET_SIZE 300000 --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false   PN:MarkDuplicates PP:hisat2
@PG   ID:samtools PN:samtools PP:MarkDuplicates VN:1.13     CL:samtools view -H 103805-016-011_dupmarked.bam

The mapping should be, at least a little, splicing-aware. Exploring and comparing the results of the kalisto and histat2 bam shows reads spanning multiple exons. However, IGV doesn't show a junctions track for the histat2 file and it has much more "noise" in introns than the kalisto-generated file.

Could you help me quantify genes* for DESeq2 with the data I currently have?

ADD REPLY
1
Entering edit mode

what if the output of the following commands

file 103805-016-011.kallisto.pseudoalignments.bam

and

samtools view -H 103805-016-011.kallisto.pseudoalignments.bam | grep SQ | head
ADD REPLY
0
Entering edit mode

Thanks,

When re-checking everything I realized that other samples also have a similar problem. For some weird reason running htseq-count in all samples it complains always about sample 011, but when running it with another single sample (001), it also crashes in a different (earlier!) line:

$ tail nohup.htseq.test01.out
[...]
45000000 alignment record pairs processed.
Error occured when processing input (record #93268283 in file 103805-016-001.kallisto.pseudoalignments.bam):
  Expected str, got NoneType
  [Exception type: TypeError, raised in _HTSeq.pyx:60]

Which matches as before the very first line where FLAG == 67 or 131, RNAME == and CIGAR != in that sample:

$ samtools view103805-016-001.kallisto.pseudoalignments.bam | awk -v OFS="\t" '{  if(($2==67 || $2==131) && $3=="*" && $6 != "*") {print NR, $0}}' | head -2
93268284    A00379:576:H7WK3DSX3:4:1101:7111:1125     67    *     0     0     151M  *     0     0     CATGGCGCCCCGAACCCTCGTCCTGCTACTCTCGGGGGCTCTGGCCCTGACCCAGACCTGGGCGGGCTCTCACTCCATGAGGTATTTCTTCACATCCGTGTCCCGGCCCGGCCGCGGGGAGCCCCGCTTCATCGCAGTGGGCTACGTGGAC     FFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF     ZW:f:0
93268285    A00379:576:H7WK3DSX3:4:1101:7111:1125     131   *     0     0     151M  *     0     0     TCATCGCAGTGGGCTACGTGGACGACACGCAGTTCGTGCGGTTCGACAGCGACGCCGCGAGCCAGAGGATGGAGCCGCGGGCGCCGTGGATAGAGCAGGAGGGTCCGGAGTATTGGGACGGGGAGACACGGAAAGTGAAGGCCCACTCACA     FFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF     ZW:f:0

The command you asked :

$ file 103805-016-011.kallisto.pseudoalignments.bam
103805-016-011.kallisto.pseudoalignments.bam: Blocked GNU Zip Format (BGZF; gzip compatible), block length 236
$ file 103805-016-001.kallisto.pseudoalignments.bam
103805-016-001.kallisto.pseudoalignments.bam: Blocked GNU Zip Format (BGZF; gzip compatible), block length 236

And the bam file header (identical in both files) has only chromosomes 1-22 + X/Y/MT:

$ samtools view -H 103805-016-011.kallisto.pseudoalignments.bam | grep SQ
@SQ   SN:1  LN:536870911
@SQ   SN:2  LN:536870911
@SQ   SN:3  LN:536870911
@SQ   SN:4  LN:536870911
@SQ   SN:5  LN:536870911
@SQ   SN:6  LN:536870911
@SQ   SN:7  LN:536870911
@SQ   SN:X  LN:536870911
@SQ   SN:8  LN:536870911
@SQ   SN:9  LN:536870911
@SQ   SN:10 LN:536870911
@SQ   SN:11 LN:536870911
@SQ   SN:12 LN:536870911
@SQ   SN:13 LN:536870911
@SQ   SN:14 LN:536870911
@SQ   SN:15 LN:536870911
@SQ   SN:16 LN:536870911
@SQ   SN:17 LN:536870911
@SQ   SN:18 LN:536870911
@SQ   SN:20 LN:536870911
@SQ   SN:19 LN:536870911
@SQ   SN:Y  LN:536870911
@SQ   SN:22 LN:536870911
@SQ   SN:21 LN:536870911
@SQ   SN:MT LN:536870911

I don't see anything specially wrong with these files. Do you?

ADD REPLY
1
Entering edit mode

flags 67 and 131 mean

0x43    67  PAIRED,PROPER_PAIR,READ1
0x83    131 PAIRED,PROPER_PAIR,READ2

how can it be proper pair when both align on the forward strand ... right there a contradiction

as I mentioned before pseudoBams are rarely robust - I advise against treating them as real BAM files,

a pseudoBAM is more of a convenience feature of the tool rather than a primary result

ADD REPLY
0
Entering edit mode

Thank you

I wanted to use htseq-count precisely because I know it can properly handle ambiguous paired reads and things like that. I was also handled "bedtools multicov" count files, but I don't trust those results at all.

a pseudoBAM is more of a convenience feature of the tool rather than a primary result

My problem is... I don't have the original fastq files. I could generate them from the kallisto bams but... can I trust them?

ADD REPLY
1
Entering edit mode

To be honest I don't quite understand what you are trying to do. Why would you need to use htseq count in the first place? The bam file obtained from kallisto is not the kind of BAM file that a genomic quantification would produce. We use htseq count when the alignment reference is a genome. We cannot use the same tool when the alignment reference is a transcriptome.

(Other people also already mentioned this issue in a comment)

What coordinates were you using anyway?

When you run kallisto, the sequence names in your pseudobam will be the transcript names already. You get the counts right away from that file. There is no need to run htseq to count anything,

For example samtools idxstats can tell you how many alignments you have for each transcript.

TLDR: You can use htseq-count for BAM file generated with hisat2 alignment but not with the one generated with kallisto, so getting an error there is maybe doing you a favor ;-) and keeps you from doing something incorrect

ADD REPLY
0
Entering edit mode

For example samtools idxstats can tell you how many alignments you have for each transcript.

$ samtools idxstats 103805-016-011.kallisto.pseudoalignments.bam
1   536870911   11036342    185984
2   536870911   12651146    163834
3   536870911   7759956 119228
4   536870911   4175828 58332
5   536870911   7501100 100789
6   536870911   5893809 74734
7   536870911   11542311    135931
X   536870911   4463352 55458
8   536870911   4393963 56305
9   536870911   5473999 77598
10  536870911   4826989 65788
11  536870911   8769357 112596
12  536870911   8675621 107403
13  536870911   2009532 30312
14  536870911   3416820 54566
15  536870911   6097418 73323
16  536870911   4564179 65592
17  536870911   10270399    159252
18  536870911   1075454 17998
20  536870911   2934014 51186
19  536870911   8776813 127836
Y   536870911   79937   24157
22  536870911   3407136 45913
21  536870911   2037153 31253
MT  536870911   5635325 85150
*   0   0   16126090

To me it looks like the kallisto bams are generated in chromosome space, not in transcriptome space. Yes, the gene lengths are totally wrong because they didn't add the -c flag when running kallisto:

-c, --chromosomes             Tab separated file with chromosome names and lengths
                              (optional for --genomebam, but recommended)


To be honest I don't quite understand what you are trying to do.

Let me explain the situation and the analysis:

A colleague from the department approached me asking to do an "urgent" analysis for a collaboration of differentially expressed genes "similar to what you have already done for us before" (RNA-seq, STAR + htseq-count + DESeq2 analysis at the gene level across 2 conditions).

According to them (wet lab people, no bioinformatics background) it should be very fast because their collaborator had already sent them count files. Thus, I should only do the DEG analysis and not align the fastqs and yadda yadda.

The analysis is very underpowered. There are 2 control samples, 2 case samples with mutation A, and 1 case sample with mutation B. My colleagues were also hoping to get some extra controls generated by the same people, but we still don't know anything about it.

There is no communication between me and the source of this data. This data was generated by the clinical genetics department of a hospital, so they know what they are doing, but I am not familiar with the software they used.

What I ended up getting is:

  • Kallisto
    • abundance files (transcript level)
    • pseudobams (the ones that crash htseq-count. 100-120M reads per file)
  • Hisat2
    • bams (I think no gtf file was used to generate them by reading the commands on @PG in the bam headers. ~110M reads per file)
  • GATK
    • vcfs (irrelevant for my analysis. However, in their header I see that Haplotype Caller was called using "103805-016-011_dupmarked.split.bam". An extra "split" step performed on the hisat2 bams that I don't have in my data)
  • Count files
    • gene, exon, and intron counts. Raw counts, TMM, TPM and norm40M versions for each of them.
    • created from the hisat2 bam files.

According to an email, the count files were generated by running :

bedtools multicov –q 1 –split –bed {bedfile} –bams {bamfile}

but I have no "proof" of the exact command used to generate them. I don't know neither which hisat2 bam file was used (pre- or post-splicing) nor what annotation file version was used. I was provided with a bed file with 54,475 gene/transcript entries, though.


Previously, for a different project, I had conflicting experiences with bedtools multicov as gene quantification method. Some poorly aligned/ambiguous reads will map across exons from two genes and multicov will count them in, while htseq-count will identify them as problematic and kept them out of the analysis. Because of that, I want to avoid using these provided counts files.

The kallisto abundance files look fine, but they are transcript-based. There are 180,254 different transcripts.

With such a small sample size in the project design, can I reliably test individual transcripts with 151bp paired-end reads?

However, I was asked to run gene-level DEG analysis. Can I really use these transcript-level counts to infer DE Genes? I have no experience with kallisto. Can I add up the counts of all transcripts of a gene to create a count of the gene? What about reads that could belong to multiple transcripts? Are they counted multiple times? Is PCR amplification corrected somehow?

Or should I just use tximport to load the kallisto abundance files into DESeq2 and follow a tutorial?

On top of all this, I do suspect that this data was sequenced originally with UMIs (the kallisto command used R1 & R3 .fastq files, and the sample naming convention matches that of a sequencing company I've used for other projects with UMIs to get rid of PCR-amplification bias). If my best option is to convert the kallisto pseudobams into fastqs, I'd better start from scratch with the original files.

As you see, I've been given a halfway project done using tools that I am not familiar with and I was trying to reverse-engineer what has been done and how can I make it work with the tools I'm used to.

ADD REPLY
1
Entering edit mode

Use tximport to get the kallisto files into DESeq. Bedtools is not the best way to get gene counts.

ADD REPLY
1
Entering edit mode

kallisto should not be run a genome sequence. It was not designed for that; I would be surprised if it worked.

I would guess your files are mislabeled as kallisto output when in fact were generated via hisat. Your Kallisto json shows a transcriptome file there.

If you have kallisto output can easily combine transcript-level counts into gene-level counts using tximport or many other techniques; no need to recount

ADD REPLY
0
Entering edit mode

Thank you all, I learned a lot.

ADD REPLY

Login before adding your answer.

Traffic: 1842 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6