Generate a counts matrix with paired-end non-stranded samples
1
1
Entering edit mode
6.1 years ago
kpr ▴ 80

Newbie question here, sorry. I am trying to generate a counts matrix for paired-end non-stranded data. I feel pretty good about the steps I have taken so far, but this is the first time I have used this particular type of data. The counts matrix I have generated has allele specific reads which I wasn't anticipating, and the counts seem vary off, especially between the two alleles. When generating a counts matrix with allele specific information, is there any special steps I need to consider? Also is there a way to get non-allele specific read counts?

I think a potential issue maybe with samtools or ht-seq. Feedback appreciated. I am willing to switch to bioconductor/R tools also. I don't have too much experience outside with RNA-seq outside of these programs though.

samtools sort -n accepted_hits.bam  sample_sorted
samtools view sample_sorted.bam >sample.sam

htseq-count -s no --idattr=Parent sample.sam organism.gff > sample.counts

What would be helpful for me to know: Does it look like I'm on the right track? If yes, why are my allele reads so different? If no, what steps did I overlook, and do you have suggestions?

RNA-Seq R software error rna-seq gene • 2.4k views
ADD COMMENT
0
Entering edit mode

If yes, why are my allele reads so different?

Different how? Could you post an example?

What is the organism and are you mapping to a reference genome containing allele-specific information?

ADD REPLY
0
Entering edit mode

I have a wild, mutant1, and mutant2, 4 replicates each, below is a subset of the counts matrix that was generated. The rows are genes, and I think the A and B allele are marked at the end of the gene name. This is only a small subset, but there are two concerns. The first is that I would expect the alleles to be fairly similar, in some cases the reads between alleles differ by several hundred counts. The second concern is I will occasional see an extreme observation (such as C2_09060C) in which one sample's count differs by several thousand for one replicate but not the others. For example, for C2_09060C Wild, Mutant1, and Mutant2 had about the same numbers for all observations and replicates expect for one sample (Mutant2 below). There are quite a few occurrences of this.

                 Wild    Mutant1  Mutant2
C2_09050C_A       527        517      264
C2_09050C_B       512        458      271
C2_09060C_A     12545      12171     4882
C2_09060C_B     11306      11292     4447
C2_09070C_A       283        287      284
C2_09070C_B         2        292      311
C2_09080C_A         0          0        0
C2_09080C_B         0          0        0
ADD REPLY
0
Entering edit mode

Candida albicans, and I believe the reference genome contains allele specific information.

ADD REPLY
0
Entering edit mode

Indeed, it seems the reference genome assembly is diploid. I don't think HTSeq is the right tool for the job, I would search for allele-specif expression pipelines, or read some papers dealing with RNAseq analysis of Candida albicans.

ADD REPLY
0
Entering edit mode

If I use a haploid reference genome, will it fix the issue?

ADD REPLY
1
Entering edit mode

See my answer to your earlier post C Albicans Genome Count Matrix Total Genes: by selecting just one chromosome set, you may be leaving out a small number of features (and possibly of reference DNA, if there are structural variants between chromosome sets A and B). How small are the differences, I don't know.

ADD REPLY
3
Entering edit mode
6.1 years ago

WARNING! Phased genome :)

As you have correctly noticed, C. albicans is one of few species which has a phased genome, i.e. sequence from both parental chromosomes (but keep in mind that in particular case of C. albicans we don't know if all A chromosomes are from one parental, and B chromosomes from another, here is the paper that generated this assembly is you are curious).

To my knowledge, today most of the software and pipelines (especially mappers and variant callers) are generally not well adapted in dealing with phased data. So if you are not really interested in allele-specific expression, its better to use reference genome and annotations from one parental (just subset fasta file and gff). If you are interested in allele-specific expression and will use phased genome, then you should expect to see a huge number of multimapped reads, because parental are very similar in DNA sequence. To avoid this at some extent you can restrict the number of allowed mismatches in settings of your mapper for example to 1 or 0 (in latter case you will miss the reads with sequencing errors though). In STAR mapper, these options are --outFilterMismatchNmax, --outFilterMismatchNoverLmax. More details are in manual.

And to speed up things, I would recommend you to use STAR mapper with option --quantMode GeneCounts, it will automatically generate count table of reads identical to HTSeq.

ADD COMMENT
0
Entering edit mode

THANKS! Sorry if this is a dumb question, but this is my first phased genome. I am not interested in allele-specific expression. How can I subset the fastq and gff file to only just get one allele? Your help is appreciated :)

ADD REPLY
0
Entering edit mode

For reference genome taken form CGD, you can use this simple script:

import re
import sys
import os
from Bio import SeqIO


A_hapl = open("A_hapl.fasta", "w")
B_hapl = open("B_hapl.fasta", "w")

for seq_record in SeqIO.parse(sys.argv[1], "fasta"):
    name = strseq_record.id)
    seq = str(seq_record.seq)
    if re.search(r'.*chr.A.*',name):
        A_hapl.write("%s%s\n"%(">",name))
        A_hapl.write("%s\n"%(seq))
    elif re.search(r'.*chr.B.*', name):
        B_hapl.write("%s%s\n"%(">",name))
        B_hapl.write("%s\n"%(seq))
    else:
        A_hapl.write("%s%s\n"%(">",name))
        A_hapl.write("%s\n"%(seq))
        B_hapl.write("%s%s\n"%(">",name))
        B_hapl.write("%s\n"%(seq))

A_hapl.close()
B_hapl.close()

It will generate files with A haplotype and B haplotype (A_hapl.fasta, B_hapl.fasta, respectively).

For gff file from CGD, you just can do like:

grep -e "A_C_albicans" -e "chrM_C_alb" C_albicans_SC5314_A22_current_features.gff > A_hapl.gff

Hope this helps

ADD REPLY
0
Entering edit mode

This is helpful! I have a fastq file, but I think I get the general idea.

ADD REPLY
0
Entering edit mode

So basically you need to remap your reads to one of the hyplotypes.

ADD REPLY
0
Entering edit mode

I created the haploid gff file. but It's not clear to me how to use the script. Can I use R?

ADD REPLY
1
Entering edit mode

It is not clear to me what your question is about. Could you open a new thread, and explain carefully what you have done so far, and what you are attempting to accomplish?

ADD REPLY
0
Entering edit mode

Thanks for your comment, I was having the same question as this one. I was trying to get the haploid genome by using this script but I discovered a missed ( in the script and it's solved now name = strseq_record.id) should be name = str(seq_record.id)

ADD REPLY
1
Entering edit mode

IN R:

install.packages("seqinr")
library(seqinr)

# Read in fasta file 
fileName = 'CA.fasta'
the.file = read.fasta(fileName)

M = grep("M_", names(the.file))
B_alleles = grep("B_", names(the.file))
A_alleles = grep("A_", names(the.file))

write.fasta(the.file, names(the.file)[sort(c(B_alleles, M))], "CA_A.fasta", open = "w")
write.fasta(the.file, names(the.file)[sort(c(A_alleles, M))], "CA_B.fasta", open = "w")
ADD REPLY
0
Entering edit mode

Ok, thank you .. I got help with paython and it worked but there was a typo on name = strseq_record.id) should be name = str(seq_record.id)

but I'll try this one on R also :)

ADD REPLY
0
Entering edit mode

There is a typo on paython as well...

ADD REPLY

Login before adding your answer.

Traffic: 1959 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6