Hi everyone,
I am working on a project about copy number variant calling algorithms. One question I am asking is that if GC content correlated with the success rate of CNV calling, and to answer that I would need the actual sequence. However, I don't have the original bam or bed files; I only get the output file from the different CNV calling algorithms, and a corrected consensus file among all of these algorithms. (It would be hard for me to explain how did I calculate consensus, so I would leave that part out for now.)
The information I have now is basically just chromosome and position on the chromosome.
"example: ch1:1453:20918"
I am thinking about use these positions info to line back to hg19 reference data, and get the exact sequence, then calculate the GC content by hand.
I called the getSequence function in biomaRt package, the call returned a dataframe with two columns: sequence and gene symbol. When I choose seqType="cdna", I was expecting the sum of length of each fragment/gene equal to the length of the calling segment(end-start), but it did not.
ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl")
result = getSequence(chromosome=1, start=28028201, end=32976095,type="entrezgene",seqType="cdna", mart=ensembl)
#count all the nucleotides
count <- 0
for (i in 1: nrow(result)){
length <- length(unlist(strsplit(result[i,1], split = "")))
count <- count+length
}
end-start
[1] 4947894
count
[1] 16861601
Here is the detailed info about categories in seqType: The type of sequence returned can be specified by the seqType argument which takes the following values:
'cdna': for nucleotide sequences
'peptide': for protein sequences
'3utr': for 3' UTR sequences
'5utr': for 5' UTR sequences
'gene_exon': for exon sequences only
'transcript_exon_intron': gives the full unspliced transcript, that is exons + introns
'gene_exon_intron' gives the exons + introns of a gene;'coding' gives the coding sequence only
'coding_transcript_flank': gives the flanking region of the transcript including the UTRs, this must be accompanied with a given value for the upstream or downstream attribute
'coding_gene_flank': gives the flanking region of the gene including the UTRs, this must be accompanied with a given value for the upstream or downstream attribute
'transcript_flank': gives the flanking region of the transcript exculding the UTRs, this must be accompanied with a given value for the upstream or downstream attribute
'gene_flank': gives the flanking region of the gene excluding the UTRs, this must be accompanied with a given value for the upstream or downstream attribute
Am I understanding the concept or output wrong? Or should I change some parameters to make this work?
Appreciate all your help in advance.
Please use appropriate tags. Your question is about
biomaRt
. That should have been a tag when you created the question. When you add appropriate tags, users that follow the tag (usually experts interested in helping others in that subject matter) get notified of your question, and this means you stand a better chance at getting a relevant, useful response faster.You can edit your question and add the tag now.