To obtain upstream DNA sequence by giving location
3
0
Entering edit mode
4.0 years ago
bioinfo456 ▴ 150

Hi,

I'm trying to obtain upstream sequence of DNA by using getSeq and I'm getting the following error. Please help.

library(BSgenome.Hsapiens.NCBI.GRCh38)
for (i in range(1:142325)){ if (!check.integer(case_chr[i]))next else up_seq[i] <- getSeq(Hsapiens, "case_chr[i]", start = case_loc[i]-1000, end = case_loc[i] - 1)}

Error in .getOneSeqFromBSgenomeMultipleSequences(x, names[i], start[i],  : 
  sequence case_chr[i] not found

I'm assuming I'm facing this issue because getSeq function takes only chromosome number as its first argument and there are no chromosome numbers for a few rows instead of which there is something like CHR_HSCHR1_1_CTG11. Am I supposed to filter them out?

SNP R genome • 2.0k views
ADD COMMENT
3
Entering edit mode
4.0 years ago
Jeffin Rockey ★ 1.2k

Use bedtools flank (link) with -l,-r,-s options set according to your requirement. This will give you upstream/downstream/both co-ordinate file in bed/gff.

Then using that bed/gff use bedtools getfasta (link). Here please note the -s option.

ADD COMMENT
0
Entering edit mode

Thanks. Could you please tell me about what columns the input file is supposed to contain for bedtools flank?

ADD REPLY
1
Entering edit mode

Bedtools for most of the programs like intersect, coverage, flank, getfasta etc accepts bed, gff & vcf format. VCF is not relevant here. GFF and BED expects different things in different columns and also differs by being 1-based and 0-based repectively. A concise explanation is there at this link. Please see.

Now, if you have just chromosome, start position & end position , you can right away create a bed file. For a gene from 1st to 100th base in chromosome one, the bed entry would be like. chr1 0 100 gene1

Take care to use tab as separator.

If you have a gff or bed from some source you may directly use that. In case there is trouble, just update with a snippet of input file and the command used.

ADD REPLY
0
Entering edit mode

I get the following error when I run the following :

bedtools flank -i case1.vcf -g homo_sapiens_GRCh38.vcf -b 1000

Error: The requested genome file could not be opened.

How do I go about this?

ADD REPLY
1
Entering edit mode

You have to provide genome length file after "-g" option instead of .vcf file. In bedtools flank, they have mention the below note: "In order to prevent creating intervals that violate chromosome boundaries, bedtools flank requires a genome file defining the length of each chromosome or contig"

ADD REPLY
1
Entering edit mode

Uday, Genome file here means a file that gives the names and lengths of scaffolds/chromosome in a multifasta.There are numerous programs avaialble for it or you may write a one liner or so. The file extension does not matter.

ADD REPLY
0
Entering edit mode

Thanks. Where can I find the same for the human genome?

ADD REPLY
1
Entering edit mode

I suppose you have a fasta file dowloaded for human genome. If not, download it from Ensembl or any source your purpose necessitates.

Once you have fasta file, then getting the genome file is only a matter of running a small script like from this page

ADD REPLY
0
Entering edit mode

Yes, I do have the fasta file of each chromosome. Is there a way I could avoid using Python and do the same using terminal? Thanks.

ADD REPLY
1
Entering edit mode
samtools faidx input.fa
cut -f1,2 input.fa.fai > sizes.genome
ADD REPLY
1
Entering edit mode

There is a bioawk option also given at the bottom of the mentioned page.

ADD REPLY
0
Entering edit mode

Thank you so much all of you for your quick response. I'm getting the following error on running bedtools : ERROR: chrom "CHR_HSCHR1_4_CTG3" not found in genome file. Exiting. Any solutions for this?

ADD REPLY
0
Entering edit mode

It means chrom "CHR_HSCHR1_4_CTG3" is present in .vcf file but not in genome size file. Try finding this chromosome ID in your genome size file and if it is not there, then add the same in file.

ADD REPLY
0
Entering edit mode

Genome size file contains just the chromosome number followed by its length. I don't think adding it is a viable solution as I have nearly hundreds of such instances.

ADD REPLY
0
Entering edit mode

Please share the output the below two commands in order to identify what could be wrong.

cat genome-file | sort | head -5
cat bedfile | sort | head -5
ADD REPLY
0
Entering edit mode
cat sizes.genome | sort | head -5
10  133797422
11  135086622
12  133275309
1   248956422
13  114364328

    cat case1.vcf | sort | head -5
10  100098591   rs73336163  G   A   .   .   CSQ=A|intergenic_variant|MODIFIER||||||||||||||||||||
10  100128958   rs7073643   G   A   .   .   CSQ=A|intron_variant&non_coding_transcript_variant|MODIFIER|CYP2C23P|ENSG00000283232|Transcript|ENST00000636357|unitary_pseudogene||2/6||||||||||-1||HGNC|HGNC:39970
10  100129320   rs7898975   C   G,T .   .   CSQ=G|intron_variant&non_coding_transcript_variant|MODIFIER|CYP2C23P|ENSG00000283232|Transcript|ENST00000636357|unitary_pseudogene||2/6||||||||||-1||HGNC|HGNC:39970,T|intron_variant&non_coding_transcript_variant|MODIFIER|CYP2C23P|ENSG00000283232|Transcript|ENST00000636357|unitary_pseudogene||2/6||||||||||-1||HGNC|HGNC:39970
10  100131063   rs12260913  C   G   .   .   CSQ=G|intron_variant&non_coding_transcript_variant|MODIFIER|CYP2C23P|ENSG00000283232|Transcript|ENST00000636357|unitary_pseudogene||2/6||||||||||-1||HGNC|HGNC:39970
10  100132172   rs7096636   C   G   .   .   CSQ=G|intron_variant&non_coding_transcript_variant|MODIFIER|CYP2C23P|ENSG00000283232|Transcript|ENST00000636357|unitary_pseudogene||2/6||||||||||-1||HGNC|HGNC:39970

Does this help in rectifying anything? I'm assuming the bed file to be my output with appropriate ranges mentioned in it. The above mentioned vcf file is the input to the bedtool flank.

ADD REPLY
1
Entering edit mode

Yes. It rectifies many things.

Probably you have downloaded the vcf from some source. The human genome fasta upon which the vcf is generated in not the same human genome fasta file upon which the genome file is created.

Solution: Get the genome file generated on the same fasta file on which the vcf is generated for case1. (Try to identify the particular Hg assembly used in generating the VCF and download it)

ADD REPLY
0
Entering edit mode

Yes indeed, the vcf file I have is an output of Ensembl's Variant effect predictor (VEP GRCh38.p12 assembly), while genome fasta file information is taken from ncbi (same assembly). Any alternatives other than filtering such SNPs?

ADD REPLY
1
Entering edit mode

Looks like even though same assembly from NCBI some thing different in Ensembl.

Alternative.

Go to ensembl page . GRCh38.p12 itself should be there. Go to the download FASTA option. Once downloaded,cross check the first column values in this fasta and your vcf.

If good to go, make genome file with this fasta and redo flank.

ADD REPLY
0
Entering edit mode

First column values in fasta? I'm sorry I didn't get you, could you please elaborate? You want me to download genome fasta chromosome wise? And on finding its length, it's still the same length.

ADD REPLY
1
Entering edit mode

My mistake. I intended the headers of the fasta and the first column of vcf.

To elaborate I expect all the missing entries to be present in the fasta from Ensembl.

Do not download individual chrs. Instead go for the toplevel fasta.

Once downloaded run the script for genome file with lengths,then check the first columns of that with that of the vcf. That should do. If all uniq entries of first column of vcf is not there in genome file then something else still need to be sorted out.

Most likely things will be fine with the Ensembl fasta. Please check.

ADD REPLY
0
Entering edit mode

Alright. Will do as you told and get back to you in sometime.

ADD REPLY
0
Entering edit mode

I'm getting the following error when i try to run the script for genome file using bioawk :

bioawk: can't open file Homo_sapiens.GRCh38.dna._sm.chromosome.1.fa
     source line number 1
ADD REPLY
1
Entering edit mode
ADD REPLY
0
Entering edit mode

Worked! Thanks a ton. Now I have the following :

1   1067800 rs55746161
1   1068802 rs55746161
1   1129419 rs11580120
1   1130421 rs11580120
1   1129716 rs61766345
1   1130718 rs61766345

Just a minor hiccup, how can I obtain it in proper .bed format which can be passed as an argument to bedtools getfasta? As in chr start stop format?

ADD REPLY
0
Entering edit mode

What is the snippet pasted?

ADD REPLY
0
Entering edit mode

Output of bedtools flank. But apparently the input format for bedtools getfasta is chr start stop.

ADD REPLY
0
Entering edit mode

Please share the exact flank command used.

ADD REPLY
0
Entering edit mode

Problem solved! Thanks :).

ADD REPLY
0
Entering edit mode

Do you know what happens if bedtools flank runs into a neighbouring feature? Does it consider/report it somehow?

ADD REPLY
0
Entering edit mode

I have not seen any reporting/error/warning.

While using flank for promoters the scenario happens all the time.

If you take 1kb or 2kb upstream of a gene start, in many plant genomes I have noticed that it would run inside the gene region of some other gene.

Usually there is limiting if the coordinate is going beyond the chromosome/scaffold length, for which purpose, genome file is made required.

ADD REPLY
0
Entering edit mode

I see, could be even more extreme in case of fungi with their dense genomes. That'd be a nice option do add for bedtools.

ADD REPLY
1
Entering edit mode
4.0 years ago
Tm ★ 1.1k

You can give a try to seqkit subseq with -d (for downstream seq) and -u (for upstream sequence) option or "--only-flank" option to get only up/down stream sequence based on your requirement.

ADD COMMENT
1
Entering edit mode
4.0 years ago
ATpoint 66k

If you want it in R for any given coordinate:

require(BSgenome.Hsapiens.UCSC.hg38)
require(Biostrings)

GetSeqs <- function(BSgenome, Coords){

  chr <- strsplit(Coords, split=":")[[1]][1]
  tmp <- strsplit(
         strsplit(Coords, split=":")[[1]][2], split="-")[[1]]
  Start <- tmp[1]
  End   <- tmp[2]
  return( getSeq(BSgenome, chr, as.numeric(Start), as.numeric(End))) 
}

## Example, requiring coordinates in UCSC format:
GetSeqs(BSgenome.Hsapiens.UCSC.hg38, "chr3:187745272-187745283") # ATTTGGCAAGAG

You can also directly copy the sequence to the clipboard: in case that is intended:

require(BSgenome.Hsapiens.UCSC.hg38)
require(Biostrings)
require(clipr)

GetSeq2Clipboard <- function(BSgenome, Coords){

  chr <- strsplit(Coords, split=":")[[1]][1]
  tmp <- strsplit(
          strsplit(Coords, split=":")[[1]][2], split="-")[[1]]
  Start <- tmp[1]
  End <- tmp[2]
  clipr::write_clip(toString(getSeq(BSgenome, chr, as.numeric(Start), as.numeric(End))))
}

## Example:
GetSeq2Clipboard(BSgenome.Hsapiens.UCSC.hg38, "chr3:187745272-187745283")
ADD COMMENT
0
Entering edit mode

Thanks ATpoint. I'd still like to understand working with bedtools. Your help is much appreciated.

ADD REPLY

Login before adding your answer.

Traffic: 2662 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6