Question

Loop through file to assign alphabets to genefeatures for the read intersects

0

Entering edit mode

11 months ago

L_bioinfo • 0

I have bed intersect file wherein the gene feature is intersected by multiple reads. I assumed based on similarity I can filter the data however tools like blat and magic-blast didn't help as the reads span more than 5000 bp and covered outside of the gene feature / or covers the gene feature in fragments.

Now I wish to retrieve the fasta sequence of gene features that has been covered by the read. any suggestions on how to select the coordinates?

Additionally, if a gene feature is covered multiple fragments of reads at different coordinates without overlaps is there a way to assign alphabets to the gene feature ID?

File1:

<contig_name> <querystart> <querystop><queryID> <querystrand_direction> <featurestart> <featurestop> <featureID> <feature_strand>

input

contig    13000    14000    pac34    +    10000    16000     ID_84    +
contig    14500    15784    pac75    +    10000    16000     ID_84    +

out

contig    13000    14000   pac34    +     10000     16000    ID_84a    +
contig    14500    15784   pac75    +     10000     16000   ID_84b    +

bedintersect python awk • 413 views

ADD COMMENT • link 11 months ago by L_bioinfo • 0

0

Entering edit mode

Now I wish to retrieve the fasta sequence of gene features that has been covered by the read. any suggestions on how to select the coordinates?

You can use samtools faidx to retrieve sequences from the fasta file. http://www.htslib.org/doc/samtools-faidx.html

Additionally, if a gene feature is covered multiple fragments of reads at different coordinates without overlaps is there a way to assign alphabets to the gene feature ID?

What do you mean by assigning alphabets?

<contig_name> <querystart> <querystop> <querystrand_direction> <featurestart> <featurestop> <featureID> <feature_strand>

Your file has 9 columns but your header has 8 names.

ADD REPLY • link 11 months ago by barslmn ★ 2.1k

0

Entering edit mode

I used getfasta to retrieve sequence however i am not sure about selecting the coordinates in case a gene feature is hit by different reads at the same location. It would mean I should compare the similarity however tools like blat/magic-blast returns identity only for a small percentage of sequence in the mapped region.

I have edited the question for the requirement.

ADD REPLY • link 11 months ago by L_bioinfo • 0