Question: Convert from fasta to fastq
3.2 years ago by
rere.adel201240 wrote:

I have data files with many columns, but I only need the DNA sequences. I want to convert them to FASTA and then to FASTQ, or convert them directly to FASTQ. The files are gigabytes in size. What should I do? This is the data we have (TSV format); we need to convert it to FASTQ. Here is a screenshot of it.

dna format sequence • 2.6k views
ADD COMMENTlink modified 3.2 years ago by genomax70k • written 3.2 years ago by rere.adel201240

this is unclear

ADD REPLYlink written 3.2 years ago by Pierre Lindenbaum122k

Please post the format of the data as you currently have it.

ADD REPLYlink written 3.2 years ago by Vincent Laufer1.1k

This is the data we have (TSV format); we need to convert it to FASTQ. That is a screenshot of it.

ADD REPLYlink modified 3.2 years ago by genomax70k • written 3.2 years ago by rere.adel201240

Based on your Excel table, I'm very concerned that you are trying to do something you probably shouldn't. If we knew more about where this data came from and what you want to do with it, we could help you in ways other than FASTA -> FASTQ.

I don't think your Excel table is in a common format (that I know of) that can be converted directly into FASTA, so that will be your biggest problem. You may have to cut out just the sequences you want to use from Excel - assuming you know what each of the allele columns means, because I don't - and save them as a plain .txt file. Then perhaps someone can write you a script to convert that into FASTA.

However, one thing no one can help you do is convert FASTA to FASTQ with real quality scores. FASTQ is FASTA plus the per-base quality scores from the sequencer. That sequencing-quality information doesn't seem to be in your Excel table, so I don't think that will be possible at all :(
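
To make the point about quality scores concrete, here is a minimal Python sketch of what a FASTQ record carries beyond the sequence. It assumes the common Phred+33 encoding; the read name, the example record and the helper name phred_scores are made up for illustration.

```python
def phred_scores(quality: str, offset: int = 33) -> list[int]:
    """Decode a FASTQ quality string into per-base Phred scores.

    Assumes Phred+33 encoding unless another offset is given.
    """
    return [ord(ch) - offset for ch in quality]


# A FASTQ record is four lines: header, sequence, separator, quality.
# The first two lines are what FASTA already holds; the last line is
# the sequencer-derived part that a FASTA file or a table of alleles lacks.
record = ["@read1", "ACGT", "+", "II5#"]
header, sequence, _, quality = record

# One quality character per base.
assert len(sequence) == len(quality)
print(list(zip(sequence, phred_scores(quality))))
```

Without that fourth line from the instrument, any quality values written to a FASTQ file are placeholders.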

ADD REPLYlink written 3.2 years ago by John12k

We downloaded the data from the Complete Genomics website (Cancer Data Set). We need to align the DNA sequences using SparkBWA, so we need the data in FASTQ format, and then apply filters to get variants and mutations. This is my graduation project. What should I do? :(

ADD REPLYlink written 3.2 years ago by rere.adel201240

For some reason I can't see the examples you posted. But if you downloaded data from Complete Genomics, aren't those already variant calls in tabular format?

ADD REPLYlink written 3.2 years ago by WouterDeCoster40k

You should be able to see the shared image from Google Drive now (check a few posts up).

ADD REPLYlink written 3.2 years ago by genomax70k

No, the variant calls are in another file; this file is supposed to contain the sequences.

ADD REPLYlink written 3.2 years ago by rere.adel201240

Please copy and paste a couple of example rows (change the sequence/identifiable info if you must) here. It is hard to see anything from the screenshot you have shared.

ADD REPLYlink written 3.2 years ago by genomax70k
1
#ASSEMBLY_ID    HCC1187-H-200-36-ASM-N1                                                         
#CHROMOSOME chr1                                                            
#FORMAT_VERSION 2                                                           
#GENERATED_AT   2012-Jan-14 08:27:41.197287                                                         
#GENERATED_BY   ExportEvidence                                                          
#SAMPLE GS00258-DNA_F01                                                         
#SOFTWARE_VERSION   2.0.2.15                                                            
#TYPE   EVIDENCE-INTERVALS                                                          

>IntervalId Chromosome  OffsetInChromosome  Length  Ploidy  AlleleIndexes   EvidenceScoreVAF    EvidenceScoreEAF    Allele0 Allele1 Allele2 Allele3 Allele1Alignment    Allele2Alignment    Allele3Alignment        
0   chr1    552 55  2   1   2   0   0   CAGAGGACAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGA CAGAGGACAACGCAGCTCCGTCCTCGCGGTGCTCTCCGGGTCTGTGCTGAAGAGA CAGAGGAGAACGCAGCTCCGCCCTCGCGATGCTCTCCGGGTGTGTGCTAAGGAGA     55M 55M     
1   chr1    612 17  2   0   1   0   0   ACTCCGCCGGCGCAGGC   ACTTCACCGGCGCAGGC           17M         
2   chr1    959 41  3   1   2   3   1358    1231    GAAACTCACGTCACGGTGGCGCGGCGCAGAGACGGGTAGAA   GAAACTCACGTCACGGCGGCGCGGCGCAGAGACGGGTGAAA   GAAACTCACGTCACGGCGGCGCGGCGCAGAGACGGGTGGAA   GAACCTCACGTCACGGTGGCGCGGCGCAGAGACGGGTAGAA   41M 41M 41M
ADD REPLYlink modified 3.2 years ago by genomax70k • written 3.2 years ago by ptb_ee60
1

I'm trying to find out which file you actually downloaded. Could you share the path on the FTP site or the complete filename? I have the impression that I'm looking at something about structural variants.

ADD REPLYlink written 3.2 years ago by WouterDeCoster40k
1

This is the complete file name: <evidenceintervals-chr1-hcc1187-h-200-36-asm-n1>. The FTP site hasn't been working for some days.

ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by ptb_ee60

Right, evidence intervals. Have my doubts you can convert this to fast(a/q). (And definitely my doubts whether it's meaningful.) If I'm not terribly mistaken these intervals just describe the alleles at each variant site, not something you can directly use to reproduce the alignment. Could you expand on what exactly you would like to achieve, and why?

Note that you can convert evidenceDnbs-type files to SAM files (showing alignments at variant sites) using cgatools evidence2sam. Would this help you?

ADD REPLYlink written 3.2 years ago by WouterDeCoster40k
1

At first, we needed DNA sequence data from cancer patients and from healthy people, so that we could call variants by alignment and then compare the variants from healthy people with those from patients to find cancer variants. So we need to extract the sequences from this data to do the alignment and the rest of the variant discovery process, but our background in bioinformatics isn't strong, as our major is computer engineering. This is our graduation project.

ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by ptb_ee60

If you only need to do alignments then you don't need to convert data to fastq format.

ADD REPLYlink written 3.2 years ago by genomax70k

We work with SparkBWA, so we need FASTQ. We need cancer data in FASTQ or BAM format. Where can we get it?

ADD REPLYlink written 3.2 years ago by rere.adel201240

If this is data from Complete Genomics, then you would have to get the FASTQ or BAM files from there (I don't have experience with CG, but I assume this may be proprietary data).

If you are looking for public cancer data, then the TCGA data portal and/or the ICGC portal would be your options.

ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by genomax70k

I couldn't work out how to use these sites, and the files I downloaded are in a strange format in which I couldn't find any DNA sequence. Here is the data I found:

sample_id   sample_type matched_sample_id   donor_id    diagnosis_id    sex age_at_diagnosis    age_at_recruitment  biobank_id  consent consent_version irb_approval_acquired   last_follow_up_date donor_record_created_date   donor_record_last_update_date   donor_record_release_date   therapy_type    therapy_response    disease_site    tumour_sample_anatomic_location primary_tumour_type primary_metastatic_recurrent    clinical_staging    pre_or_post_tx_collected    tissue_type diagnosis_record_created_date   diagnosis_record_last_update_date   diagnosis_record_release_date   quantity_on_hand    collection_date sample_freezing_method  sample_record_created_date  sample_record_last_update_date  sample_status   optical_image_stained_section   pathological_m  pathological_n  pathological_t  pathology_stage_grouping    percent_intact_tumour_cells storage_medium  tissue_fixation_protocol    
749710  cell line   914566  649719  668681  female      52      Yes     Yes     2010-02-25  2010-02-25              Breast      Breast cancer   primary         malignant   2010-02-25  2010-02-25                  2010-02-25  2010-02-25                                      
749711  cell line   911798  649720  668682  female      41      Yes     Yes     2010-02-25  2010-02-25              Breast      Breast cancer   primary         malignant   2010-02-25  2010-02-25                  2010-02-25  2010-02-25                                      
749712  cell line   911799  649721  668683  female      43      Yes     Yes     2010-02-25  2010-02-25              Breast      Breast cancer   primary         malignant   2010-02-25  2010-02-25                  2010-02-25  2010-02-25                                      
749713  cell line   911800  649722  668684  female      44      Yes     Yes     2010-02-25  2010-02-25              Breast      Breast cancer   primary         malignant   2010-02-25  2010-02-25                  2010-02-25  2010-02-25                                      
749714  cell line   911801  649723  668685  female      23      Yes     Yes     2010-02-25  2010-02-25              Breast      Breast cancer   primary         malignant   2010-02-25  2010-02-25                  2010-02-25  2010-02-25                                      
749709  cell line   911802  649718  668680  female      61      Yes     Yes     2010-02-25  2010-02-25              Breast      Breast cancer   primary         malignant   2010-02-25  2010-02-25                  2010-02-25  2010-02-25                                      
749715  cell line   911803  649724  668686  female      48      Yes     Yes     2010-02-25  2010-02-25              Breast      Breast cancer   primary         malignant   2010-02-25  2010-02-25                  2010-02-25
ADD REPLYlink modified 3.2 years ago by genomax70k • written 3.2 years ago by rere.adel201240

That appears to be just a metadata table. You may need to apply for access to get the TCGA data.

If you just want some cancer data, there are plenty of options in EBI-ENA. You will be able to find the FASTQ files once you drill down to samples (there are no aligned files here; you will have to create them yourself).

You may also want to take a look at Cancer cell line data here. This does not require authorization.

ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by genomax70k
3.2 years ago by
genomax70k
United States
genomax70k wrote:

For those who may reach this thread by way of a search in the future: you can convert a FASTA file to FASTQ format using reformat.sh from the BBMap suite.

Please remember that the Q-scores created here are fake (the example below sets the Q-score to 35 for all bases).

reformat.sh in=test.fa out=fake.fq qfake=35
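
If BBMap is not available, the same idea can be sketched in plain Python. This is only an illustration of the conversion, not how reformat.sh is implemented; the function names read_fasta and fasta_to_fastq are made up here, and Phred+33 encoding is assumed for the fake qualities.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs, joining wrapped sequence lines."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def fasta_to_fastq(src: TextIO, dst: TextIO, fake_q: int = 35) -> int:
    """Write FASTQ records with one constant, fake quality per base.

    Assumes Phred+33 encoding, so Q35 is written as chr(35 + 33) == "D".
    Returns the number of records written.
    """
    qchar = chr(fake_q + 33)
    count = 0
    for name, seq in read_fasta(src):
        dst.write(f"@{name}\n{seq}\n+\n{qchar * len(seq)}\n")
        count += 1
    return count


# Small in-memory demonstration; real use would pass open file handles.
demo_in = io.StringIO(">r1\nACGT\nAC\n>r2\nGG\n")
demo_out = io.StringIO()
fasta_to_fastq(demo_in, demo_out)
print(demo_out.getvalue(), end="")
```

Because records are streamed one at a time, this should also cope with files in the gigabyte range, although a dedicated tool will be faster.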

ADD COMMENTlink modified 3.2 years ago • written 3.2 years ago by genomax70k