Question

How can I find a DNA sequence that codes for a protein from a large FASTQ file?

23 months ago

01167507 ▴ 10

Hello, I have a 48 GB FASTQ file and I would like to extract from it the sequence coding for ANY protein. If anyone has recommendations on how to do this, I would really appreciate it. I tried using galaxyproject to convert it to a FASTA file to start with, but the file is too big. Also, I am using macOS. Background: it's for a class project comparing a protein (bioinformatically) from an extinct mammoth species to its living relatives. It's just for practice/learning purposes, with the focus on the protein itself, but this first step is proving to be the most difficult. Thank you for your help :)

conversion protein FASTQ extinct macOS mammoth • 1.4k views

updated 23 months ago by cmdcolin ★ 3.8k • written 23 months ago by 01167507 ▴ 10

score 0 · Answer 1 · 2022-05-21

I think most people on this forum will tell you that you should solve your homework independently. That is the whole point of learning, and if your instructor(s) wanted to give you hints, they would have done so.

To get you started, SeqKit can convert FASTQ to FASTA format. If that doesn't work for some reason, BBTools is Java-based and has programs that can do the conversion. From there you should be able to figure out which flavor of BLAST will let you search that FASTA database.
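If neither tool is at hand, the conversion itself is simple enough to sketch in plain Python. The following is only an illustration, not what SeqKit or BBTools do internally: it assumes strict four-line FASTQ records (no wrapped sequence lines), and the function names are made up for this example. It streams the file one record at a time, so memory use stays small even for a 48 GB input.

```python
import gzip
from typing import Iterator, TextIO, Tuple


def open_text(path: str, mode: str = "rt") -> TextIO:
    """Open a plain or gzip-compressed file in text mode."""
    if path.endswith(".gz"):
        return gzip.open(path, mode)
    return open(path, mode)


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs, assuming strict 4-line FASTQ records."""
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records or at end of file
        if not header.startswith("@"):
            raise ValueError(f"Malformed FASTQ header: {header!r}")
        sequence = handle.readline().strip()
        handle.readline()  # '+' separator line, ignored
        handle.readline()  # quality line, not needed for FASTA
        yield header[1:].strip(), sequence


def fastq_to_fasta(fastq_path: str, fasta_path: str) -> int:
    """Stream records from FASTQ to FASTA; return the number of records written."""
    count = 0
    with open_text(fastq_path) as src, open_text(fasta_path, "wt") as dst:
        for name, sequence in read_fastq(src):
            dst.write(f">{name}\n{sequence}\n")
            count += 1
    return count
```

A call such as `fastq_to_fasta("reads.fastq.gz", "reads.fasta")` (file names are placeholders) would write one FASTA record per read. Expect it to be much slower than a dedicated tool on a file of this size.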

score 0 · Answer 2 · 2022-05-21

I will add one more idea. Once you have the DNA sequence coding for the protein you are looking for, you can align the reads against that sequence (using an NGS read aligner) and then simply extract the reads that map from the alignment file. While this is generally not recommended when you have sequence from the entire genome, for your exercise it may be enough.

BTW: This is a tall ask to do on a laptop.
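As a rough illustration of the "extract reads that map" step: if the aligner writes text SAM output, mapped records can be picked out by checking the FLAG column, where bit 0x4 marks an unmapped read. This is a minimal sketch with a hypothetical function name; in practice a tool such as samtools would normally do this filtering.

```python
from typing import Iterable, Iterator, Tuple

UNMAPPED = 0x4  # SAM FLAG bit meaning "segment unmapped"


def mapped_reads(sam_lines: Iterable[str]) -> Iterator[Tuple[str, str, int]]:
    """Yield (read name, reference name, 1-based position) for mapped records."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header lines (@HD, @SQ, ...) carry no alignments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record (11 mandatory columns)
        flag = int(fields[1])
        if flag & UNMAPPED:
            continue  # read did not align to the target sequence
        yield fields[0], fields[2], int(fields[3])
```

Iterating over an open SAM file, e.g. `for name, ref, pos in mapped_reads(open("aln.sam")):` (placeholder file name), lists the reads that hit the target sequence and where they landed.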

score 0 · Answer 3 · 2022-05-21

FASTQ files generally contain raw reads, and you would not normally look for protein sequences in raw reads. Possible reasons: short reads only carry fragments of a protein-coding sequence, and long reads like PacBio/Nanopore often have errors that make them not really suitable to use in their raw form (short reads also have errors). Either way, to my knowledge it's not common to look for protein sequences directly on reads.
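To make the "fragments" point concrete, here is a small sketch that translates a single read in all six reading frames using the standard genetic code. The function names are invented for this example. A 150 bp read gives at most 50 residues per frame, usually interrupted by stop codons in most frames, and nothing in the read itself says which frame (if any) is the real one.

```python
from typing import Dict, List

# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE: Dict[str, str] = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons; '*' marks a stop, 'X' an ambiguous codon."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame(read: str) -> List[str]:
    """Translate a read in three forward and three reverse-complement frames."""
    frames = []
    for strand in (read, reverse_complement(read)):
        for offset in range(3):
            frames.append(translate(strand[offset:]))
    return frames
```

For example, `six_frame("ATGGCCTAA")` returns `["MA*", "WP", "GL", "LGH", "*A", "RP"]`: six different peptide fragments from one nine-base sequence.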

Alternative approaches could include:

1) (harder) Use a de novo genome assembler to assemble your FASTQ reads into a genome assembly (FASTA), then run gene prediction on that assembly. This would give you your protein sequences. Doing a genome assembly is quite laborious. De novo genome assemblers include miniasm, ABySS, etc. Gene prediction tools include MAKER.

2) (easier) Align the reads from the FASTQ file to a species that already has a reference genome, e.g. http://hgdownload.soe.ucsc.edu/downloads.html#elephant (this is similar to what GenoMax suggests). Then you can check, for example, where the P53 gene is on the elephant genome and how the reads from the mammoth genome align in that region. You could determine the variants using a program called a variant caller. This process has caveats: you have to be careful interpreting the data, as you are aligning reads to a different species. But in all data analysis you have to be careful and aware of caveats. Aligning reads can be done with tools like Bowtie, BWA, or minimap2. Variant callers include tools like bcftools or GATK. You can then determine the effect of those variants using variant effect prediction, which tells you the consequence on the protein, with tools like Ensembl VEP or SnpEff.
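The last step of option 2, going from called variants to a protein consequence, can be sketched as follows. This is a toy illustration, not how Ensembl VEP or SnpEff work: it assumes you already have the spliced coding sequence (CDS) of the gene from the reference species, that variants are single-base substitutions given as 0-based positions within that CDS, and that the CDS is on the forward strand. All function names are hypothetical.

```python
from typing import Dict, List, Tuple

# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE: Dict[str, str] = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

Snv = Tuple[int, str, str]  # (0-based CDS position, reference base, alternate base)


def translate(cds: str) -> str:
    """Translate complete codons; '*' marks a stop, 'X' an ambiguous codon."""
    cds = cds.upper()
    return "".join(
        CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, len(cds) - 2, 3)
    )


def apply_snvs(cds: str, snvs: List[Snv]) -> str:
    """Return the CDS with each substitution applied, checking the reference base."""
    bases = list(cds.upper())
    for pos, ref, alt in snvs:
        if bases[pos] != ref.upper():
            raise ValueError(f"Reference base mismatch at CDS position {pos}")
        bases[pos] = alt.upper()
    return "".join(bases)


def protein_changes(cds: str, snvs: List[Snv]) -> List[str]:
    """List amino-acid substitutions such as 'A2V' (reference, 1-based residue, new)."""
    ref_protein = translate(cds)
    alt_protein = translate(apply_snvs(cds, snvs))
    return [
        f"{r}{i + 1}{a}"
        for i, (r, a) in enumerate(zip(ref_protein, alt_protein))
        if r != a
    ]
```

With the toy CDS `"ATGGCCTAA"`, the substitution `(4, "C", "T")` changes codon GCC to GTC and is reported as `["A2V"]`, while `(5, "C", "T")` gives GCT, which still encodes alanine, so the result is an empty list. Real annotation tools additionally handle genome coordinates, introns, strand, and insertions/deletions.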