My colleague's comment links to a thread that is more than 4 years old, and a lot changes in 4 years, so please allow me to provide updated suggestions.
Hey,
It is great that you are looking to build a portfolio of practical, beginner-level bioinformatics projects, especially given your familiarity with NCBI, Galaxy, command-line tools, and basic Python/R. Focusing on 1-3 small projects is a smart approach, as it allows you to demonstrate core skills like data retrieval, basic analysis, and visualization without overcomplicating things. These projects can highlight your ability to work with real data, which is often what entry-level roles look for in bioinformaticians.
For recommended domains to start with, I suggest beginning with sequence analysis, as it builds foundational skills in handling biological sequences and using common tools like BLAST or alignment software. From there, move into genomics or transcriptomics, where you can explore gene annotation or simple differential expression—areas that are accessible with your background and highly relevant to many labs. Avoid jumping straight into more complex fields like proteomics or multi-omics until you have these basics down, as they often require deeper statistical knowledge.
Here are a few beginner-friendly project ideas that align with these domains. Each can be completed in a few weeks, using public data and free tools, and they emphasize practical outputs like reports, visualizations, or GitHub repos to showcase your work:
Basic Sequence Analysis and Alignment Project: Retrieve DNA or protein sequences from NCBI and perform alignments using tools like BLAST (to find similar sequences) or Clustal Omega (to build a multiple sequence alignment). For example, compare sequences from different species to identify conserved regions, then parse and summarize the results with Python's Biopython library. This demonstrates data querying, alignment algorithms, and simple scripting, all of which are key for entry-level roles.
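Simple Genomic Variant Calling: Download a small sequencing dataset (e.g., bacterial or viral reads with a matching reference genome) and use command-line tools like BWA for mapping reads, samtools for sorting and indexing the alignments, and GATK or bcftools for variant detection (the variant-calling functionality that used to be in samtools now lives in bcftools). Analyze the variants for potential functional impacts using an annotation tool such as ANNOVAR or SnpEff. This project highlights your command-line proficiency and introduces genomics pipelines, which are common in research settings.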
Introductory Transcriptomics Analysis: Use a public RNA-seq dataset to perform differential gene expression analysis with DESeq2 or edgeR in R. Start with quality control via FastQC, align reads with HISAT2, count reads per gene (for example, with featureCounts), and create heatmaps or volcano plots. Focus on a straightforward comparison, like treated vs. untreated samples, to show statistical analysis and visualization skills.
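To give a feel for the scripting side of the first project, here is a minimal sketch of how conserved regions could be identified once you already have a multiple sequence alignment. It is pure Python and makes several assumptions of my own: the function names, the threshold, the minimum region length, and the toy sequences are all made up for illustration. In a real project you would read the alignment produced by Clustal Omega (for example, with Biopython's AlignIO.read) instead of hard-coding strings.

```python
from collections import Counter


def column_conservation(aligned_seqs):
    """For each alignment column, return the fraction of sequences that
    carry the most common non-gap character in that column."""
    if not aligned_seqs:
        raise ValueError("need at least one sequence")
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("aligned sequences must all have the same length")

    scores = []
    for column in zip(*aligned_seqs):
        residues = [char for char in column if char != "-"]
        if not residues:
            scores.append(0.0)  # column consists only of gaps
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        # Divide by the number of sequences so that gaps lower the score.
        scores.append(top_count / len(column))
    return scores


def conserved_regions(scores, threshold=1.0, min_length=3):
    """Return half-open (start, end) index pairs for runs of columns whose
    score is at least `threshold` and that span at least `min_length` columns."""
    regions = []
    start = None
    for i, score in enumerate(scores):
        if score >= threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_length:
                regions.append((start, i))
            start = None
    if start is not None and len(scores) - start >= min_length:
        regions.append((start, len(scores)))
    return regions


# Toy alignment with made-up sequences (hypothetical, not real data).
toy_alignment = [
    "ATGCCGTA-GGA",
    "ATGCAGTATGGA",
    "ATGCCGTTTGGA",
]

scores = column_conservation(toy_alignment)
print(conserved_regions(scores))  # prints [(0, 4), (9, 12)]
```

The per-column score here is deliberately simple (fraction of the majority character). Entropy-based or substitution-matrix-based scores are common alternatives, and writing up why you chose one over the other makes for a good README section.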
For public datasets suitable for practice, stick to well-curated sources that are easy to access and come with metadata. NCBI's Gene Expression Omnibus (GEO) is excellent for RNA-seq and microarray data; try datasets like GSE60450 for beginner transcriptomics. The TCGA database offers cancer genomics data, but start small with subsets, for example via UCSC Xena. For sequence data, use NCBI's SRA (Sequence Read Archive) for raw reads or Ensembl for genomes. Kaggle also has beginner-friendly biology datasets, such as gene expression matrices. Also, check GitHub repos like JEFworks/public-bioinformatics-datasets for integrated omics data that is publicly available and well-documented.
Regarding tutorials, workflows, or repositories to help build these projects, the Galaxy platform has excellent guided workflows for sequence analysis and RNA-seq. Their tutorials are interactive and don't require heavy coding at first. For Python/R focused learning, Rosalind.info offers problem-based tutorials on sequence manipulation and algorithms, which you can solve and add to your portfolio. The Bioinformatics Workbook (bioinformaticsworkbook.org) provides step-by-step guides for projects like variant calling, with code examples for tools like samtools or Biopython. Harvard's learning-bioinformatics-at-home GitHub repo is a goldmine for self-paced resources, including scripts and datasets for beginners. Bioconductor vignettes (bioconductor.org) are great for R-based transcriptomics, with updated examples using DESeq2. Finally, edu.t-bio.info has courses and project templates that integrate command-line with Python/R.
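As an illustration of the kind of small, self-contained problem that Rosalind poses, here is a sketch of two classic sequence-manipulation tasks: GC content and reverse complement. The function names are my own, and this is just one straightforward way to solve them in plain Python. Biopython offers equivalent functionality if you prefer not to write it yourself.

```python
# Translation table mapping each DNA base to its complement (both cases).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def gc_content(seq):
    """Return the fraction of G and C bases in a DNA sequence (0.0 to 1.0)."""
    if not seq:
        raise ValueError("sequence must not be empty")
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


print(gc_content("ATGC"))                # prints 0.5
print(reverse_complement("AAAACCCGGT"))  # prints ACCGGGTTTT
```

Collecting a handful of such solutions in one repository, each with a short explanation and a test, is an easy way to show clean scripting habits.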
Document your projects on GitHub, including code, READMEs with methods, and results visualizations—this will make them demonstrable for job applications. If you run into issues, feel free to follow up with specifics.
Kevin
Prior thread that may be of interest:
Beginner level projects for bioinformatics.