Building a Semantic Course Search Engine Using Weighted BERT Embeddings
In this project, we build a production-style semantic search system for online course content.
The system lets users search for relevant course sections with natural-language queries, retrieving results by semantic meaning rather than keyword matching, powered by weighted BERT embeddings and a vector database.
We explore multiple embedding strategies and data granularities, evaluate their effectiveness, and deploy the best-performing approach as an interactive web application.
🔗 Project Links
- GitHub Repository: https://github.com/seyyednavid/Semantic-course-search
- Live Application / Demo: https://semantic-course-search6264.streamlit.app/
Table of Contents
- 00. Project Overview
- 01. Data Overview
- 02. Semantic Search Overview
- 03. Embedding Strategies
- 04. Vector Database Design
- 05. Weighted Semantic Querying
- 06. Application & Examples
- 07. Growth & Next Steps
00. Project Overview
Context
Online learning platforms contain large volumes of educational content, often organised into courses and fine-grained sections.
Traditional keyword-based search systems struggle to surface relevant content when users phrase questions differently from the original text.
Examples of such queries include:
- “AI applications for business success”
- “Regression in Python”
- “Data science course”
Answering these accurately requires understanding semantic intent, not just matching words.
Actions
We built a semantic search pipeline that:
- Embeds course content using modern sentence-transformer models
- Stores embeddings in a vector database (Pinecone)
- Retrieves the most semantically similar content using cosine similarity
- Explores different data granularities (course-level vs section-level)
- Evaluates lightweight and transformer-based embedding models
- Deploys the best-performing approach as a Streamlit application
Results
The final system:
- Consistently retrieves highly relevant course sections
- Outperforms keyword search for conceptual queries
- Demonstrates clear improvements when moving from course-level to section-level indexing
- Achieves the best results using weighted BERT embeddings
The deployed application allows users to interactively explore these results in real time.
Growth / Next Steps
Potential future improvements include:
- Quantitative evaluation metrics (Precision@K, Recall@K)
- Query latency benchmarking across models
- Model selection within the UI
- User feedback loops to refine relevance
01. Data Overview
Two datasets were used in this project:
- Course-level descriptions: one row per course, containing high-level summaries.
- Section-level descriptions: one row per course section, containing fine-grained instructional content.
The section-level dataset enables more precise retrieval, as semantic similarity is computed over smaller, more focused units of text.
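As a minimal illustration of the two granularities (the course title, section names, and fields below are hypothetical, not taken from the actual datasets):

```python
# Course-level: one row per course, high-level summary only.
courses = [
    {
        "course": "Data Science Fundamentals",
        "description": "An end-to-end introduction to data science with Python.",
    },
]

# Section-level: one row per course section, fine-grained instructional text.
sections = [
    {
        "course": "Data Science Fundamentals",
        "section": "Linear Regression",
        "description": "Fitting and interpreting linear regression models in Python.",
    },
    {
        "course": "Data Science Fundamentals",
        "section": "Model Evaluation",
        "description": "Train/test splits, cross-validation, and error metrics.",
    },
]

# Each section row is a smaller, more focused unit of text, so a query like
# "Regression in Python" can match a single section rather than a whole course.
```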
02. Semantic Search Overview
Semantic search differs from keyword search by representing text as dense vectors that capture meaning.
With vector-based search:
- Text is embedded into a numeric vector space
- Similar meanings map to nearby vectors
- Queries retrieve results using similarity metrics such as cosine similarity
This allows the system to retrieve relevant content even when the query wording differs from the source text.
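The similarity computation at the heart of this can be sketched with toy vectors (in the real system the embeddings come from a sentence-transformer model, not hand-written 3-d arrays):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors (1.0 = same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-d "embeddings" standing in for real model outputs.
query = np.array([0.9, 0.1, 0.0])
doc_a = np.array([0.8, 0.2, 0.1])  # semantically close to the query
doc_b = np.array([0.0, 0.1, 0.9])  # unrelated content

# The nearer vector scores higher, even with zero word overlap in the text.
assert cosine_similarity(query, doc_a) > cosine_similarity(query, doc_b)
```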
03. Embedding Strategies
We evaluated four embedding strategies:
- Course-level MiniLM embeddings
- Section-level MiniLM embeddings
- Section-level BERT embeddings (unweighted)
- Section-level BERT embeddings (weighted)
Moving from course-level to section-level indexing significantly improved precision.
Using a stronger transformer model further improved semantic understanding.
04. Vector Database Design
All embeddings are stored in Pinecone, a managed vector database optimised for similarity search.
Key design choices:
- Cosine similarity as the distance metric
- Section-level documents as the primary retrieval unit
- Metadata storage for course name, section name, and descriptions
This design allows fast and scalable semantic retrieval.
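The same design choices can be illustrated with a minimal in-memory stand-in (this is not the Pinecone client; class and field names here are invented for the sketch): cosine similarity as the metric, sections as the retrieval unit, and metadata stored alongside each vector.

```python
import numpy as np

class ToyVectorIndex:
    """In-memory stand-in for a Pinecone-style index using cosine similarity."""

    def __init__(self):
        self.ids, self.vectors, self.metadata = [], [], []

    def upsert(self, item_id: str, vector: np.ndarray, meta: dict) -> None:
        # Store unit-normalised vectors so cosine similarity reduces to a dot product.
        self.ids.append(item_id)
        self.vectors.append(vector / np.linalg.norm(vector))
        self.metadata.append(meta)

    def query(self, vector: np.ndarray, top_k: int = 3):
        q = vector / np.linalg.norm(vector)
        scores = np.stack(self.vectors) @ q
        order = np.argsort(scores)[::-1][:top_k]
        return [(self.ids[i], float(scores[i]), self.metadata[i]) for i in order]

# Sections are the retrieval unit; course/section names travel as metadata.
index = ToyVectorIndex()
index.upsert("s1", np.array([0.9, 0.1]), {"course": "ML 101", "section": "Regression"})
index.upsert("s2", np.array([0.1, 0.9]), {"course": "ML 101", "section": "Clustering"})
hits = index.query(np.array([1.0, 0.0]), top_k=1)
```

Returning metadata with each hit is what lets the UI show course and section names without a second lookup.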
05. Weighted Semantic Querying
The final approach introduces weighted semantic query embeddings.
This strategy is the key improvement over standard semantic search and is the method used in the deployed application.
Instead of embedding the user query once, we:
- Encode the raw user query
- Encode a contextualised version of the query
- Combine both embeddings using weighted averaging
This reinforces the core semantic intent while preserving contextual relevance, and proved especially effective for short, ambiguous, or high-level queries.
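The combination step above can be sketched as a weighted average of the two embeddings (the 0.7/0.3 split and the helper name are illustrative assumptions, not the repository's actual values; the input vectors stand in for BERT outputs):

```python
import numpy as np

def weighted_query_embedding(raw_vec: np.ndarray,
                             ctx_vec: np.ndarray,
                             raw_weight: float = 0.7) -> np.ndarray:
    """Weighted average of the raw-query and contextualised-query embeddings,
    re-normalised so the result is ready for cosine-similarity search."""
    combined = raw_weight * raw_vec + (1.0 - raw_weight) * ctx_vec
    return combined / np.linalg.norm(combined)

# Toy unit vectors standing in for BERT embeddings.
raw = np.array([1.0, 0.0, 0.0])  # embedding of the user query as typed
ctx = np.array([0.6, 0.8, 0.0])  # embedding of a contextualised rephrasing
q = weighted_query_embedding(raw, ctx)
```

Weighting the raw query more heavily keeps the result anchored to what the user actually typed, while the contextualised component nudges it toward the course domain.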
06. Application & Examples
The final system is deployed as a Streamlit web application.
Users can:
- Enter natural-language queries
- Retrieve the most relevant course sections
- Inspect similarity scores
- Expand detailed section descriptions
Example queries include:
- “technical analysis indicators”
- “support and resistance levels”
- “momentum oscillators explained”
The system consistently retrieves conceptually relevant sections, even when keyword overlap is minimal.
07. Growth & Next Steps
Future enhancements may include:
- Adding user feedback to re-rank results
- Hybrid search combining keyword and vector similarity
- Incremental re-indexing pipelines
- Advanced UI filtering and analytics
This project demonstrates how semantic search and vector databases can be combined to build practical, user-facing information retrieval systems.