Building a Semantic Course Search Engine Using Weighted BERT Embeddings
In this project, we build a production-style semantic search system for online course content.
The system lets users search for relevant course sections with natural-language queries, retrieving results by semantic meaning rather than keyword matching, powered by weighted BERT embeddings and a vector database.
We explore multiple embedding strategies and data granularities, evaluate their effectiveness, and deploy the best-performing approach as an interactive web application.
🔗 Project Links
- GitHub Repository: https://github.com/seyyednavid/Semantic-course-search
- Live Application / Demo: https://semantic-course-search6264.streamlit.app/
Table of Contents
- 00. Project Overview
- 01. Data Overview
- 02. Semantic Search Overview
- 03. Embedding Strategies
- 04. Vector Database Design
- 05. Weighted Semantic Querying
- 06. Application & Examples
- 07. Growth & Next Steps
00. Project Overview
Context
Online learning platforms contain large volumes of educational content, often organised into courses and fine-grained sections.
Traditional keyword-based search systems struggle to surface relevant content when users phrase questions differently from the original text.
Examples of such queries include:
- “AI applications for business success”
- “Regression in Python”
- “Data science course”
Answering these accurately requires understanding semantic intent, not just matching words.
Actions
We built a semantic search pipeline that:
- Embeds course content using modern sentence-transformer models
- Stores embeddings in a vector database (Pinecone)
- Retrieves the most semantically similar content using cosine similarity
- Explores different data granularities (course-level vs section-level)
- Evaluates lightweight and transformer-based embedding models
- Deploys the best-performing approach as a Streamlit application
Results
The final system:
- Consistently retrieves highly relevant course sections
- Outperforms keyword search for conceptual queries
- Demonstrates clear improvements when moving from course-level to section-level indexing
- Achieves the best results using weighted BERT embeddings
The deployed application allows users to interactively explore these results in real time.
Growth / Next Steps
Potential future improvements include:
- Quantitative evaluation metrics (Precision@K, Recall@K)
- Query latency benchmarking across models
- Model selection within the UI
- User feedback loops to refine relevance
01. Data Overview
Two datasets were used in this project:
- Course-level descriptions: one row per course, containing high-level summaries.
- Section-level descriptions: one row per course section, containing fine-grained instructional content.
The section-level dataset enables more precise retrieval, as semantic similarity is computed over smaller, more focused units of text.
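As a minimal illustration of the two granularities (the course title, section names, and fields below are hypothetical, not taken from the actual datasets):

```python
# Course-level: one row per course, high-level summary only.
courses = [
    {
        "course": "Data Science Fundamentals",
        "description": "An end-to-end introduction to data science with Python.",
    },
]

# Section-level: one row per course section, fine-grained instructional text.
sections = [
    {
        "course": "Data Science Fundamentals",
        "section": "Linear Regression",
        "description": "Fitting and interpreting linear regression models in Python.",
    },
    {
        "course": "Data Science Fundamentals",
        "section": "Model Evaluation",
        "description": "Train/test splits, cross-validation, and error metrics.",
    },
]

# Each section row is a smaller, more focused unit of text, so a query like
# "Regression in Python" can match a single section rather than a whole course.
```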
02. Semantic Search Overview
Semantic search differs from keyword search by representing text as dense vectors that capture meaning.
With vector-based search:
- Text is embedded into a numeric vector space
- Similar meanings map to nearby vectors
- Queries retrieve results using similarity metrics such as cosine similarity
This allows the system to retrieve relevant content even when the query wording differs from the source text.
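The similarity computation at the heart of this can be sketched with toy vectors (in the real system the embeddings come from a sentence-transformer model, not hand-written 3-d arrays):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors (1.0 = same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-d "embeddings" standing in for real model outputs.
query = np.array([0.9, 0.1, 0.0])
doc_a = np.array([0.8, 0.2, 0.1])  # semantically close to the query
doc_b = np.array([0.0, 0.1, 0.9])  # unrelated content

# The nearer vector scores higher, even with zero word overlap in the text.
assert cosine_similarity(query, doc_a) > cosine_similarity(query, doc_b)
```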
03. Embedding Strategies
We evaluated four embedding strategies:
- Course-level MiniLM embeddings
- Section-level MiniLM embeddings
- Section-level BERT embeddings (unweighted)
- Section-level BERT embeddings (weighted)
Moving from course-level to section-level indexing significantly improved precision.
Using a stronger transformer model further improved semantic understanding.
04. Vector Database Design
All embeddings are stored in Pinecone, a managed vector database optimised for similarity search.
Key design choices:
- Cosine similarity as the distance metric
- Section-level documents as the primary retrieval unit
- Metadata storage for course name, section name, and descriptions
This design allows fast and scalable semantic retrieval.
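The same design choices can be illustrated with a minimal in-memory stand-in (this is not the Pinecone client; class and field names here are invented for the sketch): cosine similarity as the metric, sections as the retrieval unit, and metadata stored alongside each vector.

```python
import numpy as np

class ToyVectorIndex:
    """In-memory stand-in for a Pinecone-style index using cosine similarity."""

    def __init__(self):
        self.ids, self.vectors, self.metadata = [], [], []

    def upsert(self, item_id: str, vector: np.ndarray, meta: dict) -> None:
        # Store unit-normalised vectors so cosine similarity reduces to a dot product.
        self.ids.append(item_id)
        self.vectors.append(vector / np.linalg.norm(vector))
        self.metadata.append(meta)

    def query(self, vector: np.ndarray, top_k: int = 3):
        q = vector / np.linalg.norm(vector)
        scores = np.stack(self.vectors) @ q
        order = np.argsort(scores)[::-1][:top_k]
        return [(self.ids[i], float(scores[i]), self.metadata[i]) for i in order]

# Sections are the retrieval unit; course/section names travel as metadata.
index = ToyVectorIndex()
index.upsert("s1", np.array([0.9, 0.1]), {"course": "ML 101", "section": "Regression"})
index.upsert("s2", np.array([0.1, 0.9]), {"course": "ML 101", "section": "Clustering"})
hits = index.query(np.array([1.0, 0.0]), top_k=1)
```

Returning metadata with each hit is what lets the UI show course and section names without a second lookup.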
05. Weighted Semantic Querying
The final approach introduces weighted semantic query embeddings.
This strategy is the key improvement over standard semantic search and is the method used in the deployed application.
Instead of embedding the user query once, we:
- Encode the raw user query
- Encode a contextualised version of the query
- Combine both embeddings using weighted averaging
This reinforces the core semantic intent while preserving contextual relevance, and proved especially effective for short, ambiguous, or high-level queries.
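The combination step above can be sketched as a weighted average of the two embeddings (the 0.7/0.3 split and the helper name are illustrative assumptions, not the repository's actual values; the input vectors stand in for BERT outputs):

```python
import numpy as np

def weighted_query_embedding(raw_vec: np.ndarray,
                             ctx_vec: np.ndarray,
                             raw_weight: float = 0.7) -> np.ndarray:
    """Weighted average of the raw-query and contextualised-query embeddings,
    re-normalised so the result is ready for cosine-similarity search."""
    combined = raw_weight * raw_vec + (1.0 - raw_weight) * ctx_vec
    return combined / np.linalg.norm(combined)

# Toy unit vectors standing in for BERT embeddings.
raw = np.array([1.0, 0.0, 0.0])  # embedding of the user query as typed
ctx = np.array([0.6, 0.8, 0.0])  # embedding of a contextualised rephrasing
q = weighted_query_embedding(raw, ctx)
```

Weighting the raw query more heavily keeps the result anchored to what the user actually typed, while the contextualised component nudges it toward the course domain.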
06. Application & Examples
The final system is deployed as a Streamlit web application.
Users can:
- Enter natural-language queries
- Retrieve the most relevant course sections
- Inspect similarity scores
- Expand detailed section descriptions
Example queries include:
- “technical analysis indicators”
- “support and resistance levels”
- “momentum oscillators explained”
The system consistently retrieves conceptually relevant sections, even when keyword overlap is minimal.
07. Growth & Next Steps
Future enhancements may include:
- Adding user feedback to re-rank results
- Hybrid search combining keyword and vector similarity
- Incremental re-indexing pipelines
- Advanced UI filtering and analytics
This project demonstrates how semantic search and vector databases can be combined to build practical, user-facing information retrieval systems.