About
- Semantic search over all 450+ My First Million (MFM) podcast episodes.
- Can be used for any YouTube playlist. I chose a podcast for this hackathon as it is something most people are familiar with.
- Can combine multiple channels, parts of channels, or just an assortment of videos of your choice.
Use Cases
- Podcasts: The upper limit, while maintaining the current query/answer speed, is roughly 8x the size of this project. It can theoretically handle much larger podcasts such as JRE.
- Education: Many people create playlists for learning new topics, picking and choosing from a variety of channels. Educational courses are data-rich, which makes them well suited to this technology.
- Customizability: Flexible index creation. Need not be one channel or one category. Mix and match to your liking.
- Advertising / Business: Data-driven insights platform for businesses, advertisers, and the creators themselves.
- Better user experience: Users can find content more quickly. As video content continues to grow in popularity, YouTube is increasingly an educational and news platform for people around the world, not just a source of entertainment.
Tech Stack
This project uses basic Python scripts, a vector database, and semantic k-nearest-neighbor (KNN) search.
- YouTube V3 API - fetches and processes videos from YouTube to build the transcript backend that powers semantic search.
- Pinecone.io - vector DB backend that stores video transcript data and powers semantic search for the frontend.
- OpenAI's text-embedding-ada-002 - used in conjunction with the vector DB. Gives the client more tools beyond basic keyword search. Read more on the k-nearest-neighbor (KNN) algorithm.
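To make the KNN idea concrete, here is a minimal sketch of a brute-force nearest-neighbor lookup using cosine similarity. It is an illustration only, not this project's code: in the real stack Pinecone performs this search, and the names `cosine_similarity`, `top_k`, and `toy_index`, as well as the tiny 3-dimensional vectors, are assumptions made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_k(query, index, k=3):
    """Return the k (key, score) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query, vec), key) for key, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [(key, score) for score, key in scored[:k]]


# Hypothetical toy index: real embeddings would have far more dimensions.
toy_index = {
    "ep1-chunk0": [1.0, 0.0, 0.0],
    "ep1-chunk1": [0.7, 0.7, 0.0],
    "ep2-chunk0": [0.0, 0.0, 1.0],
}

results = top_k([1.0, 0.1, 0.0], toy_index, k=2)
for key, score in results:
    print(key, round(score, 3))
```

A vector database replaces the linear scan above with an index built for large collections, but the ranking idea is the same: embed the query, then return the stored chunks whose vectors point in the most similar direction.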
Videos are transcribed using some hacky Python scripts, combined with their associated metadata, and pre-processed (cleaned). The transcripts are chunked by tokens, converted to text embeddings with 1536 dimensions (~1.5k), and stored in the vector database, which is queried using cosine similarity search.
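The chunking step can be sketched roughly as follows. This is a simplified stand-in, not the project's actual script: it splits on whitespace instead of using a real tokenizer, and the function name `chunk_transcript`, the `chunk_size` and `overlap` values, and the record fields are all hypothetical.

```python
def chunk_transcript(text, video_id, chunk_size=200, overlap=20):
    """Split a transcript into overlapping chunks of roughly chunk_size tokens.

    Whitespace-separated words stand in for tokens here (an assumption);
    a production pipeline would count tokens with the embedding model's tokenizer.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")

    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append({
            "id": f"{video_id}-{len(chunks)}",  # hypothetical ID scheme
            "video_id": video_id,               # metadata kept alongside the text
            "text": " ".join(window),
        })
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the transcript
    return chunks


# Tiny example: 10 "tokens", windows of 4 with 1 token of overlap.
sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_transcript(sample, "abc123", chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk["id"], "->", chunk["text"])
```

Each chunk's text would then be sent to the embedding model, and the resulting vector stored together with the chunk's metadata. The overlap keeps sentences that straddle a chunk boundary searchable from either side.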
See the full breakdown here
Next Steps & Feedback
Some of my plans to improve this project:
- Providing a selection of podcasts upfront to the user, with categories and analytics. Focusing on podcasts as a niche, then expanding later to all forms of YouTube content.
- Allowing user-uploaded analytics and embeddings. Allowing users to monetize their work and create a thriving marketplace but, more importantly, a data-driven insights platform for businesses, advertisers, creators themselves, and consumers looking for content.
- Technical improvements: Moving away from the YouTube V3 API towards a faster transcribing solution. Whisper is good but expensive, and pytube and other Python packages are probably going to be used once the amount of video content exceeds a certain storage capacity.
- Adding visual elements to the search experience (e.g. thumbnail generation specific to the exact timestamp) using Puppeteer or some other solution.
Feel free to send me feedback at me@vd7.io
Notice & License
- Support my open source work by sponsoring me before my API costs explode.
- Independently created. Not affiliated with MFM Podcast. Not affiliated with YouTube nor any of the companies mentioned above.