Building a HN jobs chatbot. Part 1.
9/13/2024
Hacker News’s comment-based format – an unstructured list of information presented conversationally – is a great resource, but it’s a lot to get through. Consequently, a number of alternative UIs have been developed for it, and this is my attempt.
I’ve been looking for a small, end-to-end problem that I could apply my AI skills to (data capture, RAG, prompting, UI), and this seemed like a good fit. Since Large Language Models thrive on unstructured conversational data, we will be building an LLM system that parses and retrieves job descriptions, matching them to requirements provided in plain English by the user. I’m writing this as I go, so I’ll build out additional features over time, and the front end will be developed once the data has been acquired.
It would be nice if the AI could also provide the user with the ability to play out various scenarios, or point out potential ‘near misses’ by evaluating the tail end of retrieved searches: perhaps a user doesn’t want to commute more than 50 miles, and would otherwise miss an opportunity at 51.
- Since each month’s page is live (and receives new job listings) for the whole month, the LLM will need to run periodically.
- We will want to parse the information into a retrieval system, determine the optimal RAG strategy, and create an appropriate retrieval agent for the user to interact with.
- We will need to determine an appropriate hosting strategy, if we want to make this publicly available.
But first, we need to gather and look at the data.
1. Scraping the page with Selenium
Using Selenium, we grab the latest HN ‘Who is hiring?’ page. Simple enough, though installing Chrome web drivers was a bit tricky. Since Hacker News’s page format creates overflow pages as new messages come in, we need to iterate over the page numbers to get all the data.
The notebook is here.
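The pagination loop itself is simple. As a sketch of the idea (HN marks its ‘More’ link with the `morelink` class; the function names here are my own), the next-page URL can be pulled from each page’s HTML with nothing but the standard library:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class MoreLinkFinder(HTMLParser):
    """Find HN's 'More' link (class="morelink"), which points at the next overflow page."""

    def __init__(self):
        super().__init__()
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "morelink":
            self.next_href = attrs.get("href")


def next_page_url(page_html: str, base_url: str):
    """Return the absolute URL of the next page, or None on the last page."""
    finder = MoreLinkFinder()
    finder.feed(page_html)
    return urljoin(base_url, finder.next_href) if finder.next_href else None
```

The scraper then just loops: grab a page with Selenium, collect the comments, and follow `next_page_url` until it returns `None`.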
2. Exploration & Curation
The first step is to look at the data we’ve grabbed. Some things stand out immediately:
- There is low value data to remove. Not all the comments are job postings – some are questions and banter.
- There is a format for postings that appears to have organically developed – it’s not in the posting directions. However, it’s not always followed, and can’t be relied upon.
- Related to that: the format is a header line listing primary details divided by pipes (i.e. “|”). This doesn’t strike me as a common convention of the language – could it confuse the wrong prompts?
- Job specifications contain some relatively universal features, such as company, location, job title, and compensation, but also have a long tail of less common requirements that we will need to surface for the user.
2.1 Handling low value data
The comments are sporadic, and relatively low-value: I concluded that they could be removed from the dataset without harming the information provided to the user too much. If long conversations were happening under many posts, the story would be different, and we would want to extract the additional information from these posts.
That said, there’s still some value in these conversations. My intention for the final product is to do a screen capture of each job description, which the AI can present to the user as part of the results.
Removing non-job-listing comments is relatively easy: they don’t look at all like job descriptions, so we can ask an LLM to mark or remove them as we parse the data, with relative confidence. To be certain, given the size of the dataset, it will not be hard to capture all the removed comments and verify how well this has worked.
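The LLM does the real classification, but it’s cheap to add a heuristic pre-check and to capture everything that gets dropped, so the filtering can be verified. A minimal sketch – the signal words and the length threshold are guesses, not tuned values:

```python
def looks_like_job_posting(comment: str) -> bool:
    """Cheap heuristic pre-check (thresholds are illustrative guesses):
    real listings tend to be long and mention hiring-related terms,
    or use the organically-evolved pipe-separated header."""
    text = comment.lower()
    signals = ("hiring", "remote", "onsite", "salary", "apply", "|")
    return len(comment) > 200 and any(s in text for s in signals)


def partition_comments(comments):
    """Split comments into (keep, review) piles; everything filtered out
    is captured in `review` so we can verify how well this worked."""
    keep, review = [], []
    for c in comments:
        (keep if looks_like_job_posting(c) else review).append(c)
    return keep, review
```

In the real pipeline, the `review` pile (and borderline `keep` entries) would go to the LLM for the actual mark-or-remove decision.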
2.2 Handling various formats
During experimentation, I tried removing (presumably) extraneous formatting, and received different results. Once I’ve generated evals, I’ll be in a place to experiment with the effects of text parsing and few-shot examples on output quality.
2.3 Developing a data schema for long tail data
My initial, unconsidered, assumption was that I would provide an appropriate schema for the data, and have an LLM parse the listings into that schema. However, browsing the job descriptions made it clear that I’d do a lot better by having the data dictate the schema for me. So, part of the parsing phase will involve generating keys that can be used to craft an appropriate schema for our data.
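One way to let the data dictate the schema: have the LLM emit free-form key/value pairs per listing, then rank the keys by how many listings they appear in. A sketch, assuming the parse step returns one dict per listing (the `min_frac` threshold is illustrative, not tuned):

```python
from collections import Counter


def derive_schema(parsed_listings, min_frac=0.25):
    """Given per-listing dicts of LLM-extracted keys (free-form), keep keys
    appearing in at least `min_frac` of listings as core schema fields;
    everything else becomes a long-tail attribute."""
    counts = Counter(k for listing in parsed_listings for k in listing)
    n = len(parsed_listings)
    core = sorted(k for k, c in counts.items() if c / n >= min_frac)
    tail = sorted(set(counts) - set(core))
    return core, tail
```

In practice you’d also want the LLM to normalise near-duplicate keys (“comp”, “compensation”, “salary”) before counting.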
Ok, that’s as far as I’ve gotten at this point. I’ll post my next steps – building out a data pipeline and setting up the RAG system – shortly.
tl;dr: A discussion of the pros and cons of storing embedding data with the data it represents vs. in an external vector database. I’ll be following this post up shortly with a walkthrough of the boilerplate required to create and search embeddings in SQLite and Postgres.
I’ve been exploring RAG techniques and embeddings, and as part of that, I’ve been checking out effective embedding generation and storage/retrieval options. As long as I’m able to perform vector searches, I don’t see the value in storing embedding data separately from the relational data it represents.
“Vector Databases: They are single-purpose DBMSs with indexes to accelerate nearest-neighbor search. RM DBMSs should soon provide native support for these data structures and search methods using their extendable type system that will render such specialized databases unnecessary” – Stonebraker & Pavlo, 2024.
Do dedicated vector databases make sense?
Vector storage and search is, at this point, essentially commoditized: going forward, it’s not clear to me (or others) how dedicated vector databases can differentiate themselves from bog-standard relational databases with vector search enabled.
Storing vector embeddings with the data that they represent is convenient and allows for succinct access to the results of vector-based searches. Some people are finding that keeping a separate vector DB in sync can be “painful at best, even for prototype applications.” That said, vector database providers are understandably keen to provide value. But, given that both storing vectors and searching them are solved problems, there doesn’t appear to be much room in which they could make any improvements.
So, even if adding vector storage to your existing database won’t work for you, making a secondary database your least-worst option, there’s no obvious reason not to consider Postgres with pgvector (or pgvectorscale) for that role.
Bonus: LLMs can speak SQL.
Many LLMs compose SQL well, which brings up an interesting possibility: LLM agents that can compose their own vector-based search queries. The use cases where this would make sense might be minimal at the moment, but it’s an interesting avenue nonetheless. I want to play around with that.
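As a toy sketch of what that could look like – the prompt shape, and the `embed()` helper it references, are hypothetical and not tested against any particular model:

```python
def vector_query_prompt(schema_ddl: str, question: str) -> str:
    """Build a prompt asking an LLM to compose a pgvector search query.
    Both the prompt wording and the embed() helper are illustrative
    assumptions, not a tested recipe."""
    return (
        "You write PostgreSQL queries for a database with the pgvector extension.\n"
        f"Schema:\n{schema_ddl}\n"
        "The <=> operator returns the cosine distance between two vectors.\n"
        f"User request: {question}\n"
        "Reply with a single SQL query that embeds the request with the "
        "embed() helper and orders results by ascending distance."
    )
```

The interesting question is whether the model can be trusted to pick sensible filters and limits on its own; that’s what I want to play with.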
The contenders
Currently, I am creating embeddings for a dataset, storing them in SQLite and Postgres, and connecting them to a local LLM, via a couple of extensions. Given that SQLite and Postgres aren’t really competing databases, this isn’t going to be much of a comparison so much as a walkthrough of encoding and retrieving vector embeddings from SQL-compatible databases. The extensions are sqlite-vec and pgvector.
No love for MySQL
I couldn’t find an equivalent for MySQL, so if you are using MySQL, adding a secondary vector database appears to be your only straightforward option.
Of course, this isn’t likely to be the case for long: nearest-neighbour search is a solved problem, it just needs to be implemented for MySQL. There’s also been at least one stab at building one: MySQLvss. At the moment, it doesn’t seem to be maintained – the last commit was six months ago – but perhaps it can provide a starting-off point, should you decide to build your own MySQL nearest-neighbour search.
Fwiw, many cloud providers, including Oracle and Google, are offering vector search functionality as part of their managed MySQL services.
SQLite: sqlite-vec, -lembed & -rembed
sqlite-vec is a new database extension I learned about during the AI Engineer World’s Fair keynote. It allows the storage and retrieval of vector embeddings, and performs vector search over them. It seemed like a nice, shiny, and yet practical addition to my RAG toolset.
Enabling extensions in SQLite can be less than straightforward if you don’t have easy access to the SQLite C API. Enabling extensions via Python’s sqlite3 library requires recompiling Python with the relevant SQLite feature flags enabled, or just using the package from Homebrew, which comes with the feature enabled.
sqlite-vec comes with peer extensions (sqlite-lembed and sqlite-rembed) that allow the execution of embedding models locally, or via a model running on a server or third-party service. pgvector just provides the search function, so we’ll have to create the embeddings before we can use it. It’s a small convenience, but I can see how it would be useful for programmatic generation of embeddings (such as providing an agent semantic search over its interactions with the user and other behaviours).
Additionally, sqlite-vec can be compiled to WASM, so it can power AI running in the browser, or on embedded devices.
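sqlite-vec provides the search natively through its `vec0` virtual table, but since loading the extension depends on how your Python was built, here’s a stdlib-only sketch of the underlying idea: float32 vectors stored as BLOBs (essentially the layout sqlite-vec accepts) and a brute-force cosine scan. The table and function names are mine, for illustration:

```python
import math
import sqlite3
import struct


def pack(vec):
    """Serialize a list of floats as a float32 BLOB."""
    return struct.pack(f"{len(vec)}f", *vec)


def unpack(blob):
    return struct.unpack(f"{len(blob) // 4}f", blob)


def cosine_dist(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1 - dot / norm


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, title TEXT, embedding BLOB)")


def add_job(title, vec):
    conn.execute("INSERT INTO jobs (title, embedding) VALUES (?, ?)", (title, pack(vec)))


def nearest(query_vec, k=3):
    """Brute-force scan over every row: the work a vector index saves you at scale."""
    rows = conn.execute("SELECT title, embedding FROM jobs").fetchall()
    rows.sort(key=lambda r: cosine_dist(query_vec, unpack(r[1])))
    return [title for title, _ in rows[:k]]
```

With the extension loaded, the scan becomes something like `SELECT rowid, distance FROM vec_jobs WHERE embedding MATCH ? ORDER BY distance` against a `vec0` virtual table instead.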
Postgres: pgvector, pgvectorscale
pgvector provides nearest-neighbour vector search for Postgres. As such, I expect it’s the one I’ll be reaching for more often. Additionally, it provides more sophisticated search options than sqlite-vec: alongside distance metrics such as cosine and Euclidean distance, pgvector offers approximate indexes such as HNSW (Hierarchical Navigable Small World) and IVFFlat (Inverted File Flat).
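To make the metric/index distinction concrete: pgvector’s distance operators (`<->` for Euclidean distance, `<=>` for cosine distance, `<#>` for negative inner product) are just these computations, while HNSW and IVFFlat are indexes that avoid scanning every row. In plain Python:

```python
import math


def l2_distance(a, b):
    """What pgvector's <-> operator computes: Euclidean (L2) distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """What pgvector's <=> operator computes: 1 minus cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (na * nb)


def neg_inner_product(a, b):
    """What pgvector's <#> operator computes: negative inner product."""
    return -sum(x * y for x, y in zip(a, b))
```

A query ranks by one of these, e.g. `ORDER BY embedding <=> '[…]' LIMIT 5`, and an index built with the matching operator class (e.g. `CREATE INDEX ON jobs USING hnsw (embedding vector_cosine_ops);`) accelerates it.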
pgvectorscale is a faster iteration on pgvector, intended for really large deployments where pgvector might hit performance or scale limitations. It also addresses scalability for large datasets: distributed indexing and querying, and handling billions of vectors efficiently. As I write this, it’s not clear whether there are circumstances where it makes sense to use plain pgvector at all; hopefully, as I dig into it, that will become clear.
Ok, so, these are the tools I’m currently playing with. Next post, I’m going to get into implementing them and working with them.
If you are a software engineer, no doubt you’ve seen some astounding new model or prompt posted to Twitter, or some shamelessly fraudulent product demo, and felt your blood run cold, your wizardly coding powers draining from your fingers.
The reality, however, is that Full Stack engineers are quite a bit closer to the modern AI engineering role today than they might think.
“In numbers, there’s probably going to be significantly more AI Engineers than there are ML engineers / LLM engineers. One can be quite successful in this role without ever training anything.” – Andrej Karpathy
AI engineering is no longer Deep Learning
The field of AI *used* to be an offshoot of Machine Learning (i.e. AI is what we used to refer to as Deep Learning), and at the bottom end of the stack, that is what it is. And certainly, if you want to build models from scratch, that’s what it takes. As little as a year ago, my feeling was that the way to approach AI was via the one I had taken – fundamentals first: walk through the Fast.ai lesson series, get a working understanding of Machine Learning processes as they relate to Deep Learning, and build on that.
But the systems and abstractions over the Deep Learning layer are now so powerful (and complex) that building and using them takes an entirely different skill set. AI is being absorbed into actual engineering, and becoming an engineering field of its own. And, as such, what used to be the most direct and immediately applicable skills have changed. Moreover, as that has happened, the distance between “Full Stack” engineer and “AI” engineer has been shrinking.
Notice where the line is being drawn here? It’s not between you and AI engineering. And AI is going to continue moving to the right on the line above: if the stack is entirely composed or managed by AI, a “Full Stack” engineer who can’t also implement and manage AI pipelines isn’t going to be all that “Full Stack” any more.
All in all, it’s just another tool in the tool box.
The Faustian bargain we made for a job creating shiny new toys was a never-ending supply of new shiny toys. And AI appears particularly Faustian, in that (for now) there’s a never-ending supply of new shiny AI tools, along with the FOMO to match.
“We have no idea what large language models are going to be good at or bad at over any sense of time… The amount we don’t know because of how quickly this has developed is at an all-time high. That lets us experiment and have a sense of play, and do things and not know how the result is going to be, which is fun.” – Dan Becker @dan_s_becker
I’m working at Say Mosaic as Systems Architect of their flagship product: Smart Home in a Box. This has involved designing their cloud infrastructure, and helping improve their home-grown NLP/NLU AI systems. Today, we’re trending on Product Hunt.
I could get used to this trend of trending.
I say! I’m trending on GitHub.
7/12/2016
I released a repo on GitHub on Sunday. And it’s trending today!
All the way down there in the middle.
Every time I find myself feeling a little silly being buoyed by a small bunch of internet points, I remember all the repos I’ve cloned or required, and where they got me.
Is there a bubble?
1/29/2015
I don’t know – you tell me.
Yesterday, my wife met someone on a pool ride who explained he was working at a startup that displayed pictures of the food in a restaurant, so you would choose which restaurant you would go to depending on which pictures of food you liked. Yes – they are building Tinder for plates of food.
She asked him how they were going to take all the pictures, since most fancy restaurants are going to be switching out their menus once or twice a month.
Fireworks, filmed from a drone.
7/4/2014
Dear Alice and Ryan,
Congratulations on your new fondleslabs, and welcome to the 90% of humanity that now spend all their time staring at their phones. Since I am also too busy staring at my phone to talk to you, I thought that you might appreciate a list of the things I am usually staring at, so you can stare at them too.