Hacker News’s comment-based format – an unstructured list of information presented conversationally – is a great resource, but it’s a lot to get through. Consequently, a number of alternative UIs have been developed for it, and this is my attempt.

I’ve been looking for a small, end-to-end problem that I could apply my AI skills to (data capture, RAG, prompting, UI), and this seemed like a good fit. Since large language models thrive on unstructured conversational data, we will be building an LLM system that parses and retrieves job descriptions, matching them to requirements provided in plain English by the user. I’m writing this as I go, so I’ll build out additional features along the way, and the front end will be developed once the data has been acquired.

It would be nice if the AI could also let the user play out various scenarios, or point out potential ‘near-misses’ by evaluating the tail end of retrieved searches: perhaps a user doesn’t want to commute more than 50 miles, and would otherwise miss an opportunity at 51.

  • Since each month’s page is live (and receives new job listings) for the whole month, the scraping and LLM parsing will need to run periodically.
  • We will want to parse the information into a retrieval system, determine the optimal RAG strategies, and create an appropriate retrieval agent for the user to interact with.
  • We will need to determine an appropriate hosting strategy if we want to make this publicly available.

But first, we need to gather and look at the data.

1. Scraping the page with Selenium

Using Selenium, we grab the latest HN ‘who’s hiring’ page. Simple enough, though installing Chrome web drivers was a bit tricky. Since Hacker News’s page format creates overflow pages as new messages come in, we need to iterate over the page numbers to get all the data.

The notebook is here.
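For flavor, here’s a minimal sketch of the pagination loop. It assumes HN’s current markup – .commtext for comment bodies and an a.morelink anchor whenever another overflow page exists – and uses a placeholder thread id:

```python
# Minimal pagination sketch – the thread id is a placeholder, and the
# selectors assume HN's current markup (.commtext, a.morelink).
from selenium import webdriver
from selenium.webdriver.common.by import By

THREAD_ID = "12345678"  # hypothetical "Who is hiring?" item id

driver = webdriver.Chrome()  # needs a matching chromedriver available
comments, page = [], 1
while True:
    driver.get(f"https://news.ycombinator.com/item?id={THREAD_ID}&p={page}")
    comments.extend(el.text for el in
                    driver.find_elements(By.CSS_SELECTOR, ".commtext"))
    # HN only renders a "More" link when another overflow page exists.
    if not driver.find_elements(By.CSS_SELECTOR, "a.morelink"):
        break
    page += 1
driver.quit()

print(f"captured {len(comments)} comments over {page} pages")
```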

2. Exploration & Curation

The first step is to look at the data we’ve grabbed. Some things stand out immediately:

  • There is low-value data to remove. Not all the comments are job postings – some are questions and banter.
  • There is a format for postings that appears to have developed organically – it’s not in the posting directions. However, it’s not always followed, and can’t be relied upon.
  • Related to that: the de facto format is a header line listing the primary details separated by pipes (“|”), e.g. “Acme Corp | Senior Backend Engineer | REMOTE | $150k+”. This doesn’t strike me as a common convention in natural language – could it confuse an LLM given the wrong prompt?
  • Job specifications contain some relatively universal features, such as company location, job title, and compensation, but also a long tail of less common requirements that we will need to surface for the user.

2.1 Handling low-value data

The non-posting comments are sporadic and relatively low-value: I concluded that they could be removed from the dataset without harming the information provided to the user too much. If long conversations were happening under many posts, the story would be different, and we would want to extract the additional information from those threads.

That said, there’s still some value in these conversations. My intention for the final product is to take a screen capture of each posting, conversation included, which the AI can present to the user as part of the results.

Removing non-job-listing comments is relatively easy: they don’t look at all like job descriptions, so we can ask an LLM to mark or remove them with relative confidence as we parse the data. And given the size of the dataset, it will not be hard to capture all the removed comments and verify how well the filter has worked.
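As a sketch of what that filter might look like – assuming the OpenAI chat API, with the model name and the prompt wording as placeholders rather than final choices:

```python
# Hedged sketch: classify each comment as a job posting or not.
# `comments` is the list of raw comment texts gathered during scraping.
from openai import OpenAI

client = OpenAI()

def is_job_posting(comment: str) -> bool:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[
            {"role": "system",
             "content": "Reply YES if the text is a job posting, otherwise NO."},
            {"role": "user", "content": comment[:4000]},  # cap very long comments
        ],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")

# Keep what we drop, so the filter can be spot-checked afterwards.
postings, removed = [], []
for comment in comments:
    (postings if is_job_posting(comment) else removed).append(comment)
```

Keeping the removed pile around turns the verification step into a quick skim rather than a re-run.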

2.2 Handling various formats

During experimentation, I tried removing (presumably) extraneous formatting and received different results. Once I’ve generated evals, I’ll be in a position to experiment with the effects of text parsing and few-shot examples on output quality.
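For concreteness, here’s the kind of normalization I mean – a sketch only, run on an invented posting; the real experiment waits on the evals:

```python
# Sketch: normalize the pipe-delimited header line before parsing,
# leaving the body untouched. The sample posting is invented.
import re

def normalize_header(text: str) -> str:
    head, sep, rest = text.partition("\n")
    return re.sub(r"\s*\|\s*", " - ", head) + sep + rest

sample = "Acme Corp | Senior Backend Engineer | REMOTE | $150k+\nWe build rockets..."
print(normalize_header(sample))
# Acme Corp - Senior Backend Engineer - REMOTE - $150k+
# We build rockets...
```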

2.3 Developing a data schema for long tail data

My initial, unexamined assumption was that I would provide an appropriate schema for the data and have an LLM parse the listings into that schema. However, browsing the job descriptions made it clear that I’d do a lot better by letting the data dictate the schema for me. So part of the parsing phase will involve generating keys that can be used to craft an appropriate schema for our data.
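One way to do that – a sketch under the same assumptions as above (OpenAI chat API, placeholder model) – is to have the LLM emit free-form key/value pairs per posting, then rank the keys by frequency to separate the universal fields from the long tail:

```python
# Hedged sketch of data-driven schema discovery: extract free-form fields,
# then count which keys recur. Model, prompt, and key naming are assumptions.
# `postings` is the filtered list of job-posting comments from earlier.
import json
from collections import Counter
from openai import OpenAI

client = OpenAI()

def extract_fields(posting: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        response_format={"type": "json_object"},  # force valid JSON output
        messages=[
            {"role": "system",
             "content": "Extract every attribute of this job posting as a flat "
                        "JSON object with lowercase snake_case keys."},
            {"role": "user", "content": posting},
        ],
    )
    return json.loads(resp.choices[0].message.content)

key_counts = Counter(k for p in postings for k in extract_fields(p))
print(key_counts.most_common(20))  # head of the distribution -> core schema
# Rare keys (the long tail) can live in a catch-all "extras" field.
```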

OK, that’s as far as I’ve gotten at this point. I’ll post my next steps – building out the data pipeline and setting up the RAG system – shortly.