← All posts
Nyx Iskandar, Founding Engineer

Finding a needle in a haystack in a barn on a farm...

A founding engineer's take on what Foam is solving

How did we get here?

A good engineer understands the problem they're solving. That's not a hot take.

Accordingly, during my first week at Foam, I attempted to do just that. Though if we're being really honest here, I started way before my first week — right after our founder Perla reached out to me for a chat.

I did that in the way I knew how: browsing the company website, clicking through the company LinkedIn, checking out similar companies, and (nerd warning) reading and re-reading research papers (this is important for later).

In the 24 hours between the text and the chat, I made a whole document of my findings and — evidently, since I'm now a founding engineer at Foam — was, at a high level, interested in the problem space. If you're unaware of what that is, TLDR: root cause analysis for production code.

What served as a catalyst to this whole essay, however, was my on-site, specifically when Perla demo-ed Foam. At some point during that demo, the phrase "needle in a haystack" came up, and I began to better understand the problem Foam is solving.

For the uninitiated: by "needle" we mean the root cause of an error, and by "haystack" we mean everything your project involves, including code, databases, traces, logs, etc. The reader should also know that, in this whole analogy, Foam is the "magnet".

This observation-cum-joke earned me a high-five from Perla during the demo. I was probably hired solely because of that (jk, maybe, I'll never know).

Fast-forward to my first week at Foam: this analogy stuck with me as I read through the code, the READMEs, and the Cursor summaries of all of that.

It truly is a helpful framing, and my brain began to hit its cache of research papers I had read in the past month. You know, to suggest something of substance for the team (and myself, of course) to work on. Onward and upward always, right?

In my reading and re-reading of papers, especially one particular paper, I began to feel an itch. A familiar mental itch that happens when I have a big-enough question I need to answer, and know that I can answer, just not at the moment.

"Are we really solving a needle-in-a-haystack problem?"

Into the rabbit hole

A good engineer reduces a complex problem into a simpler one. To make a model of the world. To find beauty in NP-complete problems. To suppose that pi=e=3 (no).

I tried my best, I really did, but I just couldn't reduce Foam's problem into needle-in-a-haystack (NIAH for short, from this point onwards).

I had had my eyes on the particular paper that started the itch for a while. It resurfaced as I studied Foam for my interviews, and resurfaced again during my first week. The paper was Recursive Language Models (RLM). The reader may recognize it, since it made it to YouTube.

In that paper, specifically Section 3 on Scaling Long Context Tasks, NIAH was mentioned. Not only was it mentioned, it was dismissed as a trivial task.

In other words, the authors implied that NIAH problems, especially the single-needle variant (i.e. one explicit thing to find), are the easiest class of search problems for LLMs, citing evidence from the RULER benchmark that frontier models can solve these problems reliably even when context is exorbitantly large (>1,000,000 tokens).

"So… can engineers at Foam just… pack up and go home?"

Well, we're still here, aren't we? Clearly the answer is no, and from our evaluations, frontier models are not (yet) capable of reliably solving the problems we at Foam want them to solve.

"Is it a skill issue on our part?"

To answer that, let's understand what NIAH problems really are.

What is a needle-in-a-haystack problem?

Reading benchmarks that are classified as NIAH (e.g. RULER, BrowseComp-Plus), you'll see the word "gold" a lot. More than you would if you had read other ML papers. Trust.

Anyway, the common thread shared by NIAH problems is gold. And by gold, they mean a verifiable correct answer to a question. And by verifiable correct answer, they mean an exact match or a direct implication (read on for a clarifying example).

And this gold is constant-length; it does not grow as prompt length grows.

In more difficult benchmarks — and realistic scenarios — there are documents that are not gold, instead serving as support (a.k.a. evidence) or distractors (a.k.a. negatives).

This calls for an example.

Consider the question, "who is the oldest currently-living engineering professor who has taught at UC Berkeley?"

In this scenario, documents include the Wikipedia page listing the names of all past and present Berkeley professors, personal websites of past and present Berkeley faculty, LinkedIn profiles lying about faculty experience at Berkeley, etc. For the sake of clarity, this document list is non-exhaustive.

I leave the reasoning for why I've classified these documents as gold, evidence, and negative, respectively, as an exercise to the reader. Spoilers below:

Wikipedia page: you can read and thus isolate the name of interest (e.g. "John Doe") from this document. This is a case for an exact match; a case for a direct implication would be if this Wikipedia page lists, say, the .edu emails of all the professors, and thus you can easily do a 1:1 mapping between email and name to get your answer.

Real personal websites: you can find supporting evidence from these documents (e.g. birth year).

Fake LinkedIns: you can be distracted by these documents.

Zooming out of the example, there are papers that assert that longer needles lead to higher accuracies. Why did I bring this up? Because this tells me that NIAH problems are fundamentally about finding — isolating — some answer enclosed in a set of documents conditioned on a particular question; the longer that answer is, the easier it is to find, because the larger the needle, the more noticeable it is, and at the limit, no magnet is necessary.

So in short, a NIAH problem is a problem in which there is a verifiable answer unambiguously contained within a finite set of documents. It is a search problem interested in answering the question of "what?"

What is NOT a needle-in-a-haystack problem?

In the spirit of contrastive learning, let's explore problems that are not classified as NIAH.

In the very same Section 3, the authors introduce long reasoning benchmarks. Because of that, I will focus my discussion on this type of non-NIAH problem.

Long reasoning benchmarks are more complex than NIAH benchmarks for LLMs. We know this because models evidently perform noticeably worse on long reasoning tasks, even when prompt lengths are far shorter than those of NIAH tasks.

Chart showing model performance on long reasoning tasks vs NIAH tasks

According to the RLM paper, long reasoning benchmarks like OOLONG require "transforming chunks of the input semantically" and "using nearly all entries of the dataset, and therefore scales linearly in processing complexity relative to input prompt length". In other words, these benchmarks test the model's ability to read and understand a dense corpus, where effectively each line in the corpus is important in answering the question.

What I understand about these benchmarks is that they index heavily on aggregation of information. An example task would be to classify the domain of a technical blog (e.g. ML, security, theory).

Summing up, a long reasoning problem is a problem interested in the aggregation of information given a finite set of documents. It is a mapping problem that transforms a larger set of information in a lower abstraction layer into a smaller set of labels in a higher abstraction layer.
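To make the contrast with NIAH concrete, here is a hedged toy sketch of that mapping: every line of the corpus contributes a little information to a single higher-level label, so processing cost scales with input length. The keyword lists are invented for illustration and bear no relation to any real benchmark:

```python
# Toy sketch of a long-reasoning-style aggregation task: map a corpus
# (low abstraction layer) to one domain label (high abstraction layer).
# The keyword lists are invented for illustration.
from collections import Counter

KEYWORDS = {
    "ML": {"gradient", "model", "training"},
    "security": {"exploit", "patch", "cve"},
    "theory": {"proof", "complexity", "reduction"},
}

def classify_domain(lines: list[str]) -> str:
    """Aggregate per-line signals into a single label for the whole corpus."""
    votes = Counter()
    for line in lines:                     # nearly every line matters,
        words = set(line.lower().split())  # so cost grows with input length
        for domain, keywords in KEYWORDS.items():
            votes[domain] += len(words & keywords)
    return votes.most_common(1)[0][0]
```

Unlike the needle search, there is nothing to isolate here: no single line contains the answer, and dropping lines degrades it.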

The reason behind this blog's title

I've claimed that Foam's problem is not NIAH. I'm also now claiming that it isn't long reasoning either.

So what is it then?

At the risk of inventing new jargon, I define Foam's problem as an Nth-degree needle-in-a-haystack problem.

What is an Nth-degree needle-in-a-haystack problem?

Foam's problem is a fusion of NIAH and long reasoning:

  1. Search for needle (root cause) in finite set of documents — if needle found, return; else, proceed.
  2. Aggregate information contained in finite set of documents — repeat.

At degree n=0, the set of documents is the collection of code, databases, traces, logs, etc. Foam conducts a search over these documents, and decides whether it has found the true needle or whether it needs to move up an abstraction (aggregation) layer. Equivalently, Foam decides whether the current degree n is sufficient in forming the gold answer.

If it is not, n is incremented. The act of incrementing n is the act of aggregating information contained within those documents.

The process then repeats until n == N.

Of course, degree N corresponds to the abstraction layer in which the true needle is found. Obviously, we don't know what N is beforehand, but that's the fun part! How do we verify that the needle is truly a needle? How does Foam verify that it has identified the true root cause and not merely a symptom of the error? Foam has to ask itself these questions at each abstraction layer.

At this final degree N, the set of documents is the set of root cause possibilities, and now, finally, the problem reduces to NIAH.
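The loop above can be sketched in a few lines. To be clear, `search` and `aggregate` below are hypothetical stand-ins, not Foam's actual machinery, and the verification step (deciding a candidate is the true root cause, not a symptom) is the hard part that this sketch hand-waves into `search`:

```python
# Hedged sketch of an Nth-degree NIAH loop. `search` looks for a
# verified needle at the current abstraction layer; `aggregate` lifts
# the document set one abstraction layer up. Both are hypothetical
# stand-ins for Foam's actual machinery.

def nth_degree_niah(documents, search, aggregate, max_degree=10):
    """Alternate search and aggregation until a verified needle is found."""
    for n in range(max_degree + 1):
        needle = search(documents)        # NIAH at degree n
        if needle is not None:            # verified root cause found
            return needle, n              # N turned out to be this n
        documents = aggregate(documents)  # long reasoning: move up a layer
    return None, max_degree               # budget exhausted; N exceeds it

# Toy usage: numbers stand in for documents; the "root cause" is any
# value >= 100, and aggregation sums adjacent pairs.
search = lambda docs: next((d for d in docs if d >= 100), None)
aggregate = lambda docs: [a + b for a, b in zip(docs[::2], docs[1::2])]

nth_degree_niah([30, 40, 20, 25], search, aggregate)  # → (115, 2)
```

The toy needs two rounds of aggregation before anything passes the verifier, which is exactly the shape of the problem: the needle simply does not exist at degree 0.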

To sum up, an Nth-degree NIAH problem is one in which long reasoning is the means of transforming the problem from the wrong abstraction layer for the given question into the correct one, such that it becomes a search problem reducible to NIAH.

Or, if you will…

It is finding a needle in a haystack in a barn on a farm...

Hence the title of this blog. Roll credits!

Post-credits scene

A good engineer writes good documentation.

It was probably midnight when the idea of Nth-degree NIAH showed signs of life in my brain, and during my morning commute to the office I went back and forth between re-examining my reference papers and typing out my thoughts in the default notes app on my phone (a long-time habit of mine). After lunch that day, I attempted to make my understanding more concrete by typing out a more eloquent version, this time on my laptop, in Notion, and sent what is effectively the first/second draft of this blog to the team. (Of course, I treated it more as restless musings than a blog draft at the time, not least because this was all unprompted.)

My musings must've been well-developed enough, since you're now reading the final version!

If you're cracked…

Come solve Nth-degree NIAH with me :)

You've made it this far after all. Might as well apply to join Foam's founding team!