Vibe coding a parser

I have a custom tool written in PHP that I use to manage references, associate them with local copies of the papers' PDFs, and build annotated bibliographies. When I wrote it I used the RefWorks Tagged format to import new references, but I want to add support for importing BibTeX now, as I’m using LaTeX for all my writing. I used the RefWorks format initially because it is trivial to parse: each field is a single line, starting with a two-character field identifier. BibTeX is a bit more complicated: superficially it looks trivial, but there are some tricky issues. Curly brackets are used for the block round each reference’s fields, but also as an alternative string delimiter, and identifiers are so flexible that they require a context-sensitive lexer.
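To show why a line-per-field format is so easy to handle, here is a minimal sketch of a tagged-line reader. It is not the code from my tool: the function name `parseTaggedRecords` is made up for this example, and it assumes that records are separated by blank lines and that every line is a two-character tag, a space, then the value.

```php
<?php
// Hypothetical helper for illustration only.
// Assumes: one field per line, "<two-character tag> <value>",
// and a blank line between records.
function parseTaggedRecords(string $data): array
{
    $records = [];
    $current = [];
    foreach (preg_split('/\r\n|\r|\n/', $data) as $line) {
        if (trim($line) === '') {
            // Blank line: close the current record, if there is one.
            if ($current) {
                $records[] = $current;
                $current = [];
            }
            continue;
        }
        // Two-character tag, a space, then the rest of the line as the value.
        if (preg_match('/^([A-Z0-9]{2}) (.*)$/', $line, $m)) {
            // Tags can repeat (several authors, say), so keep a list per tag.
            $current[$m[1]][] = rtrim($m[2]);
        }
    }
    // Flush the last record if the input did not end with a blank line.
    if ($current) {
        $records[] = $current;
    }
    return $records;
}
```

There is no nesting and no delimiter to balance, so a single regular expression per line is all the "lexer" this needs. That is exactly what BibTeX does not allow.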

To get going I first searched for an existing BibTeX parser, and quickly found two. The first, on GitHub, I rejected after a fairly cursory look at the code – PHP that looks like it was written by a Java addict is never a good sign, and it all looked way over the top as a solution for a rather simple problem.

The second one was on PEAR, the PHP Extension and Application Repository, the PHP equivalent of JavaScript’s npm or .NET’s NuGet. It hadn’t been updated for several years, but at least looked like sane code. I spent about 30 minutes updating the code to be compatible with PHP 8, and tested it on some examples. It worked OK on most, but on my full bibliography it went into a loop and timed out. Rather than search through over 1000 lines of code for a bug that wasn’t triggering errors, I decided to try another approach.

I first tried using my own parser tools, but quickly realised BibTeX is too different from the conventional-looking programming languages these are designed for, and it was going to take a lot of work. However, I also realised that my older lexer-like parsers would have worked, and actually would have made it fairly simple. So I thought I’d just try to hand-write a lexer-style parser.
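As a rough idea of what the lexer-style approach involves, the core step is reading a brace-delimited value while tracking nesting depth. This is a sketch, not code I ended up using: `readBracedValue` is a hypothetical name, and it assumes the caller has already located the opening brace.

```php
<?php
// Hypothetical helper for illustration only.
// Reads a brace-delimited value whose opening "{" is at offset $pos.
// Returns [value without the outer braces, offset just past the closing "}"].
function readBracedValue(string $text, int $pos): array
{
    if (($text[$pos] ?? '') !== '{') {
        throw new InvalidArgumentException('Expected "{" at offset ' . $pos);
    }
    $depth = 0;
    $len = strlen($text);
    for ($i = $pos; $i < $len; $i++) {
        $ch = $text[$i];
        if ($ch === '\\') {
            // Skip the character after a backslash, so \{ and \} are not counted.
            $i++;
            continue;
        }
        if ($ch === '{') {
            $depth++;
        } elseif ($ch === '}') {
            $depth--;
            if ($depth === 0) {
                // Matching close found: return what sits between the outer braces.
                return [substr($text, $pos + 1, $i - $pos - 1), $i + 1];
            }
        }
    }
    throw new RuntimeException('Unbalanced braces starting at offset ' . $pos);
}
```

Because it counts depth rather than matching a pattern, it copes with nested braces and with values that span several lines, which is where a line-at-a-time regular expression starts to struggle.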

I was using Visual Studio 2026, something I normally avoid when writing PHP, as the AI suggestions are usually useless and annoying. On this occasion however, as soon as I’d typed:

class bibtexParser
{
    public function parse($data)
    {

The AI completed it to be:


class bibtexParser
{
    public function parse($data)
    {
        $entries = [];
        //$this->mergeMultilineFields($data);
        $lines = explode("\n", $data);
        $currentEntry = null;
        foreach ($lines as $line) {
            if (preg_match('/^@(\w+)\s*{\s*(\w+),/', $line, $matches)) {
                if ($currentEntry) {
                    $entries[] = $currentEntry;
                }
                $currentEntry = [
                    'type' => $matches[1],
                    'key' => $matches[2],
                    'fields' => []
                ];
            } elseif (preg_match('/^\s*(\w+)\s*=\s*{(.*)},?/', $line, $matches)) {
                if ($currentEntry) {
                    $currentEntry['fields'][$matches[1]] = $matches[2];
                }
            }
        }
        if ($currentEntry) {
            $entries[] = $currentEntry;
        }
        return $entries;
    }
}

I tried it on my examples, and it worked. Not perfect, but with one more AI-assisted tweak to clean up fields with nested curly braces, it’s good enough for my current needs.
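For anyone wondering what that kind of clean-up might look like, here is one possible sketch. It is not the actual tweak: `stripNestedBraces` is a hypothetical name, and it assumes the braces left in a field are only there for case protection, as in `{B}ib{T}e{X}`.

```php
<?php
// Hypothetical helper for illustration only.
// Removes unescaped braces from a field value and tidies the whitespace.
// Escaped braces (\{ and \}) are kept as they are.
function stripNestedBraces(string $value): string
{
    // Drop any "{" or "}" that is not preceded by a backslash.
    $clean = preg_replace('/(?<!\\\\)[{}]/', '', $value);
    // Collapse runs of whitespace and trim the ends.
    return trim(preg_replace('/\s+/', ' ', $clean));
}
```

So `stripNestedBraces('The {B}ib{T}e{X} format')` gives `The BibTeX format`. The obvious limitation is that it also removes the braces around arguments to LaTeX commands, so `\emph{word}` would come out as `\emphword`; that is acceptable only if the fields rarely contain markup.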

In the past I’ve never found AI helpful with PHP – I’ve assumed that reflects the standard of PHP code that’s available for training. But on this occasion, I think the AI code is basically a lot less bad than either of the examples I found. It has issues, and needs some refinement, but it’s not bloated.
