For many, writing regexes is cumbersome, but I rediscovered the joy of writing them. These small systems are like miniature versions of functions. They are famously taught in Automata Theory, but you don't need any of that to notice the sheer witchcraft (and thus, excellence) behind each regex pattern. These systems impeccably conquer text and show up in a lot of code to answer complex problems.
Regex, for me, felt like a lifeline when everything else was making my routine too rigid and basic.
But before I tell you about my experience, I need to give you some context.
Context
In our college, we have a mandatory entrepreneurship course where teams build a new product over the span of three years. It's quite common in most engineering colleges here.
Our team is focused on providing healthy foods. Most of the people on our team are from Biotechnology; only I (and one other girl) are from Computer Science. I am the sole dev on the team. Yeah, it's quite freeing to do the work I want to do.
By the end of the third year, we will have to show a working prototype of our app (we are in our second year, third semester, right now). One of our seniors advised me to build a website first, then focus on the app.
Instead of diving into a framework I was less familiar with, I chose a powerful and scalable stack I knew I could build on: Flask, Jinja, SQLite, and Tailwind CSS.
The first problem came when we had to pitch our product. The judges wanted to see our website. My website's About Page was complete: it was responsive, the navbar and footer were templated out of the main content, and I had even hosted the whole project on GitHub Pages.
When the judge saw it, he asked if the website was complete, and we had to say "no".
Later, our team leader said that we would have to complete the website. And that's when I started thinking about what more I could do.
Walking on cobblestone
I started keeping a to-do list for this:
- Building the schema for the food catalogue
- Webpage for the food catalogue (using Shoelace, because it's more intuitive)
- Data Entry for the food catalogue
- Making 3 secure API endpoints
- Webpages for the rest
After a lot of contemplation, I finished the schema (though it will definitely change in the near future).
Now, that was all fine. But trying to shoehorn the design I'd envisioned into Shoelace was a serious challenge (it still is). At one point I was just so overwhelmed that I started to ignore the work altogether.
Finally, after some clarification from Gemini (yes, the LLM), I decided to focus on the Data Entry part. Yes, data entry is normally overwhelming, but I knew something I had practised more than a year ago: Regular Expressions.
I have solved a lot of crosswords on Regex Crossword, and I've even solved some AOC problems using regex. After a long time, as a student of computer science, I was prepared to play the music of regexes.
Entering the long-forgotten home
I was dealing with a PDF document listing all the recipes along with the Ingredients, Method, Benefits, number of calories, et cetera.
The best thing Gemini suggested was to break the whole parsing into multiple regexes and to use lookaheads. I had used lookbehinds a lot while solving Wordle, so this was relatively easy for me. I was also given a very specific format for the regex: (<Header>)(<Non-greedy check of the content in between>)(<Lookahead for the next header>)
It first consumes and checks the header, then the content in between, and then checks (but does not consume) whether another header follows.
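To make that concrete, here is a tiny runnable sketch of that three-part shape. The headers and the text are placeholders I made up, not the real ones from our PDF:

import re

# Toy input, just to show the shape: header, lazy content, lookahead for the next header.
text = "Ingredients: rice, lentils Method: cook everything together"

pattern = re.compile(
    r'((?:Ingredients|Method)\:)(.*?)(?=(?:Ingredients|Method)\:|\Z)',
    re.S,
)

for m in pattern.finditer(text):
    print(m.group(1), '->', m.group(2).strip())
# Ingredients: -> rice, lentils
# Method: -> cook everything together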
In case you're confused about greedy and lazy, it's actually pretty simple. Here's an analogy: a greedy king simply conquers everything without remorse. He does not look ahead at what's coming, and if a minister asks him to, he just doesn't care. A lazy king also conquers territory, but he is obsessed with overcontemplation. He takes his minister's advice to look ahead. He is in his head a lot, computing the best approach. Anyway...
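Here are the two kings in code; a throwaway example I'm adding for illustration, not something from the actual project:

import re

text = "Ingredients: rice Method: boil Benefits: warm"

# The greedy king (*) grabs as much as he can and only gives back what he must:
print(repr(re.search(r'Ingredients:(.*):', text).group(1)))
# ' rice Method: boil Benefits'

# The lazy king (*?) stops at the first point where the rest of the pattern still matches:
print(repr(re.search(r'Ingredients:(.*?):', text).group(1)))
# ' rice Method'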
The moment was weird. The home still felt new, and it felt so much better.
It started with the lookahead. At college, working in an app to figure out the regexes, I managed to extract the headers and use lookaheads successfully.
After a while, I had to switch to Regex101, because the app didn't support "." matching "\n".
Still at college, I eventually found the pattern:
((?:(?:Header1|Header2|...|HeaderN)\:).*?)(?=(?:(?:Header1|Header2|...|HeaderN)\:|\Z).*?)
If you observe carefully, I only kept the separation between the lookahead and the consuming part. Also, *? is our lazy king, and * is the greedy king (I learnt this today).
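Here is roughly how that intermediate pattern behaves, with two invented headers standing in for Header1...HeaderN, so it's a sketch rather than my actual recipe data:

import re

sections = re.compile(
    r'((?:(?:Ingredients|Method)\:).*?)(?=(?:(?:Ingredients|Method)\:|\Z).*?)',
    re.S,
)

text = "Ingredients:\nrice, water\nMethod:\nboil the rice\n"
for m in sections.finditer(text):
    print(repr(m.group(1)))
# 'Ingredients:\nrice, water\n'
# 'Method:\nboil the rice\n'

Each header and its content land in a single consuming group, and the lookahead only peeks at what comes next.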
And it was kinda capturing text wonderfully. And honestly speaking, I knew the house I had entered. I felt as if I was destined to see this place again.
Found the photo album
Eventually, after I returned home from college, I rested. Then I opened my laptop at 1 AM and started making a similar pattern for separating the recipes themselves. After a lot of fiddling and tinkering, I eventually found all the patterns. Along with that, I refreshed my concepts of the re module in Python.
Those few hours were some of the happiest hours of my life. As I went from [\W\d]+ to \W+\d+, it looked great. Parsing the groups had to involve turning the whole thing into a JSON file; converting it into a bunch of SQL statements is for some other day.
I finally made the program, and it was properly pretty-printing the JSON stuff. Here's the relevant snippet of the Python program:
[...]

# First regex: split the document into recipes. Each recipe starts with a
# numbered heading line, optionally preceded by a '---' separator.
recipe_obj = re.compile(
    r'(?:\-\-\-\n)?(^\W+\d+\.\s.*?$)(.*?)(?=(?:^\W+\d+\.\s.*?$)|\Z)',
    re.S | re.M,
)
ab = recipe_obj.finditer(a)

# Second regex: split each recipe body into its named sections.
section_obj = re.compile(
    r'(?:(Ingredients|Method|Benefits|Calories|Cost)\:)(.*?)(?:\n)?'
    r'(?=(?:(?:Ingredients|Method|Benefits|Calories|Cost)\:|\Z).*?)',
    re.S | re.M,
)

recipes = {}
for i, match in enumerate(ab):
    recipe = {}
    hehe = section_obj.finditer(match.group(2))
    for sections in hehe:
        recipe[sections.group(1)] = sections.group(2)
    recipes[match.group(1)] = recipe

pprint.pprint(recipes)
Here, re.S means that "." also matches "\n", and re.M means that ^ and $ match at the start and end of each individual line.
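If that sounds abstract, a couple of throwaway lines (again on made-up text, not the recipe PDF) show what the two flags change:

import re

text = "Ingredients:\nrice\nMethod:\nboil"

# Without re.S, '.' stops at the newline right after the colon:
print(repr(re.search(r'Ingredients:(.*)', text).group(1)))        # ''
# With re.S, '.' also matches '\n', so the match runs to the end:
print(repr(re.search(r'Ingredients:(.*)', text, re.S).group(1)))  # '\nrice\nMethod:\nboil'

# With re.M, '^' anchors at the start of every line, not just the string:
print(re.findall(r'^\w+\:', text, re.M))                          # ['Ingredients:', 'Method:']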
Do you notice something? Newline checks were added, non-capturing groups were decisively placed, and the non-greedy .*? is within a separate set of parentheses, all on its own.
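And since the goal from earlier was to turn all of this into a JSON file, the last step is small. The file name below is just a placeholder of mine, but the call is the standard json module:

import json

# Write the recipes dictionary from the snippet above to disk.
with open('recipes.json', 'w', encoding='utf-8') as f:
    json.dump(recipes, f, indent=2, ensure_ascii=False)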
While rummaging through the home of Regex, I somehow found the photo album.
...flipping through the pages
While doing all of this, I was in a genuine flow state. Just the previous day, I had been contemplating the future, sure that I wouldn't be able to make it even after doing so many projects...
The next words might sound cliché, but if you want to read them, then you're good to go.
It's as if human words are complex, yet Regex actually does what it says. The patterns look less like an essay and more like a puzzle to crack.
I feel good when I work with regexes. I was doing what I love; I was plotting exactly which piece of text to capture. I could create the regexes I wanted to.
Non-capturing groups that still consume characters make the experience a lot better. I love how I can make neat groups of strings. Such a clean system.
You don't want something in a certain outer set of parentheses, but you need the part in the inner parentheses? Just use ?: for the outer one, and nothing for the inner one...
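A tiny made-up illustration of that, so you can see which group survives:

import re

# The outer group is non-capturing (?:...); the inner one still captures.
m = re.search(r'(?:Calories\:\s*(\d+))', 'Calories: 250 kcal')
print(m.groups())  # ('250',) - only the inner parentheses show up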
I hope the problems of our life could be given a non-capturing operator... I hope the things that matter in our life could sit in a separate set of parentheses together, away from the monotony and the sheer roboticness of the headers...
I hope some flags of our lives could give us rose-tinted perspectives on the world.
The sheer elegance of a simple data-entry task turning into a regex problem is mind-blowing. I love this, in its purest, dearest essence. I finally fell in love with the process.