<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Eli Thorkelson</title>
  <subtitle>Tech Blog</subtitle>
  <id>https://decasia.org/tech</id>
  <link href="https://decasia.org/tech"/>
  <link href="https://decasia.org/feeds/tech.xml" rel="self"/>
  <updated>2026-04-03T20:07:00+00:00</updated>
  <author>
    <name>Eli Thorkelson</name>
  </author>
  <entry>
    <title>Bike safety rules in cities</title>
    <link rel="alternate" href="/tech/2026/04/bike-safety.html"/>
    <id>/tech/2026/04/bike-safety.html</id>
    <published>2026-04-03T20:07:00+00:00</published>
    <updated>2026-04-03T20:07:00+00:00</updated>
    <author>
      <name>Eli Thorkelson</name>
    </author>
    <summary type="html">&lt;p&gt;Outside of work, I really like to ride my bike. I’m not really interested in racing or the competitive side of cycling; but I love the experience of it, which always feels a little like flying, and I love how much you can see a city from a bike. It...&lt;/p&gt;</summary>
    <content type="html">&lt;p&gt;Outside of work, I really like to ride my bike. I&amp;rsquo;m not really interested in racing or the competitive side of cycling; but I love the experience of it, which always feels a little like flying, and I love how much you can see a city from a bike. It&amp;rsquo;s like a whole other way of seeing the world that you can&amp;rsquo;t get from walking or any other kind of transportation. I moved to Chicago when I was 23 and my bike was my main form of transportation for years. I never owned a car in those days; I just rode my bike everywhere. There was a year when I had a 13 mile daily bike commute — I don&amp;rsquo;t think I have ever been in such good shape. It&amp;rsquo;s world-changing, somehow, to spend so much time doing this activity.&lt;/p&gt;

&lt;p&gt;Unfortunately, riding a bicycle in a city can also lead to crashes. It is objectively somewhat risky.&lt;/p&gt;

&lt;p&gt;So every time I have any crash or incident, I&amp;rsquo;ve started to do a risk analysis afterwards, to see what I could do differently in the future. (Here, it starts to intersect with professional software development just a little bit, inasmuch as formal risk modeling is part of software engineering.)&lt;/p&gt;

&lt;p&gt;Ideally, if you find out what scenarios are likely to cause bike accidents, you can avoid putting yourself in those situations again.&lt;/p&gt;

&lt;p&gt;Over time, I&amp;rsquo;ve formed a list of rules for bike riding.&lt;/p&gt;

&lt;h2 id="basic-rules"&gt;Basic rules&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Reduce speed.&lt;/strong&gt; Speed is one of the biggest risk factors while biking, because it makes accidents more likely and it also makes them so much worse. Kinetic energy is proportional to the square of your velocity, so every time you double your velocity, a crash will have 4x as much energy to dissipate into your body. Also, speed shrinks the time you have to react; the faster you go, the less time you will have to respond to any approaching risks. So, Rule 1: &lt;strong&gt;don&amp;rsquo;t go super fast&lt;/strong&gt;. (I enforce this for myself on my road bikes, which have front and rear gears, by almost never shifting the front shifter into the biggest (outer) ring - I figure that the increased safety is worth going a little slower.)&lt;/li&gt;
&lt;li&gt;Scan the road surface carefully and continuously. Mentally note any bumps, dips, cracks, grooves, potholes, drains, grates, speed bumps, or any other imperfections in your path, so that you have a detailed picture of exactly what risks are approaching on the roadway. Over time, you can develop intuitions about how your bike will interact with any given obstacle (&amp;ldquo;this small bump will be fine,&amp;rdquo; &amp;ldquo;this sudden dip in the road looks dangerous,&amp;rdquo; or &amp;ldquo;this is a huge bump and it&amp;rsquo;s too late to avoid it, &lt;em&gt;eep, hold on tight&lt;/em&gt;&amp;rdquo;).&lt;/li&gt;
&lt;li&gt;Constantly scan the environment for all other vehicles, bikes, pedestrians, or any other moving objects. Estimate when or if they might intersect your bike&amp;rsquo;s future trajectory, or, worse, if they might hit you based on their current path.&lt;/li&gt;
&lt;li&gt;Scan &lt;em&gt;behind&lt;/em&gt; yourself for any approaching vehicles. Especially check when you hear noises, or if you are considering moving farther left into the travel lane.&lt;/li&gt;
&lt;li&gt;When possible, make eye contact with approaching people or vehicles, or wave at them. You want to &lt;strong&gt;communicate&lt;/strong&gt; with people so you can establish mutual awareness and (ideally) respect. You want to be sure that &lt;em&gt;they know you are there&lt;/em&gt;, and if possible, ensure that they are planning not to hit you. I usually try to make eye contact first. If I&amp;rsquo;m not sure that eye contact is working, I might wave at them too. People&amp;rsquo;s eyes tend to notice motion.&lt;/li&gt;
&lt;li&gt;Choose your trajectory carefully and precisely on the road. Choose your path to avoid obstacles and minimize risks, given everything you know about the situation around you, and given your state of communication with other people and vehicles. (The longer I ride my bike, the easier it is to place it precisely where I want it to go on the road.) You might want to ride closer to or farther from the curb; you might want to avoid obstacles or go over them; you might want to slow down or speed up. The point is, you are making all these choices in view of the situation.&lt;/li&gt;
&lt;li&gt;There is a flip side to the last point: &lt;em&gt;Don&amp;rsquo;t change your trajectory without knowing what&amp;rsquo;s happening around you, especially what&amp;rsquo;s behind you&lt;/em&gt;. Just because you see an obstacle in your path, this does not mean it is safe to swerve out of the way - you could be swerving into a new problem. (In emergencies, you might need to swerve without being absolutely sure of what&amp;rsquo;s behind you; this is a risky choice, even when it&amp;rsquo;s necessary.)&lt;/li&gt;
&lt;li&gt;If you spot a dangerous situation like a possible collision approaching, remember that you have more than one option for how to avoid it. Obviously, you can hit the brakes and slow down. But you can also steer out of the way or, unintuitively, you can also speed up. (Speeding up is a perfectly valid strategy for avoiding collisions.) Obviously in a real emergency you won&amp;rsquo;t have much time to think about your options, but it&amp;rsquo;s good to know in theory that you have more options than just hitting the brakes.&lt;/li&gt;
&lt;li&gt;It is OK to just pull over and stop momentarily if you need to. For instance, sometimes when an obstacle is approaching (a parked car in an otherwise empty street), instead of moving farther out into the road and going around it, I pull over and stop for a few seconds to let traffic pass me, and then continue.&lt;/li&gt;
&lt;/ol&gt;
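
&lt;p&gt;To put rough numbers on Rule 1, here is a minimal Ruby sketch of the kinetic energy formula (one half times mass times velocity squared). The 85 kg combined mass of rider and bike and the two speeds are illustrative assumptions, not measurements.&lt;/p&gt;

&lt;pre&gt;&lt;code class="ruby"&gt;# Kinetic energy in joules: 0.5 * mass * velocity squared.
def kinetic_energy_joules(mass_kg, speed_m_per_s)
  0.5 * mass_kg * speed_m_per_s**2
end

MASS_KG = 85.0 # assumed rider plus bike; purely illustrative

slow = kinetic_energy_joules(MASS_KG, 5.0)  # 5 m/s is about 18 km/h
fast = kinetic_energy_joules(MASS_KG, 10.0) # 10 m/s is about 36 km/h

puts slow        # prints 1062.5
puts fast        # prints 4250.0
puts fast / slow # prints 4.0
&lt;/code&gt;&lt;/pre&gt;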

&lt;h2 id="intersections-and-traffic"&gt;Intersections and traffic&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Respect traffic signs and signals the same as a car, especially if there are any cars around whatsoever. It makes drivers mad at you if you openly ignore the traffic signals and stop signs. It is also just safer and easier for everyone if you respect the traffic rules. This is one way we can help to create a consistent, predictable environment.&lt;/li&gt;
&lt;li&gt;Intersections are generally high risk places. Always assume that car drivers &lt;em&gt;don&amp;rsquo;t&lt;/em&gt; see you until they have proven otherwise. Assume people might ignore the traffic signals (for example, they might keep going after the stoplight turns red). Don&amp;rsquo;t enter an intersection until you are sure it is safe.&lt;/li&gt;
&lt;li&gt;It&amp;rsquo;s often safer at intersections to join the vehicle traffic and wait your turn than to try to dodge along the side of a line of stopped cars. You do have a legal right to do this as well, for what that&amp;rsquo;s worth.&lt;/li&gt;
&lt;li&gt;Driveways are also dangerous, though not as dangerous as intersections, because people entering or leaving a driveway don&amp;rsquo;t always see you. Check driveways as you pass for signs of motion. Slow down if you are approaching a driveway and can&amp;rsquo;t see what&amp;rsquo;s inside, like if it is behind a construction fence or a gate.&lt;/li&gt;
&lt;li&gt;Never ride on sidewalks unless you are accompanying kids on a bike ride.&lt;/li&gt;
&lt;li&gt;Bike lanes at the edge of the road are a mixed blessing. Cars don&amp;rsquo;t always pay attention to them. People tend to park across them, meaning you will have to go back out into the vehicle lane. Bike lanes tend to have more debris, fallen leaves, and broken glass. Don&amp;rsquo;t avoid bike lanes, but use caution.&lt;/li&gt;
&lt;li&gt;Especially beware of bike lanes at intersections, because any traffic turning will have to cross the bike lane and may not see you.&lt;/li&gt;
&lt;li&gt;Consider traffic patterns when monitoring the road. For example, during rush hour, there will not only be more traffic on the roads, there will also be a higher rate of people entering and exiting driveways and parking lots, so you have to be extra cautious on both sides.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id="relationships-with-drivers"&gt;Relationships with drivers&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;More communication is usually better. Don&amp;rsquo;t hesitate to communicate.&lt;/li&gt;
&lt;li&gt;Make noise as needed for safety reasons. Yell something at drivers to help them notice you, shout at pedestrians to watch out, whatever makes things safer. It&amp;rsquo;s better to be a little bit rude than for someone to get hurt.&lt;/li&gt;
&lt;li&gt;It&amp;rsquo;s usually better to be extra nice to people (like let them go first, let them pass you, generally try to help them) than otherwise, and it probably makes you a little safer.&lt;/li&gt;
&lt;li&gt;Signal your left or right turns with your hands. It&amp;rsquo;s safer for you if other people understand your trajectory.&lt;/li&gt;
&lt;li&gt;Drivers may try to pass you in dangerous ways. You can try to help avert catastrophes. For instance, if a car is trying to pass you from behind, and you&amp;rsquo;re ahead of them on a curve, it&amp;rsquo;s smart to signal the car with hand signals whether it&amp;rsquo;s safe to proceed. (I have seen some near misses because drivers moved into the oncoming traffic lane to pass me, but someone was coming.)&lt;/li&gt;
&lt;li&gt;Think about what messages you are sending to other vehicles because of how you ride. Sometimes I deliberately wobble just a little bit around the road, to make approaching vehicles think I might be a little bit incompetent and they should give me a little more space. Some drivers seem to pass a little closer if you are going in an even, straight line.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id="failures-of-communication"&gt;Failures of communication&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Don&amp;rsquo;t assume that drivers follow the rules about how to take turns at stop signs. A lot of people will just go once they have made a brief stop, even if it isn&amp;rsquo;t supposed to be their turn.&lt;/li&gt;
&lt;li&gt;Any driver who is looking at their phone is an existential risk to your life and should be monitored with extreme caution. Such drivers probably don&amp;rsquo;t see you, their performance is impaired, and you probably can&amp;rsquo;t communicate with them at all.&lt;/li&gt;
&lt;li&gt;I am scared of self-driving cars, as a cyclist, because I don&amp;rsquo;t know how to communicate with them, or if it is even possible to communicate with them.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id="parked-cars"&gt;Parked cars&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Never ride close to parked cars, because their doors can open in front of you at any time, potentially causing a horrible crash.&lt;/li&gt;
&lt;li&gt;If in spite of the previous rule, you do have to ride close to parked cars, try to watch them to see if anyone is inside them. You can sometimes spot people&amp;rsquo;s heads through the rear windshields or windows, giving some clues about whether cars are occupied.&lt;/li&gt;
&lt;li&gt;If you do have to ride close to parked cars, it is also wise to slow down.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id="road-shapes"&gt;Road shapes&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Always leave space between yourself and the very edges of the road. You can crash if you rub up against a curb. You can also crash by slipping off the side of a roadway that is built up above the surroundings (this is more common in rural environments with thick layers of asphalt).&lt;/li&gt;
&lt;li&gt;Always cross railroad tracks at a right angle to the rails. If you approach at an oblique (near-parallel) angle, your wheel can get trapped in the groove between the rail and the roadway, causing an instant crash. (Ask me how I know.)&lt;/li&gt;
&lt;li&gt;Always be very cautious about &lt;em&gt;anything&lt;/em&gt; in the road whose surface is nearly parallel to your direction of travel, because bike wheels are not able to safely negotiate such surfaces. For example, if there is a groove in the road that is nearly parallel to the way you are going, avoid it; your wheels can slip into it and tip you over. Cross grooved or lipped shapes at perpendicular angles if at all possible.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id="relationships-with-pedestrians"&gt;Relationships with pedestrians&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Always defer to pedestrians, but especially if they are kids, parents with strollers, or otherwise more vulnerable people.&lt;/li&gt;
&lt;li&gt;Anyone running or jogging is a special danger to bikes, because they tend to run out into the road without looking, and they move fast compared to other pedestrians.&lt;/li&gt;
&lt;li&gt;If small kids say hi to you, always say hi back to them. (This one isn&amp;rsquo;t about safety, just about humanity.)&lt;/li&gt;
&lt;li&gt;In general, bike riding can be a nice way to have some tiny, fleeting encounters with strangers, even if it&amp;rsquo;s just a smile or two in passing.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id="miscellaneous-risks"&gt;Miscellaneous risks&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Beware darkness. It&amp;rsquo;s easier to spot vehicles in the dark (because of their headlights), but darkness makes it hard to see the surface of the road in detail.&lt;/li&gt;
&lt;li&gt;Beware slippery conditions. Slippery road surfaces are probably the second most important risk factor, especially if you ride a road bike with skinny tires, since traction from the bike wheels is greatly reduced when it is raining. So, increase caution and reduce speed whenever the roads are wet or snowy. Increase caution even more when it&amp;rsquo;s cold if there is any ice on the road.&lt;/li&gt;
&lt;li&gt;Don&amp;rsquo;t carry anything in your hands while you bike. This can make it harder to handle the bike in complex moments, leading to crashes.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id="gear"&gt;Gear&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Always wear a bike helmet (even though they are kind of annoying). My cousin died because he didn&amp;rsquo;t wear a bike helmet and someone opened their car door in his path.&lt;/li&gt;
&lt;li&gt;Get a really bright headlight. I once asked some Chicago bicycle cops what brand they used (it was ludicrously bright) and then I bought that. (For the record, it was &lt;a href="https://magicshine.com/"&gt;Magic Shine&lt;/a&gt;.)&lt;/li&gt;
&lt;li&gt;Always have a back light. Probably have it flash for extra visibility.&lt;/li&gt;
&lt;li&gt;It is strongly encouraged to have a bell, horn, or other audible signal.&lt;/li&gt;
&lt;li&gt;For extra visibility, decorate your helmet, backpack, etc., with extra reflectors, shiny patches, or extra lights.&lt;/li&gt;
&lt;li&gt;If possible, wear bright or even fluorescent/high viz clothes. (I don&amp;rsquo;t always do this, but it is definitely a safety factor.)&lt;/li&gt;
&lt;li&gt;Consider environmental issues, especially on longer rides - like bring enough (or any) water if it is warm, consider sunscreen and UV protection on exposed skin, always have gloves in cool temps to keep your hands warm enough, all this. Being comfortable can become a safety factor, because discomfort can make you more agitated or inattentive.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id="accidents"&gt;Accidents&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Don&amp;rsquo;t expect that anyone will necessarily help you if you crash. People have this horrible tendency to just keep going and not&amp;hellip; do anything to help.&lt;/li&gt;
&lt;li&gt;Maybe carry some really basic first aid stuff with you, like some bandaids in case you have minor scrapes that need covering.&lt;/li&gt;
&lt;li&gt;Don&amp;rsquo;t be afraid to ask for help if you need it, especially in emergencies. One time I crashed my bike because of a sudden dip in a road at night, and I ended up staggering to the first house nearby with a light on the porch. They ended up driving me home with my bike. People can be really kind.&lt;/li&gt;
&lt;li&gt;However, sometimes people are awful and will turn you away even if you&amp;rsquo;re hurt and ask them for help. I don&amp;rsquo;t even know what to say about those people, just know that unfortunately they are out there.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id="meta"&gt;Meta&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Commit to reducing overall risks over time, by learning better safety rules and writing mental lists like this one.&lt;/li&gt;
&lt;li&gt;Accept that biking is dangerous, no matter how you do it, but it&amp;rsquo;s worth it anyway.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id="caveats"&gt;Caveats&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;These rules work for me, but they are not intended to be universal.&lt;/li&gt;
&lt;li&gt;They probably don&amp;rsquo;t make sense if you are in very different circumstances from me.&lt;/li&gt;
&lt;li&gt;I might edit all these again in the future.&lt;/li&gt;
&lt;li&gt;For transparency, this was last edited &lt;strong&gt;April 3, 2026&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
</content>
  </entry>
  <entry>
    <title>Against the Five Whys</title>
    <link rel="alternate" href="/tech/2026/02/problem-with-five-whys.html"/>
    <id>/tech/2026/02/problem-with-five-whys.html</id>
    <published>2026-02-28T13:51:00+00:00</published>
    <updated>2026-02-28T13:51:00+00:00</updated>
    <author>
      <name>Eli Thorkelson</name>
    </author>
    <summary type="html">&lt;p&gt;My org has recently started using the Toyota &lt;a href="https://en.wikipedia.org/wiki/Five_whys"&gt;Five Whys&lt;/a&gt; to analyze incidents.&lt;/p&gt;

&lt;p&gt;The way we use it, it looks something like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Q1: Why did we see &lt;em&gt;A&lt;/em&gt;?&lt;/li&gt;
&lt;li&gt;A1: Because &lt;em&gt;B&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Q2: Why did &lt;em&gt;B&lt;/em&gt;?&lt;/li&gt;
&lt;li&gt;A2: Because &lt;em&gt;C&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Q3: Why did &lt;em&gt;C&lt;/em&gt;?&lt;/li&gt;
&lt;li&gt;A3: Because &lt;em&gt;D&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Q4: Why did &lt;em&gt;D&lt;/em&gt;?&lt;/li&gt;
&lt;li&gt;A4: Because...&lt;/li&gt;
&lt;/ul&gt;</summary>
    <content type="html">&lt;p&gt;My org has recently started using the Toyota &lt;a href="https://en.wikipedia.org/wiki/Five_whys"&gt;Five Whys&lt;/a&gt; to analyze incidents.&lt;/p&gt;

&lt;p&gt;The way we use it, it looks something like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Q1: Why did we see &lt;em&gt;A&lt;/em&gt;?&lt;/li&gt;
&lt;li&gt;A1: Because &lt;em&gt;B&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Q2: Why did &lt;em&gt;B&lt;/em&gt;?&lt;/li&gt;
&lt;li&gt;A2: Because &lt;em&gt;C&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Q3: Why did &lt;em&gt;C&lt;/em&gt;?&lt;/li&gt;
&lt;li&gt;A3: Because &lt;em&gt;D&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Q4: Why did &lt;em&gt;D&lt;/em&gt;?&lt;/li&gt;
&lt;li&gt;A4: Because &lt;em&gt;E&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Q5: Why did &lt;em&gt;E&lt;/em&gt;?&lt;/li&gt;
&lt;li&gt;A5: Because &lt;em&gt;F&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The idea is that you get deeper into the process failures that lie behind a problem when you dig in like this.&lt;/p&gt;

&lt;p&gt;But every time I write an RCA document, I find that the Five Whys is inadequate.&lt;/p&gt;

&lt;h2 id="why-the-five-whys-is-inadequate"&gt;Why the Five Whys is inadequate&lt;/h2&gt;

&lt;p&gt;It is inadequate because it presumes a basically linear causality.&lt;/p&gt;

&lt;p&gt;Almost all interesting incidents have &lt;em&gt;multiple causality&lt;/em&gt;. Meaning that there isn&amp;rsquo;t just a straight line between A and F. There is a cluster of issues.&lt;/p&gt;

&lt;p&gt;When I have to fill out this document, I usually hack the format by listing several causes under A2. And I put several causes under A3. And so on.&lt;/p&gt;

&lt;p&gt;In this way, you can respect the form of the Five Whys while also having the intellectual integrity to acknowledge that &lt;em&gt;causality is almost always multiple&lt;/em&gt; and incidents happen at the junction between multiple causal flows.&lt;/p&gt;
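
&lt;p&gt;One way to picture the difference: a strict Five Whys is a chain in which every effect has exactly one cause, while the hacked version is a graph in which an effect can have several. Here is a minimal Ruby sketch of the graph form. The incident and every cause name in it are hypothetical, invented for illustration, and the walk assumes the graph has no cycles.&lt;/p&gt;

&lt;pre&gt;&lt;code class="ruby"&gt;# Each key is an effect; each value lists ALL of its contributing causes.
# A strictly linear Five Whys would allow only one entry per list.
CAUSES = {
  checkout_outage:     %i[database_overloaded no_circuit_breaker],
  database_overloaded: %i[slow_query_deployed cache_was_cold],
  no_circuit_breaker:  %i[deferred_as_tech_debt],
  slow_query_deployed: %i[no_query_review_step],
  cache_was_cold:      %i[cache_flushed_during_deploy]
}.freeze

# Walk from an effect down to the causes that have no recorded cause of
# their own, i.e. the points where this particular analysis stopped asking.
def root_causes(effect, causes = CAUSES)
  direct = causes.fetch(effect, [])
  return [effect] if direct.empty?

  direct.flat_map { |cause| root_causes(cause, causes) }.uniq
end

p root_causes(:checkout_outage)
# prints [:no_query_review_step, :cache_flushed_during_deploy, :deferred_as_tech_debt]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Three separate root causes come out of a single incident, which is exactly what a single A-to-F chain cannot express.&lt;/p&gt;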

&lt;h2 id="alternative-approaches"&gt;Alternative approaches&lt;/h2&gt;

&lt;p&gt;I&amp;rsquo;m not the first to criticize the Five Whys; see &lt;a href="https://en.wikipedia.org/wiki/Five_whys#Criticism"&gt;the criticism section in the Wikipedia article&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It turns out that researchers have already invented better forms of analysis that represent multiple forces better.&lt;/p&gt;

&lt;p&gt;One approach is the &amp;ldquo;fishbone diagram&amp;rdquo;: it shows multiple forces that collectively &lt;em&gt;caused&lt;/em&gt; an incident.&lt;/p&gt;

&lt;p&gt;&lt;img src="/images/tech/fishbone-diagram.png" alt="fishbone diagram"&gt;&lt;/p&gt;

&lt;p&gt;The idea is that lots of things had to collectively go wrong to cause an incident, and you diagram them in layers.&lt;/p&gt;

&lt;p&gt;An even better approach is described in Alan J. Card&amp;rsquo;s 2013 paper, &lt;a href="https://www.researchgate.net/publication/270598841_A_new_tool_for_hazard_analysis_and_force-field_analysis_The_Lovebug_diagram"&gt;A new tool for hazard analysis and force field analysis: The Lovebug Diagram&lt;/a&gt;. A &amp;ldquo;lovebug diagram&amp;rdquo; is basically two fishbone diagrams that represent opposing directions. The name comes from &amp;ldquo;a mating pair of &lt;em&gt;Plecia nearctica&lt;/em&gt;, commonly known as lovebugs.&amp;rdquo;&lt;/p&gt;

&lt;p&gt;&lt;img src="/images/tech/lovebug-diagram.png" alt="lovebug diagram"&gt;&lt;/p&gt;

&lt;p&gt;(&lt;a href="https://journals.sagepub.com/doi/10.1177/1356262213510855"&gt;Published version here&lt;/a&gt;, but I don&amp;rsquo;t have access anymore because I&amp;rsquo;m not in academia :/)&lt;/p&gt;

&lt;p&gt;It&amp;rsquo;s a great visualization because it shows multiple forces that &lt;em&gt;caused&lt;/em&gt; an incident and also multiple forces that tried to &lt;em&gt;prevent&lt;/em&gt; an incident. And this, in my mind, is a more realistic image of how things actually happen. We do have safety controls. In an incident, it&amp;rsquo;s just that they weren&amp;rsquo;t enough.&lt;/p&gt;

&lt;p&gt;I wish I could use these diagrams in our RCAs instead of Five Whys. But I guess the Five Whys is better than no methodology whatsoever, or just letting people write down that an incident happened because of one single reason. Methodology is better when it provides a &lt;em&gt;lower bound&lt;/em&gt; for our inquiries, but it doesn&amp;rsquo;t have to be an &lt;em&gt;upper bound&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Anyway, causality is always multiple in complex systems. I won&amp;rsquo;t stop insisting on this.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Ruby 'require' can deadlock</title>
    <link rel="alternate" href="/tech/2026/02/require-can-deadlock.html"/>
    <id>/tech/2026/02/require-can-deadlock.html</id>
    <published>2026-02-15T17:19:00+00:00</published>
    <updated>2026-02-15T17:19:00+00:00</updated>
    <author>
      <name>Eli Thorkelson</name>
    </author>
    <summary type="html">&lt;p&gt;Another short post with a surprising thing to learn about Ruby.&lt;/p&gt;

&lt;p&gt;We had a threaded service that called &lt;code&gt;require&lt;/code&gt; on a certain directory at runtime. The service was expected to run on only one thread at a time.&lt;/p&gt;

&lt;p&gt;My mental model of this was:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;require&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;</summary>
    <content type="html">&lt;p&gt;Another short post with a surprising thing to learn about Ruby.&lt;/p&gt;

&lt;p&gt;We had a threaded service that called &lt;code&gt;require&lt;/code&gt; on a certain directory at runtime. The service was expected to run on only one thread at a time.&lt;/p&gt;

&lt;p&gt;My mental model of this was:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;require&lt;/code&gt; is basically a no-op once a file is already loaded&lt;/li&gt;
&lt;li&gt;so it doesn&amp;rsquo;t cost much to run &lt;code&gt;require&lt;/code&gt; each time the threaded service executes.&lt;/li&gt;
&lt;/ul&gt;
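
&lt;p&gt;In the single-threaded case, at least, that model is easy to check: &lt;code&gt;require&lt;/code&gt; returns &lt;code&gt;true&lt;/code&gt; when it actually loads a file and &lt;code&gt;false&lt;/code&gt; when the file was already loaded. A minimal sketch, using a throwaway temp file whose name and contents are made up for illustration:&lt;/p&gt;

&lt;pre&gt;&lt;code class="ruby"&gt;require &amp;quot;tmpdir&amp;quot;

# Write a tiny file to a temp directory, then require it twice.
# The file name and the constant are hypothetical, just for this check.
results = Dir.mktmpdir do |dir|
  path = File.join(dir, &amp;quot;lazy_feature.rb&amp;quot;)
  File.write(path, &amp;quot;LAZY_FEATURE_LOADED = true\n&amp;quot;)

  first = require(path)  # loads and evaluates the file
  second = require(path) # already in $LOADED_FEATURES, so it is skipped
  [first, second]
end

p results # prints [true, false]
&lt;/code&gt;&lt;/pre&gt;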

&lt;p&gt;One day, due to some weirdness in a testing environment, this threaded service ran on multiple threads simultaneously.&lt;/p&gt;

&lt;p&gt;Our Sidekiq worker came to a halt.&lt;/p&gt;

&lt;p&gt;My colleague &lt;a href="https://dmytro.sh"&gt;Dmytro&lt;/a&gt; found out that, if you try to &lt;code&gt;require&lt;/code&gt; the same directory from multiple threads simultaneously, the process can deadlock.&lt;/p&gt;

&lt;h2 id="reproducing"&gt;Reproducing&lt;/h2&gt;

&lt;p&gt;A simple reproduction looks like this:&lt;/p&gt;

&lt;pre&gt;&lt;code class="ruby"&gt;# alpha.rb
puts &amp;quot;[Thread A] Inside alpha.rb, sleeping before requiring beta...&amp;quot;
sleep 1
require_relative &amp;#39;beta&amp;#39;
puts &amp;quot;[Thread A] alpha.rb finished.&amp;quot;

# beta.rb
puts &amp;quot;[Thread B] Inside beta.rb, sleeping before requiring alpha...&amp;quot;
sleep 1
require_relative &amp;#39;alpha&amp;#39;
puts &amp;quot;[Thread B] beta.rb finished.&amp;quot;

# experiment.rb
puts &amp;quot;--- Starting Deadlock Reproduction ---&amp;quot;

t1 = Thread.new do
  puts &amp;quot;Thread 1: Requiring &amp;#39;alpha&amp;#39;...&amp;quot;
  require_relative &amp;#39;alpha&amp;#39;
end

t2 = Thread.new do
  puts &amp;quot;Thread 2: Requiring &amp;#39;beta&amp;#39;...&amp;quot;
  require_relative &amp;#39;beta&amp;#39;
end

[t1, t2].each(&amp;amp;:join)
puts &amp;quot;This will never print because of the deadlock.&amp;quot;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This prints the following when I test it:&lt;/p&gt;

&lt;pre&gt;&lt;code class="bash"&gt;--- Starting Deadlock Reproduction ---
Thread 1: Requiring &amp;#39;alpha&amp;#39;...
[Thread A] Inside alpha.rb, sleeping before requiring beta...
Thread 2: Requiring &amp;#39;beta&amp;#39;...
[Thread B] Inside beta.rb, sleeping before requiring alpha...
experiment.rb:13:in &amp;#39;Thread#join&amp;#39;: No live threads left. Deadlock? (fatal)
3 threads, 3 sleeps current:0x0000000aa933b400 main thread:0x00000001030b86e0
* #&amp;lt;Thread:0x00000001023e8d18 sleep_forever&amp;gt;
   rb_thread_t:0x00000001030b86e0 native:0x0000000200c23080 int:0
   experiment.rb:13:in &amp;#39;Thread#join&amp;#39;
   experiment.rb:13:in &amp;#39;Array#each&amp;#39;
   experiment.rb:13:in &amp;#39;&amp;lt;main&amp;gt;&amp;#39;
* #&amp;lt;Thread:0x0000000121df67c8 experiment.rb:3 sleep_forever&amp;gt;
   rb_thread_t:0x0000000aa933b200 native:0x000000016dcbf000 int:0 mutex:3 cond:1
    depended by: tb_thread_id:0x00000001030b86e0
   /Users/eli/scratch/require-deadlock/alpha.rb:3:in &amp;#39;Kernel#require_relative&amp;#39;
   /Users/eli/scratch/require-deadlock/alpha.rb:3:in &amp;#39;&amp;lt;top (required)&amp;gt;&amp;#39;
   experiment.rb:5:in &amp;#39;Kernel#require_relative&amp;#39;
   experiment.rb:5:in &amp;#39;block in &amp;lt;main&amp;gt;&amp;#39;
* #&amp;lt;Thread:0x0000000121df66b0 experiment.rb:8 sleep_forever&amp;gt;
   rb_thread_t:0x0000000aa933b400 native:0x000000016ddcb000 int:0
   /Users/eli/scratch/require-deadlock/beta.rb:3:in &amp;#39;Kernel#require_relative&amp;#39;
   /Users/eli/scratch/require-deadlock/beta.rb:3:in &amp;#39;&amp;lt;top (required)&amp;gt;&amp;#39;
   experiment.rb:10:in &amp;#39;Kernel#require_relative&amp;#39;
   experiment.rb:10:in &amp;#39;block in &amp;lt;main&amp;gt;&amp;#39;

    from experiment.rb:13:in &amp;#39;Array#each&amp;#39;
    from experiment.rb:13:in &amp;#39;&amp;lt;main&amp;gt;&amp;#39;
&lt;/code&gt;&lt;/pre&gt;

&lt;h2 id="why-does-this-happen"&gt;Why does this happen&lt;/h2&gt;

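&lt;p&gt;Well, short answer: there is a lock inside &lt;code&gt;require&lt;/code&gt;, and threads can deadlock when each one is in the middle of loading a file that the other is waiting on (as with &lt;code&gt;alpha.rb&lt;/code&gt; and &lt;code&gt;beta.rb&lt;/code&gt; above).&lt;/p&gt;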

&lt;p&gt;Long answer — I have not had time to look into the Ruby internals around this, but the Ruby Hacking Guide does describe some of the mechanisms.&lt;/p&gt;

&lt;p&gt;From &lt;a href="https://ruby-hacking-guide.github.io/load.html"&gt;Chapter 18, Loading&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The problem comes after. Like the comment says “the loading of Ruby programs is serialised”. In other words, a file can only be loaded from one thread, and if during the loading another thread tries to load the same file, that thread will wait for the first loading to be finished [&amp;hellip;]&lt;/p&gt;

&lt;p&gt;The process to enter the waiting state is simple. A &lt;code&gt;st_table&lt;/code&gt; is created in &lt;code&gt;loading_tbl&lt;/code&gt;, the association “&lt;code&gt;feature=&amp;gt;&lt;/code&gt;waiting thread” is recorded in it. &lt;code&gt;curr_thread&lt;/code&gt; is in &lt;code&gt;eval.c&lt;/code&gt;’s functions, its value is the current running thread.&lt;/p&gt;

&lt;p&gt;The mechanism to enter the waiting state is very simple. A &lt;code&gt;st_table&lt;/code&gt; is created in the &lt;code&gt;loading_tbl&lt;/code&gt; global variable, and a “&lt;code&gt;feature&lt;/code&gt;=&amp;gt;&lt;code&gt;loading thread&lt;/code&gt;” association is created. &lt;code&gt;curr_thread&lt;/code&gt; is a variable from &lt;code&gt;eval.c&lt;/code&gt;, and its value is the currently running thread. That makes an exclusive lock. And in &lt;code&gt;rb_feature_p()&lt;/code&gt;, we wait for the loading thread to end like the following.&lt;/p&gt;

&lt;p&gt;When &lt;code&gt;rb_thread_schedule()&lt;/code&gt; is called, the control is transferred to an other thread, and this function only returns after the control returned back to the thread where it was called. When the file name disappears from &lt;code&gt;loading_tbl&lt;/code&gt;, the loading is finished so the function can end. The &lt;code&gt;curr_thread&lt;/code&gt; check is not to lock itself (figure 1).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;(I have not checked if this is still 100% accurate; the Ruby Hacking Guide is old.)&lt;/p&gt;

&lt;h2 id="fix"&gt;Fix&lt;/h2&gt;

&lt;p&gt;We changed our code to avoid dynamic &lt;code&gt;require&lt;/code&gt; calls at runtime. Instead we loaded the necessary code at start time.&lt;/p&gt;

&lt;p&gt;(It is code that does not need to be loaded in all scenarios, so that&amp;rsquo;s why it was not loaded by default in the first place. It is pointless to load code when it isn&amp;rsquo;t needed; but then the hard part is to reliably know when it is needed.)&lt;/p&gt;
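
&lt;p&gt;For illustration only (this is not our actual code; the helper name and the &lt;code&gt;optional_features&lt;/code&gt; directory are hypothetical), loading everything at start time might look roughly like this, run once from the main thread before any worker threads exist:&lt;/p&gt;

&lt;pre&gt;&lt;code class="ruby"&gt;# Hypothetical helper: require every .rb file in a directory, in a stable
# order. Returns the result of each require call (true if newly loaded).
def eager_require_all(dir)
  Dir.glob(File.join(dir, &amp;quot;*.rb&amp;quot;)).sort.map { |path| require(path) }
end

# Assumed location of the optional code, relative to the working directory.
# An empty or missing directory simply loads nothing.
OPTIONAL_FEATURES_DIR = File.expand_path(&amp;quot;optional_features&amp;quot;)

eager_require_all(OPTIONAL_FEATURES_DIR)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The trade-off is the one mentioned above: the code gets loaded even in scenarios that never use it.&lt;/p&gt;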

&lt;p&gt;Anyway, the point is, I just had no idea that &lt;code&gt;require&lt;/code&gt; had an underlying lock. It&amp;rsquo;s hard to plan around things you don&amp;rsquo;t know exist.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Raise isn't a Ruby reserved word</title>
    <link rel="alternate" href="/tech/2026/02/raise-not-a-reserved-word.html"/>
    <id>/tech/2026/02/raise-not-a-reserved-word.html</id>
    <published>2026-02-07T11:48:00+00:00</published>
    <updated>2026-02-07T11:48:00+00:00</updated>
    <author>
      <name>Eli Thorkelson</name>
    </author>
    <summary type="html">&lt;p&gt;I was surprised to read this in the &lt;a href="https://ruby-hacking-guide.github.io/spec.html"&gt;Ruby Hacking Guide&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In Ruby exceptions come in the form of the function style method &lt;code&gt;raise&lt;/code&gt;.
&lt;code&gt;raise&lt;/code&gt; is not a reserved word.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Wait, what?&lt;/p&gt;

&lt;p&gt;… &lt;code&gt;raise&lt;/code&gt; is not a reserved word in Ruby?&lt;/p&gt;

&lt;p&gt;It sure seems like part of the core...&lt;/p&gt;</summary>
    <content type="html">&lt;p&gt;I was surprised to read this in the &lt;a href="https://ruby-hacking-guide.github.io/spec.html"&gt;Ruby Hacking Guide&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In Ruby exceptions come in the form of the function style method &lt;code&gt;raise&lt;/code&gt;.
&lt;code&gt;raise&lt;/code&gt; is not a reserved word.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Wait, what?&lt;/p&gt;

&lt;p&gt;&amp;hellip; &lt;code&gt;raise&lt;/code&gt; is not a reserved word in Ruby?&lt;/p&gt;

&lt;p&gt;It sure seems like part of the core structure of the language&amp;hellip;&lt;/p&gt;

&lt;p&gt;Let&amp;rsquo;s check the &lt;a href="https://docs.ruby-lang.org/en/4.0/syntax/keywords_rdoc.html"&gt;official list of Ruby keywords&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The following keywords start with R:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;redo&lt;/code&gt; - Restarts execution in the current block. See control expressions&lt;/p&gt;

&lt;p&gt;&lt;code&gt;rescue&lt;/code&gt; - Starts an exception section of code in a begin block. See exception handling&lt;/p&gt;

&lt;p&gt;&lt;code&gt;retry&lt;/code&gt; - Retries an exception block. See exception handling&lt;/p&gt;

&lt;p&gt;&lt;code&gt;return&lt;/code&gt; - Exits a method. See methods. If met in top-level scope, immediately stops interpretation of the current file.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is so weird. &lt;code&gt;raise&lt;/code&gt; isn&amp;rsquo;t there.&lt;/p&gt;

&lt;p&gt;Let&amp;rsquo;s test:&lt;/p&gt;

&lt;pre&gt;&lt;code class="ruby"&gt;irb(main):001* def raise(beep)
irb(main):002*   puts &amp;quot;boop&amp;quot;
irb(main):003&amp;gt; end
=&amp;gt; :raise
irb(main):004&amp;gt; boop
boop
boop
boop
               rbboop
boop           ri                      █
boop           raise                   █
boop           rand
               raiboope
boop           raise
boop
irb(main):004&amp;gt; raise 2
boop
=&amp;gt; nil
irb(main):005&amp;gt; boop
boop
boop
boop
               rbboop
boop           ri                      █
boop           raise                   █
boop           rand
               raiboope
boop           raise
boop
irb(main):005&amp;gt; raise 6
boop
=&amp;gt; nil
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;So yes, &lt;code&gt;raise&lt;/code&gt; is just a method, but redefine it at your own risk, because my simple definition here gets us deeply into &lt;a href="https://en.wikipedia.org/wiki/Zalgo_text"&gt;Zalgo text behavior&lt;/a&gt;. It looks like IRB uses &lt;code&gt;raise&lt;/code&gt; internally as part of its autocomplete functionality. It&amp;rsquo;s fun to see the side effect happening constantly while entering text.&lt;/p&gt;

&lt;p&gt;If we redefine &lt;code&gt;raise&lt;/code&gt; &lt;em&gt;without any arguments&lt;/em&gt;, then the effect doesn&amp;rsquo;t occur. Presumably IRB&amp;rsquo;s internal calls pass arguments, so they fail with an &lt;code&gt;ArgumentError&lt;/code&gt; before entering the method body.&lt;/p&gt;

&lt;pre&gt;&lt;code class="ruby"&gt;irb(main):001* def raise
irb(main):002*   puts &amp;quot;zalgo&amp;quot;
irb(main):003&amp;gt; end
=&amp;gt; :raise
irb(main):004&amp;gt; puts &amp;quot;test&amp;quot;
test
=&amp;gt; nil
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;OK, what else can you define here?&lt;/p&gt;

&lt;pre&gt;&lt;code class="ruby"&gt;irb(main):001* def rescue
irb(main):002*   puts &amp;quot;help!&amp;quot;
irb(main):003&amp;gt; end
=&amp;gt; :rescue
irb(main):004&amp;gt; begin; raise &amp;quot;problem&amp;quot;; rescue =&amp;gt; e; puts e.inspect; end
#&amp;lt;RuntimeError: problem&amp;gt;
=&amp;gt; nil
irb(main):005&amp;gt; send :rescue
help!
=&amp;gt; nil
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;So &lt;code&gt;rescue&lt;/code&gt; is a valid method name, but, reasonably enough, the &lt;code&gt;rescue&lt;/code&gt; keyword takes precedence over it if you do define it.&lt;/p&gt;

&lt;p&gt;The same is true for other keywords - you can define them as methods if you want, but then you can&amp;rsquo;t call them normally.&lt;/p&gt;

&lt;pre&gt;&lt;code class="ruby"&gt;irb(main):034* def while(condition)
irb(main):035*   loop do
irb(main):037*     break unless condition.call
irb(main):036*     yield
irb(main):038*   end
irb(main):039&amp;gt; end
=&amp;gt; :while
irb(main):040* send :while, -&amp;gt; { Kernel.rand &amp;lt; 0.5 } do
irb(main):041*   print &amp;quot;.&amp;quot;
irb(main):042&amp;gt; end
# This prints a random number of dots and then halts.
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Congratulations, we just reimplemented &lt;code&gt;while&lt;/code&gt; without using &lt;code&gt;while&lt;/code&gt;. Admittedly, it is useless, unintuitive, and dangerous if you do this in real application code. But in any case, there is no particular rule against using a reserved word for a method name.&lt;/p&gt;
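
&lt;p&gt;One nuance, as a small sketch: a method named after a keyword can still be called with an explicit receiver, because whatever follows a dot is parsed as a method name (this is why &lt;code&gt;self.class&lt;/code&gt; works). The &lt;code&gt;Demo&lt;/code&gt; class here is made up for illustration.&lt;/p&gt;

```ruby
# Demo is a hypothetical class, used only to show the parsing behavior.
class Demo
  # "if" is a keyword, but it is accepted as a method name after def.
  def if(condition)
    condition ? "yes" : "no"
  end
end

demo = Demo.new

# With an explicit receiver, the name after the dot is treated as a method call.
puts demo.if(true)          # yes

# send also works, as in the examples above.
puts demo.send(:if, false)  # no
```

&lt;p&gt;It is only the bare, receiverless call that the keyword shadows.&lt;/p&gt;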

&lt;p&gt;&lt;a href="https://docs.ruby-lang.org/en/4.0/syntax/methods_rdoc.html"&gt;Per the docs&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Method names may be one of the operators or must start a letter or a character with the eighth bit set. It may contain letters, numbers, an _ (underscore or low line) or a character with the eighth bit set. The convention is to use underscores to separate words in a multiword method name:&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I&amp;rsquo;m imagining that in the Ruby parser implementation, whatever you type after &lt;code&gt;def&lt;/code&gt; is parsed only as a Ruby method name and not interpreted as a possible keyword.&lt;/p&gt;

&lt;p&gt;Meanwhile, &lt;a href="https://docs.ruby-lang.org/en/4.0/Kernel.html#method-i-raise"&gt;raise&lt;/a&gt; is implemented as a method on the &lt;code&gt;Kernel&lt;/code&gt; module that is included by &lt;code&gt;Object&lt;/code&gt;, so that&amp;rsquo;s why it is both ubiquitous and stupid to redefine.&lt;/p&gt;

&lt;pre&gt;&lt;code class="c"&gt;// ruby/eval.c

rb_define_global_function(&amp;quot;raise&amp;quot;, f_raise, -1);

// then f_raise delegates to rb_f_raise

static VALUE
f_raise(int c, VALUE *v, VALUE _)
{
    return rb_f_raise(c, v);
}

// which is implemented like this:
VALUE
rb_f_raise(int argc, VALUE *argv)
{
    VALUE cause = Qundef;
    argc = extract_raise_options(argc, argv, &amp;amp;cause);

    VALUE exception;

    // Bare re-raise case:
    if (argc == 0) {
        // Cause was extracted, but no arguments were provided:
        if (!UNDEF_P(cause)) {
            rb_raise(rb_eArgError, &amp;quot;only cause is given with no arguments&amp;quot;);
        }

        // Otherwise, re-raise the current exception:
        exception = get_errinfo();
        if (!NIL_P(exception)) {
            argc = 1;
            argv = &amp;amp;exception;
        }
    }

    rb_raise_jump(rb_make_exception(argc, argv), cause);

    UNREACHABLE_RETURN(Qnil);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;(It looks like it then uses &lt;code&gt;rb_raise_jump&lt;/code&gt; to do the call stack mechanics of exception handling, since all we really see here is some argument parsing, but this isn&amp;rsquo;t the place to dig deeper into how it works.)&lt;/p&gt;
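
&lt;p&gt;From the Ruby side, you can confirm where &lt;code&gt;raise&lt;/code&gt; lives by asking for its owner. This is just a quick check, and it assumes a fresh session where nothing has redefined &lt;code&gt;raise&lt;/code&gt; yet.&lt;/p&gt;

```ruby
# Ask the top-level object which module defines its raise method.
puts method(:raise).owner.inspect            # Kernel

# It is a private method on Kernel, which is why it is called without a receiver.
puts Kernel.private_method_defined?(:raise)  # true

# Object includes Kernel, so nearly every object inherits it.
puts Object.include?(Kernel)                 # true
```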

&lt;p&gt;I do wonder why &lt;code&gt;raise&lt;/code&gt; isn&amp;rsquo;t just a reserved word in the first place. I guess there must be a scenario where you would want to redefine it for some metaprogramming project, but I can&amp;rsquo;t think of a very sane use case.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>The risks of looking too closely at things</title>
    <link rel="alternate" href="/tech/2025/11/looking-too-closely.html"/>
    <id>/tech/2025/11/looking-too-closely.html</id>
    <published>2025-11-06T23:26:00+00:00</published>
    <updated>2025-11-06T23:26:00+00:00</updated>
    <author>
      <name>Eli Thorkelson</name>
    </author>
    <summary type="html">&lt;p&gt;Sometimes it seems like, the closer you look at a running system, the more problems you find with it.&lt;/p&gt;

&lt;p&gt;It can feel like the very &lt;em&gt;act of looking&lt;/em&gt; creates weird anomalies that weren’t there if you didn’t look.&lt;/p&gt;

&lt;p&gt;Of course, they were there already. You...&lt;/p&gt;</summary>
    <content type="html">&lt;p&gt;Sometimes it seems like, the closer you look at a running system, the more problems you find with it.&lt;/p&gt;

&lt;p&gt;It can feel like the very &lt;em&gt;act of looking&lt;/em&gt; creates weird anomalies that weren&amp;rsquo;t there if you didn&amp;rsquo;t look.&lt;/p&gt;

&lt;p&gt;Of course, they were there already. You just weren&amp;rsquo;t noticing them.&lt;/p&gt;

&lt;p&gt;This seems to be true across many domains of life.&lt;/p&gt;

&lt;h2 id="anomalies-in-a-human-body"&gt;Anomalies in a human body&lt;/h2&gt;

&lt;p&gt;Indulge me in a personal story. I often have funny results when I get my bloodwork done at my primary care clinic. It&amp;rsquo;s common in my family to have a slightly low platelet count, and other randomness somewhat outside the norms.&lt;/p&gt;

&lt;p&gt;One year, my primary care provider got worried about my lab results and sent me to a hematologist.&lt;/p&gt;

&lt;p&gt;The hematologist was intrigued by my case. Specialists love mysteries. He ran a lot of blood tests. He started to speculate about numerous obscure syndromes that I could hypothetically have had. He found nothing with the easy tests, so he moved on to the complicated, obscure tests. Some of them involved PCR gene sequencing. Before I knew it, he had ordered more than $30,000 of medical tests.&lt;/p&gt;

&lt;p&gt;He never found any actual disease or any other problem that needed treatment. He sent me home with instructions to take more Vitamin B12 and come back again to be re-tested later.&lt;/p&gt;

&lt;p&gt;What I have taken from this is: If you look closely enough at something as complicated as a human body, you will usually find anomalies. And these anomalies don&amp;rsquo;t necessarily represent symptoms or illnesses. They might just&amp;hellip; exist.&lt;/p&gt;

&lt;p&gt;Eventually, doctors started to use the word &amp;ldquo;idiopathic&amp;rdquo; to describe my case. &lt;a href="https://en.wikipedia.org/wiki/Idiopathic_disease"&gt;Idiopathic&lt;/a&gt; just means &amp;ldquo;we don&amp;rsquo;t know why it happens, but no need to look into it further.&amp;rdquo;&lt;/p&gt;

&lt;p&gt;Medical analysis, which is all encoded in software these days, typically relies on population-level benchmarks for anomaly detection, like &amp;ldquo;platelet count below 120 is not normal.&amp;rdquo; But in fact, the same reporting thresholds don&amp;rsquo;t always produce useful data for everyone. Context-free rules frequently produce false positives.&lt;/p&gt;

&lt;p&gt;Beware of letting specialists go down rabbit holes that they can bill you for.&lt;/p&gt;

&lt;h2 id="anomalies-in-a-building"&gt;Anomalies in a building&lt;/h2&gt;

&lt;p&gt;The same effect applies in other areas too. Enough about bodies. Let&amp;rsquo;s talk about houses.&lt;/p&gt;

&lt;p&gt;We live in a relatively new house; it is about ten years old. If I look at it from a distance, it seems to be in pretty good shape. When people visit, they often say it looks good.&lt;/p&gt;

&lt;p&gt;But if I look at it more closely, I start to see lots of issues.&lt;/p&gt;

&lt;p&gt;The paint is peeling on a drawer. There is a smudge on the doorframe where our kids frequently touch. There is a small crack between a concrete surface and a brick surface. Little vines are climbing up the foundation. The drainage pipe in the yard looks clogged by leaves. Dust accumulates on top of the light fixtures where you can&amp;rsquo;t see it. The caulk in the bathtub needs fixing. A post is starting to get rotted out where it gets wet. The windows need washing, even where it&amp;rsquo;s very hard to reach them.&lt;/p&gt;

&lt;p&gt;Some of these are easy maintenance issues, once they are noticed. I go around the foundation a few times a year and remove the vines.&lt;/p&gt;

&lt;p&gt;Others, I have no idea how to fix, since I am not a carpenter. The drainage pipe is a conundrum.&lt;/p&gt;

&lt;p&gt;Other issues I don&amp;rsquo;t even know how to &lt;em&gt;detect&lt;/em&gt; by myself, even if I try. That rotted post was detected by a professional; we missed it. Even though it was potentially a structural problem.&lt;/p&gt;

&lt;p&gt;When I was younger, I naively imagined that houses mostly just &lt;em&gt;keep existing&lt;/em&gt; once they were built, as long as the roof was intact to keep the rain out. Now I realize that all structures are constantly changing in their environment, and they always require attention. Permanently.&lt;/p&gt;

&lt;p&gt;Nevertheless, some of the issues - like the smudged doorframe and the dusty light fixtures - are very, very minor. You can drive yourself crazy if you try to fix them all.&lt;/p&gt;

&lt;p&gt;As with the human body, the issues show up as soon as you look for them. But when you aren&amp;rsquo;t looking, they seem to vanish again. Some anomalies are safe to ignore; others not so much.&lt;/p&gt;

&lt;h2 id="anomalies-in-software-systems"&gt;Anomalies in software systems&lt;/h2&gt;

&lt;p&gt;Now back to software.&lt;/p&gt;

&lt;p&gt;There&amp;rsquo;s a kind of iterative cycle in observing our systems at work. It goes like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Due to a problem or crisis, we start looking closely at our logs and stats.&lt;/li&gt;
&lt;li&gt;While looking at the logs, numerous other issues jump out at us. These issues were usually there for a while, but no one noticed.&lt;/li&gt;
&lt;li&gt;Some of the new issues turn out to be minor or just noise. Others require fixing.&lt;/li&gt;
&lt;li&gt;We create some new alerts or metrics to filter for whatever signal we want to keep track of next time.&lt;/li&gt;
&lt;li&gt;We stop looking at the raw logs.&lt;/li&gt;
&lt;li&gt;New issues gradually appear in the logs, but no one notices them if they don&amp;rsquo;t hit the alert thresholds.&lt;/li&gt;
&lt;li&gt;The whole cycle repeats.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;One time during an incident call, we noticed an absolute flood of errors in our web request logs that had been happening for a long while. It turns out that some load balancer healthchecks were failing in a loop. It never caused any real problem, but it spammed us with errors.&lt;/p&gt;

&lt;p&gt;Alert rules are &lt;em&gt;not&lt;/em&gt; a solution to this general problem. In audio processing terms, one can see an alert rule as a &lt;a href="https://en.wikipedia.org/wiki/Band-pass_filter"&gt;bandpass filter against a signal&lt;/a&gt;. Unfortunately, all filters, and thus all alert rules, have signal/noise ratios, and thus they will pass noise. Especially for simple threshold-based alerts, there is a tradeoff between not enough filtering (then you are spamming yourself with too much noise) and too much filtering (then you are going to overlook real issues until they become incidents). Setting alert thresholds is informed guesswork (the polite word is &amp;ldquo;an empirical problem&amp;rdquo;).&lt;/p&gt;
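
&lt;p&gt;Here is a toy sketch of that tradeoff. The error counts are invented, and &lt;code&gt;alerting_minutes&lt;/code&gt; is a hypothetical helper, not part of any real alerting tool.&lt;/p&gt;

```ruby
# Hypothetical errors per minute: one sharp spike (40), then a sustained
# but modest elevation (12 to 15). The numbers are invented.
samples = [2, 3, 40, 2, 4, 3, 12, 14, 13, 15, 2]

# Return the indexes of the minutes that would fire a simple threshold alert.
def alerting_minutes(samples, threshold)
  samples.each_index.select { |i| samples[i] > threshold }
end

# Low threshold: every single minute fires, which is pure noise.
puts alerting_minutes(samples, 1).length    # 11

# Middle threshold: catches the spike and the sustained elevation.
puts alerting_minutes(samples, 10).inspect  # [2, 6, 7, 8, 9]

# High threshold: catches the spike, misses the sustained elevation.
puts alerting_minutes(samples, 30).inspect  # [2]
```

&lt;p&gt;Whether the sustained elevation is the real problem or just noise is exactly the part the threshold can&amp;rsquo;t decide for you.&lt;/p&gt;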

&lt;p&gt;In fact, it requires expertise to even figure out if something is signal or noise. It&amp;rsquo;s not an easy problem. It requires context and often demands a judgment call about each case: &lt;em&gt;Does this matter?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;My current bandaid for this problem is to randomly look at the raw data sometimes, just to see what I notice. Take random walks through your data. Randomly inspect your house to see what you see.&lt;/p&gt;

&lt;p&gt;There&amp;rsquo;s no real substitute for looking at a log stream and trying to make sense of it. I almost always find something interesting. I also look every so often at dashboards that &lt;em&gt;don&amp;rsquo;t&lt;/em&gt; alert and check for any interesting trends in the graphs (see also &lt;a href="/tech/2025/10/reading-diagrams.html"&gt;reading diagrams&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;It&amp;rsquo;s probably getting more viable to ask an LLM to watch a log stream and check for anomalies. That would be a huge help if it can be done at scale. But we do have large log streams and there can be policy concerns, so I&amp;rsquo;m not sure if we&amp;rsquo;re ready for this yet. Also, we don&amp;rsquo;t know how reliable LLM analysis is, at this point.&lt;/p&gt;

&lt;p&gt;Anyway, what I take from this is: The act of looking closely at something can be dangerous. It can lead you to a lot of new places. Some of them are useless. But you have to figure out how to do it, because otherwise, you only find out about the issues when they do become incidents. It&amp;rsquo;s better to stay ahead of the risks, if you can.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Reading graphs and diagrams</title>
    <link rel="alternate" href="/tech/2025/10/reading-diagrams.html"/>
    <id>/tech/2025/10/reading-diagrams.html</id>
    <published>2025-10-11T11:06:00+00:00</published>
    <updated>2025-10-11T11:06:00+00:00</updated>
    <author>
      <name>Eli Thorkelson</name>
    </author>
    <summary type="html">&lt;p&gt;I’ve been looking at a lot of operational dashboards lately.&lt;/p&gt;

&lt;p&gt;The question presents itself: How do you read anything meaningful from a graph?&lt;/p&gt;

&lt;h2 id="training-the-eye"&gt;Training the eye&lt;/h2&gt;

&lt;p&gt;The human eye has exquisite pattern recognition capabilities.&lt;/p&gt;

&lt;p&gt;And yet by default they...&lt;/p&gt;</summary>
    <content type="html">&lt;p&gt;I&amp;rsquo;ve been looking at a lot of operational dashboards lately.&lt;/p&gt;

&lt;p&gt;The question presents itself: How do you read anything meaningful from a graph?&lt;/p&gt;

&lt;h2 id="training-the-eye"&gt;Training the eye&lt;/h2&gt;

&lt;p&gt;The human eye has exquisite pattern recognition capabilities.&lt;/p&gt;

&lt;p&gt;And yet by default they are not always enough. They might have to be trained.&lt;/p&gt;

&lt;p&gt;When I was 17 years old, still in high school, I got interested in astronomy. I lived in a college town where you could take some classes at the local university, so first I took the undergraduate astronomy survey course, and then enrolled in an independent study with a physics professor. She was probably 65 years old and she was happy, I think, that some kid was so interested in her field.&lt;/p&gt;

&lt;p&gt;So we worked through some chapters of an undergraduate textbook, R.C. Bless&amp;rsquo; &lt;em&gt;Discovering the Cosmos&lt;/em&gt; (1996), and covered various topics in stellar evolution. I remember looking at a photo of some stellar phenomenon in the textbook and being asked: &amp;ldquo;What do you see here?&amp;rdquo;&lt;/p&gt;

&lt;p&gt;&amp;ldquo;I don&amp;rsquo;t see anything,&amp;rdquo; I said.&lt;/p&gt;

&lt;p&gt;And my professor pointed out some little dots on the image and explained what they meant.&lt;/p&gt;

&lt;p&gt;She could see something I just&amp;hellip; couldn&amp;rsquo;t.&lt;/p&gt;

&lt;p&gt;Even though it was a simple diagram that we were both looking at.&lt;/p&gt;

&lt;p&gt;You have to train the eye to see diagrams in a given domain.&lt;/p&gt;

&lt;h2 id="diagrams-tell-stories"&gt;Diagrams tell stories&lt;/h2&gt;

&lt;p&gt;Enough about astronomy; let&amp;rsquo;s talk about web services.&lt;/p&gt;

&lt;p&gt;Suppose you have a graph over time of 5xx (internal error) responses from a set of web services in different environments.&lt;/p&gt;

&lt;p&gt;Someone is asking you, &amp;ldquo;Do we currently have an ongoing incident? Or is the incident over?&amp;rdquo;&lt;/p&gt;

&lt;p&gt;You find some data. Let&amp;rsquo;s say it looks like this:&lt;/p&gt;

&lt;p&gt;&lt;img src="/images/tech/sample-5xx.png" alt="pretend graph of 5xx error responses"&gt;&lt;/p&gt;

&lt;p&gt;You can tell a story about this image that&amp;rsquo;s something like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&amp;ldquo;There was a period of heightened errors on our production system earlier, lasting for a few hours. It is now done. There is also a more recent error spike in the training environment, but it is very low priority because the training environment is not a critical system.&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But if you had a different context, another story you could tell from this would be:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&amp;ldquo;There is always a nonzero error rate for this system, which is normal. There are a few relative blips in the error rate, but they are all very tiny compared to the 10 million successful requests per hour that this service handles.&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You have to know what counts as &amp;ldquo;baseline&amp;rdquo; to tell a story from a diagram.&lt;/p&gt;

&lt;p&gt;Yet another story you could tell about this diagram would be like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&amp;ldquo;This is a graph based on application server data, with telemetry emitted after each request is complete. However, upstream of this system is a load balancer, which has recently started showing much bigger error spikes up to 15,000 errors per hour. These bursts don&amp;rsquo;t appear at all on the current graph, because these requests are overloading the application server and breaking the instrumentation system.&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This brings us to a second important point.&lt;/p&gt;

&lt;h2 id="all-diagrams-lie"&gt;All diagrams lie&lt;/h2&gt;

&lt;p&gt;You can&amp;rsquo;t trust diagrams. You can learn from them, but you have to understand they are, by design, partial and limited.&lt;/p&gt;

&lt;p&gt;Here are a few ways that diagrams can lie. (There are many others.)&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Diagrams are probably missing data. For instance, maybe there are cases that are invisible to the telemetry system you use. As we just suggested, if your telemetry comes from back-end application servers, it might miss errors that are only apparent upstream at the load balancer. Or if your application process crashes or gets halted by Kubernetes, it probably can&amp;rsquo;t report its last state before halting.&lt;/li&gt;
&lt;li&gt;Visualization requires aggregation functions. Aggregation functions are inevitably going to mislead you. Do you want to see the median, the mean, the 95th percentile, the 99th percentile? There is a particular limit case in my Prometheus dashboards where the aggregation query fails to count the first value in a new series. I think it&amp;rsquo;s because they are based on &lt;code&gt;increase&lt;/code&gt; and &lt;code&gt;increase&lt;/code&gt; doesn&amp;rsquo;t handle this nicely - my guess is that it tries to compute a difference between time A and time B, but the difference between &amp;ldquo;undefined&amp;rdquo; and &amp;ldquo;1&amp;rdquo; is represented as 0, not 1.&lt;/li&gt;
&lt;li&gt;Timeslices matter hugely. For instance, if you are counting events over time, you might have to count them over a given timeslice. And the graph is almost always going to look bumpier, noisier, when the timeslice is smaller. However, for performance reasons, you can&amp;rsquo;t use small timeslices over large intervals.&lt;/li&gt;
&lt;li&gt;Metrics may have a collection interval below which you can&amp;rsquo;t disambiguate. If you scrape Prometheus metrics data every 60 seconds, for instance, then you simply cannot see sub-minute spikes in the data series, no matter what you try. As with music, the sampling frequency matters.&lt;/li&gt;
&lt;/ol&gt;
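
&lt;p&gt;The timeslice and sampling points above can be sketched with a toy series. Everything here is invented for illustration: a flat per-second error count with one ten-second burst, observed either by point sampling every 60 seconds or by summing over buckets. It is not a model of how Prometheus itself computes anything.&lt;/p&gt;

```ruby
# Hypothetical per-second error counts over five minutes:
# flat at 1 per second, with a ten-second burst of 50 per second.
per_second = Array.new(300, 1)
(130...140).each { |t| per_second[t] = 50 }

# Point sampling: read the instantaneous value once every 60 seconds.
sampled = per_second.each_slice(60).map { |slice| slice.first }
puts sampled.inspect         # [1, 1, 1, 1, 1]

# The burst is in the raw data, but no sample landed on it.
puts per_second.max          # 50

# Bucketed sums: the same data looks bumpy at 60 seconds...
per_minute = per_second.each_slice(60).map { |slice| slice.sum }
puts per_minute.inspect      # [60, 60, 550, 60, 60]

# ...and collapses into a single number at 300 seconds.
puts per_second.sum          # 790
```

&lt;p&gt;Same underlying events, three different pictures, depending only on how you choose to look.&lt;/p&gt;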

&lt;h2 id="nonlinear-behavior-is-common"&gt;Nonlinear behavior is common&lt;/h2&gt;

&lt;p&gt;Diagrams can lend themselves to the illusion of linearity. You can often identify linear-looking traffic growth over time, or even determine visually that traffic appears to be in a steady state.&lt;/p&gt;

&lt;p&gt;At some point, the illusion (or assumption) of mostly linear behavior is going to break down. Databases will hit resource limits and performance will plummet. Small fluctuations in a noisy baseline graph will suddenly resolve into huge spikes in an outage.&lt;/p&gt;

&lt;p&gt;You have to train yourself &lt;em&gt;not to expect linear behavior&lt;/em&gt; in complex systems, because they are fundamentally nonlinear, and will hit limits suddenly.&lt;/p&gt;

&lt;p&gt;You have to learn, gradually, how to include adequate &amp;ldquo;safety factors&amp;rdquo; in your assessment. &amp;ldquo;My system has used 80% of its available resources, does this mean we can keep waiting until we use 95% of resources? Or should we increase resources now?&amp;rdquo; It depends on your forecast of possible linear &lt;em&gt;or nonlinear&lt;/em&gt; growth in the future.&lt;/p&gt;
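
&lt;p&gt;As a back-of-the-envelope sketch, here is how much the answer to that question can depend on the growth model. The growth rates are invented, and real systems rarely follow either curve cleanly.&lt;/p&gt;

```ruby
# All numbers here are invented for illustration.
current_usage = 80.0  # percent of capacity used today
limit = 95.0          # percent at which we expect trouble

# Linear forecast: usage grows by 2 percentage points per week.
linear_weeks = (limit - current_usage) / 2.0

# Compounding forecast: usage grows by 10% of its current value per week.
compound_weeks = Math.log(limit / current_usage) / Math.log(1.10)

puts linear_weeks             # 7.5
puts compound_weeks.round(1)  # 1.8
```

&lt;p&gt;The same 15 points of headroom is nearly two months under one assumption and under two weeks under the other.&lt;/p&gt;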

&lt;p&gt;This implies one last point.&lt;/p&gt;

&lt;h2 id="your-understanding-of-diagrams-reflects-your-understanding-of-the-system"&gt;Your understanding of diagrams reflects your understanding of the system&lt;/h2&gt;

&lt;p&gt;If you are a visually inclined person, then graphs and diagrams can be powerful tools for building understanding of a system.&lt;/p&gt;

&lt;p&gt;However, &lt;em&gt;at the same time&lt;/em&gt;, your baseline understanding of a system &lt;em&gt;feeds into&lt;/em&gt; what you can read out of a diagram based on that system.&lt;/p&gt;

&lt;p&gt;Ideally, these two constraints are mutually reinforcing instead of being mutually disabling.&lt;/p&gt;

&lt;p&gt;For example, if you always deploy new code every Tuesday morning, then you might know intuitively that a latency spike on Tuesday mornings is related to the deployment event, and not otherwise concerning.&lt;/p&gt;

&lt;p&gt;However, if you just see a latency spike on the most recent Tuesday, and you don&amp;rsquo;t have any other context, then you might spend some time chasing down various nonsense theories and speculations before you eventually figure out that the same spike happens every Tuesday.&lt;/p&gt;

&lt;p&gt;(Here is a version of this that I&amp;rsquo;ve lived through: Q: &amp;ldquo;Why is there a huge latency spike towards the end of the month? What&amp;rsquo;s broken?&amp;rdquo; A: &amp;ldquo;Oh wait&amp;hellip; it&amp;rsquo;s the end of the fiscal quarter and usage is always heightened at this time.&amp;rdquo;)&lt;/p&gt;

&lt;p&gt;You have to let yourself &lt;em&gt;learn&lt;/em&gt; from graphs and diagrams (we should never throw away any source of useful knowledge), while also recalling that graphs and diagrams often lie, and almost always omit important context.&lt;/p&gt;

&lt;p&gt;As with so many other things, experience isn&amp;rsquo;t everything, but it counts for a lot.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>On incident fatigue</title>
    <link rel="alternate" href="/tech/2025/08/incident-fatigue.html"/>
    <id>/tech/2025/08/incident-fatigue.html</id>
    <published>2025-08-02T17:22:00+00:00</published>
    <updated>2025-08-02T17:22:00+00:00</updated>
    <author>
      <name>Eli Thorkelson</name>
    </author>
    <summary type="html">&lt;p&gt;Once upon a time, in a galaxy far away, we had a week where things just kept breaking.&lt;/p&gt;

&lt;p&gt;Some of the broken things were in production environments. Some were in engineering environments.&lt;/p&gt;

&lt;p&gt;Some of them were incidents. Some of them were merely… issues...&lt;/p&gt;</summary>
    <content type="html">&lt;p&gt;Once upon a time, in a galaxy far away, we had a week where things just kept breaking.&lt;/p&gt;

&lt;p&gt;Some of the broken things were in production environments. Some were in engineering environments.&lt;/p&gt;

&lt;p&gt;Some of them were incidents. Some of them were merely&amp;hellip; issues.&lt;/p&gt;

&lt;aside class="blog-aside"&gt;

&lt;p class="aside-title"&gt;The difference between an incident and an issue&lt;/p&gt;

&lt;p&gt;The difference between an incident and an issue is this:&lt;/p&gt;

&lt;p&gt;If a tree falls in a forest and no one hears it, it&amp;rsquo;s an issue.&lt;/p&gt;

&lt;p&gt;If someone does hear it, then it&amp;rsquo;s an incident.&lt;/p&gt;

&lt;p&gt;&lt;i&gt;Meta: This simple distinction is not always clear in practice.&lt;/i&gt;&lt;/p&gt;

&lt;/aside&gt;

&lt;p&gt;The broken things were roughly like the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Request timeouts being reached in critical flows, where we didn&amp;rsquo;t previously expect to see them being reached.&lt;/li&gt;
&lt;li&gt;Health checks failing because of seemingly unrelated configuration changes.&lt;/li&gt;
&lt;li&gt;Mysterious CPU spikes in certain database instances. (Well, they were mysterious before we knew the explanation, anyway.)&lt;/li&gt;
&lt;li&gt;Background jobs not running as scheduled.&lt;/li&gt;
&lt;li&gt;Networking problems caused by configuration problems caused ultimately by cross-team misalignments.&lt;/li&gt;
&lt;li&gt;A series of other awkward things that we can&amp;rsquo;t even get into here.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The details aren&amp;rsquo;t what matters. The point is that there were a &lt;em&gt;lot&lt;/em&gt; of technical issues in a short span of time.&lt;/p&gt;

&lt;p&gt;This was exacerbated by a series of other factors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Nontechnical stakeholders were clamoring for answers.&lt;/li&gt;
&lt;li&gt;Key people were out of the office.&lt;/li&gt;
&lt;li&gt;We were dealing with really complicated causal chains. The symptoms of things were very far removed from the causes. Every issue required analyzing a long, complex set of effects.&lt;/li&gt;
&lt;li&gt;There are several different logging systems, and not all requests are tracked in all of them.&lt;/li&gt;
&lt;li&gt;We faced jurisdictional ambiguities about which team owns what.&lt;/li&gt;
&lt;li&gt;We were spread out across so many different time zones.&lt;/li&gt;
&lt;li&gt;We have so many different Slack channels.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I started to think that our incident response strategies were fundamentally built to handle &lt;em&gt;one issue at a time.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;So: &lt;strong&gt;What do you do when there are more incidents than there are people on call?&lt;/strong&gt;&lt;/p&gt;

&lt;h2 id="what-we-did-improvisation"&gt;What we did: Improvisation&lt;/h2&gt;

&lt;p&gt;We did not explicitly have a plan for this situation. We expect that there will basically be one incident at a time. If only incidents were so well behaved.&lt;/p&gt;

&lt;p&gt;So we improvised. We entered a zone of undefined organizational behavior.&lt;/p&gt;

&lt;p&gt;It was a group of experienced software engineers, relatively committed to their work. We leaned on each other for support. People not on call got pulled in.&lt;/p&gt;

&lt;p&gt;Soon there were people losing sleep, staying up late out of concern, and just getting worn out.&lt;/p&gt;

&lt;p&gt;Later, things did start to calm down.&lt;/p&gt;

&lt;h2 id="what-would-i-do-next-time"&gt;What would I do, next time?&lt;/h2&gt;

&lt;p&gt;What can one do to make it better?&lt;/p&gt;

&lt;p&gt;I think it starts with recognizing that &lt;strong&gt;incident fatigue&lt;/strong&gt; is just its own thing, categorically distinct from other kinds of problems. It is a particular state one can, unfortunately, get into.&lt;/p&gt;

&lt;p&gt;Of course, tech business culture gives us some general strategies about what to do.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Strategy 1: Prioritization&lt;/em&gt;. When there are a lot of issues, rank them by urgency and handle the most critical ones first. (Problem: Too many issues are critical, and/or their criticality is hard to assess. Eventually, major issues are left unfixed.)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Strategy 2: Deflection&lt;/em&gt;. When there are a lot of issues, try to send them elsewhere, see who else can handle them. (Problem: Dealing with organizational jurisdiction issues can take as much time and effort as just fixing the issue yourself. And the latter actually delivers some practical value. Problem 2: Over time, too much deflection can erode that sense of commitment that we can call &amp;ldquo;ownership&amp;rdquo; of technical systems.)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Strategy 3: Self-care.&lt;/em&gt; A very American cultural theory, it seems to me - the idea that probably you should try to do something extra nice for yourself to make up for the extra stress. (Problem: It doesn&amp;rsquo;t really solve anything, it just distracts.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I&amp;rsquo;m not against any of these strategies, in general terms, but I&amp;rsquo;m not sure they help enough. I keep thinking that there&amp;rsquo;s more we could do.&lt;/p&gt;

&lt;p&gt;Maybe there could be some other division of labor, where we move beyond the &amp;ldquo;first responder&amp;rdquo; model, and try to solve issues more directly as a team.&lt;/p&gt;

&lt;p&gt;Maybe there could be some way to acknowledge the fatigue and get some extra rest.&lt;/p&gt;

&lt;p&gt;Maybe it would help to organize a retrospective, not for specific issues, but for the whole set of issues. A space to think about how we handled things overall.&lt;/p&gt;

&lt;p&gt;A little bit of collective recognition seems like it would help.&lt;/p&gt;

&lt;h2 id="coda-what-we-did-afterwards"&gt;Coda: What we did afterwards&lt;/h2&gt;

&lt;p&gt;We did end up having a retrospective session to discuss these issues, which was helpful. We were able to identify certain common root causes among our issues.&lt;/p&gt;

&lt;p&gt;I came away thinking: I would do more to call attention to incident fatigue sooner, if we ever get there again.&lt;/p&gt;

&lt;p&gt;I would do more to ensure people get enough rest and don&amp;rsquo;t overwork.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Naive optimizations of CI job allocations</title>
    <link rel="alternate" href="/tech/2025/07/job-allocation.html"/>
    <id>/tech/2025/07/job-allocation.html</id>
    <published>2025-07-20T11:42:00+00:00</published>
    <updated>2025-07-20T11:42:00+00:00</updated>
    <author>
      <name>Eli Thorkelson</name>
    </author>
    <summary type="html">&lt;p&gt;One time I had a job interview with a live-coding challenge. It went badly, because I was too anxious to do well. It asked a question about optimally allocating test jobs across CI workers. In essence, a bin packing problem.&lt;/p&gt;

&lt;aside class="blog-aside"&gt;

&lt;p class="aside-title"&gt;A note on live coding...&lt;/p&gt;&lt;/aside&gt;</summary>
    <content type="html">&lt;p&gt;One time I had a job interview with a live-coding challenge. It went badly, because I was too anxious to do well. It asked a question about optimally allocating test jobs across CI workers. In essence, a bin packing problem.&lt;/p&gt;

&lt;aside class="blog-aside"&gt;

&lt;p class="aside-title"&gt;A note on live coding in interviews&lt;/p&gt;

&lt;p&gt;It&amp;rsquo;s hard to get any valid data out of live coding challenges, because some people (like me that day) get too stressed out to do well in them. I would recommend that interviewers not do them (especially the &amp;ldquo;write me a new program from scratch in real time&amp;rdquo; variant). It&amp;rsquo;s totally unrepresentative of ordinary work contexts. The &amp;ldquo;homework&amp;rdquo; approach is better, or asking someone to do code review of a code sample - there are lots of less-anxiety-making options.&lt;/p&gt;

&lt;p&gt;For candidates, conversely, you kind of have to expect that there &lt;i&gt;could&lt;/i&gt; be live coding exercises. So it&amp;rsquo;s probably better to practice doing live coding &lt;i&gt;before&lt;/i&gt; the interview so you&amp;rsquo;re a little more ready. I don&amp;rsquo;t think I did that in advance&amp;hellip;&lt;/p&gt;

&lt;/aside&gt;

&lt;p&gt;OK, so, test job allocation.&lt;/p&gt;

&lt;p&gt;Suppose you have a set of test jobs and you want to allocate them to a set of workers for execution. What is the optimal way to do this?&lt;/p&gt;

&lt;p&gt;At moments like this I wish I had taken the CS class that covers queueing theory and bin packing problems. If you never took that class, there&amp;rsquo;s no good way to deduce it all from scratch in a random job interview.&lt;/p&gt;

&lt;p&gt;Well anyway, that interview went badly. It did not produce any working implementation, and I was obviously not hired. But I went home and wrote the code I wished I had been relaxed enough to write. It compares naive allocation algorithms with the following preconditions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We know the runtime duration of each test file at the start. (Obviously not true in practice in any significant CI system, but we can probably use &lt;em&gt;expected&lt;/em&gt; duration or &lt;em&gt;historical&lt;/em&gt; average duration as a proxy here.)&lt;/li&gt;
&lt;li&gt;The durations range from 1 to 250 seconds.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id="what-counts-as-optimal"&gt;What counts as optimal?&lt;/h2&gt;

&lt;p&gt;We want to allocate job load as evenly as possible among workers, such that each worker finishes at the same time. We don&amp;rsquo;t want to have worker 1 finish first and sit idle, while worker 2 keeps working twice as long.&lt;/p&gt;

&lt;p&gt;Since job durations vary randomly, there is no absolute guarantee that all workers can finish at the same moment. If you have two workers and two jobs, and job &lt;em&gt;J1&lt;/em&gt; takes 1 minute while &lt;em&gt;J2&lt;/em&gt; takes 3 minutes, then one worker is going to finish at least 2 minutes sooner than the other, no matter how you allocate them.&lt;/p&gt;

&lt;p&gt;But in principle, especially as the individual job durations approach zero relative to the total length of the queue, you can evenly divide the jobs among workers and they will all finish simultaneously. I like to think here of the analogy with dividing a large volume of water into multiple buckets: if we were dividing a continuous volume &lt;em&gt;V&lt;/em&gt; liters into &lt;em&gt;N&lt;/em&gt; buckets, then each bucket should ideally contain the exact same quantity of water, namely &lt;em&gt;V/N&lt;/em&gt; liters.&lt;/p&gt;

&lt;p&gt;Of course, jobs do have discrete individual durations, so the analogy with dividing a fluid is inexact. But it still gives us an excellent target to measure against. For any allocation of a finite set of jobs across a given set of workers, we can measure how closely our allocation approaches the ideal case of perfect division across a worker pool.&lt;/p&gt;

&lt;p&gt;Say that the total duration of a set of jobs is &lt;em&gt;D&lt;/em&gt; seconds and the number of workers is &lt;em&gt;N&lt;/em&gt;. Ideally, each worker should finish processing after &lt;em&gt;D/N&lt;/em&gt; seconds. Let&amp;rsquo;s call this the &lt;em&gt;optimal target duration&lt;/em&gt; for each worker, or &lt;em&gt;t(optimal)&lt;/em&gt; for short.&lt;/p&gt;

&lt;p&gt;You can then measure each worker&amp;rsquo;s variance from the optimal outcome (really a percent deviation, not variance in the statistical sense) with:&lt;/p&gt;

&lt;p&gt;100 * (&lt;em&gt;t(actual)&lt;/em&gt; - &lt;em&gt;t(optimal)&lt;/em&gt;) / &lt;em&gt;t(optimal)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;(Multiply by 100 to get an output in percent, so you can say that a particular worker finished 1% slower than optimal, or 30% faster, etc.)&lt;/p&gt;
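
&lt;p&gt;As a minimal sketch of that formula (the method name &lt;code&gt;percent_deviation&lt;/code&gt; and the &lt;code&gt;worker_loads&lt;/code&gt; array are just illustrative names, not part of the experiment code), here is the two-job example from above, with durations in seconds:&lt;/p&gt;

&lt;pre&gt;&lt;code class="ruby"&gt;
# Percent deviation of each worker's total duration from the ideal D/N split.
# worker_loads is a hypothetical array of per-worker total durations, in seconds.
def percent_deviation(worker_loads)
  t_optimal = worker_loads.sum.to_f / worker_loads.size
  worker_loads.map { |t_actual| (100 * (t_actual - t_optimal) / t_optimal).round(2) }
end

# J1 takes 60 seconds on worker 1, J2 takes 180 seconds on worker 2
puts percent_deviation([60, 180]).inspect # prints [-50.0, 50.0]
&lt;/code&gt;&lt;/pre&gt;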

&lt;h2 id="some-naive-allocation-algorithms"&gt;Some naive allocation algorithms&lt;/h2&gt;

&lt;p&gt;Let&amp;rsquo;s suppose that before allocating jobs, we sort the input job set into a list ordered by historical job duration, so we can take the largest jobs from one end of the list and the smallest jobs from the other.&lt;/p&gt;

&lt;p&gt;We want to explore algorithms for sorting jobs into &lt;em&gt;N&lt;/em&gt; worker queues. For simplicity, let&amp;rsquo;s stipulate that &lt;em&gt;N&lt;/em&gt; must be a power of 2.&lt;/p&gt;

&lt;p&gt;I tested some simple allocation algorithms:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Allocate from ends&lt;/strong&gt;: Go through each worker in round-robin fashion. Push the longest available test job onto each queue. If there are any more test jobs, also push the shortest available test job onto each queue. My inspiration here was about trying to balance out the slowest and fastest jobs across our available workers (so each queue gets part of the short and long tails).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Allocate to one worker until full&lt;/strong&gt;: Compute &lt;em&gt;t(optimal)&lt;/em&gt; for a given set of workers. Allocate jobs to the first worker until its total duration reaches or exceeds &lt;em&gt;t(optimal)&lt;/em&gt;. Then allocate jobs to the second worker until it reaches or exceeds &lt;em&gt;t(optimal)&lt;/em&gt;, and so on for &lt;em&gt;N&lt;/em&gt; workers. (I believe this is fairly similar to the &lt;a href="https://en.wikipedia.org/wiki/Next-fit_bin_packing"&gt;next fit bin packing algorithm&lt;/a&gt;.)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Recursive partitioning&lt;/strong&gt;: Create two buckets of jobs, &lt;em&gt;B1&lt;/em&gt; and &lt;em&gt;B2&lt;/em&gt;. Iterate through all the jobs, assigning the next job to whichever bucket is currently emptier. (This handles the case where you put a large job into &lt;em&gt;B1&lt;/em&gt; and then you put a number of smaller jobs into &lt;em&gt;B2&lt;/em&gt; until they converge.) Then recursively repeat the process, dividing each bucket into 2 more buckets via the same process. Stop when you have allocated jobs into &lt;em&gt;N&lt;/em&gt; buckets, equivalent to the available number of workers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Rebalanced recursive allocation&lt;/strong&gt;: Follow the process for recursive partitioning, but before recursing, also try to &lt;em&gt;rebalance&lt;/em&gt; the buckets by moving an item from the fuller bucket to the emptier one, in such a way as to bring the total durations of the two buckets closer to each other. Suppose that bucket &lt;em&gt;B1&lt;/em&gt; has duration 150 and bucket &lt;em&gt;B2&lt;/em&gt; has duration 130. We can rebalance them by looking for a job in &lt;em&gt;B1&lt;/em&gt; that has duration (150-130)/2 = 10 and moving it to &lt;em&gt;B2&lt;/em&gt;, which leaves both buckets at 140. If we can&amp;rsquo;t find a job whose duration is exactly 10, we can just use the closest approximation we can find. We would not move anything if there are no items we can move that improve the balance between buckets.&lt;/li&gt;
&lt;/ol&gt;
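
&lt;p&gt;The core step of the recursive partitioning strategy can be sketched in a few lines. This is a simplified illustration rather than the experiment code: it works on bare durations instead of test files, it takes the largest job first at every level (an assumption of this sketch), and the names &lt;code&gt;split_in_two&lt;/code&gt; and &lt;code&gt;partition&lt;/code&gt; are made up for the example.&lt;/p&gt;

&lt;pre&gt;&lt;code class="ruby"&gt;
# Split a list of job durations into two buckets,
# always adding the next (largest remaining) job to the emptier bucket.
def split_in_two(durations)
  left = []
  right = []
  durations.sort.reverse.each do |duration|
    (left.sum > right.sum ? right : left).push(duration)
  end
  [left, right]
end

# Recursively split until we have 2**levels buckets.
def partition(durations, levels)
  left, right = split_in_two(durations)
  return [left, right] if levels == 1
  partition(left, levels - 1) + partition(right, levels - 1)
end

left, right = split_in_two([90, 60, 50, 40, 30, 10])
puts "#{left.inspect} vs #{right.inspect}" # prints [90, 40, 10] vs [60, 50, 30]
puts partition([90, 60, 50, 40, 30, 10], 2).inspect
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In this small example the first split happens to balance perfectly (140 vs 140), but the second level cannot, because a single job of duration 90 already exceeds the per-worker target of 70.&lt;/p&gt;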

&lt;p&gt;Intuitively, I expect the recursive algorithms to work better than the round-robin or next-fit approaches.&lt;/p&gt;

&lt;h2 id="measuring-the-results"&gt;Measuring the results&lt;/h2&gt;

&lt;p&gt;As the total duration of all jobs &lt;em&gt;t(total)&lt;/em&gt; grows very large compared to the maximum duration of any individual job &lt;em&gt;t(individual_max)&lt;/em&gt;, we should be able to converge on solutions that approach the optimal (continuous) target &lt;em&gt;t(optimal)&lt;/em&gt; mentioned above.&lt;/p&gt;

&lt;p&gt;I wrote a Ruby script to generate random input sets and test each allocation algorithm against them at various sizes. The results were as follows:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Small scale: 4 workers, 40 jobs
Ran 150 rounds of allocate_from_ends with 40 jobs in 4 workers
      =&amp;gt; Duration 3.1ms
      =&amp;gt; Mean variance from optimal allocation: 0.912%
      =&amp;gt; Max variance from optimal allocation: 4.831%
Ran 150 rounds of allocate_until_full with 40 jobs in 4 workers
      =&amp;gt; Duration 3.5ms
      =&amp;gt; Mean variance from optimal allocation: 2.237%
      =&amp;gt; Max variance from optimal allocation: 7.23%
Ran 150 rounds of recursive_allocation with 40 jobs in 4 workers
      =&amp;gt; Duration 4.1ms
      =&amp;gt; Mean variance from optimal allocation: 4.999%
      =&amp;gt; Max variance from optimal allocation: 9.178%
Ran 150 rounds of optimized_recursive_allocation with 40 jobs in 4 workers
      =&amp;gt; Duration 7.8ms
      =&amp;gt; Mean variance from optimal allocation: 0.568%
      =&amp;gt; Max variance from optimal allocation: 2.807%
===================
Medium scale: 16 workers, 250 jobs
Ran 150 rounds of allocate_from_ends with 250 jobs in 16 workers
      =&amp;gt; Duration 17.2ms
      =&amp;gt; Mean variance from optimal allocation: 3.921%
      =&amp;gt; Max variance from optimal allocation: 11.217%
Ran 150 rounds of allocate_until_full with 250 jobs in 16 workers
      =&amp;gt; Duration 17.8ms
      =&amp;gt; Mean variance from optimal allocation: 1.274%
      =&amp;gt; Max variance from optimal allocation: 4.754%
Ran 150 rounds of recursive_allocation with 250 jobs in 16 workers
      =&amp;gt; Duration 23.4ms
      =&amp;gt; Mean variance from optimal allocation: 3.223%
      =&amp;gt; Max variance from optimal allocation: 5.441%
Ran 150 rounds of optimized_recursive_allocation with 250 jobs in 16 workers
      =&amp;gt; Duration 39.3ms
      =&amp;gt; Mean variance from optimal allocation: 0.194%
      =&amp;gt; Max variance from optimal allocation: 1.078%
===================
Large scale: 32 workers, 20000 jobs
Ran 30 rounds of allocate_from_ends with 20000 jobs in 32 workers
      =&amp;gt; Duration 229.9ms
      =&amp;gt; Mean variance from optimal allocation: 0.16%
      =&amp;gt; Max variance from optimal allocation: 0.175%
Ran 30 rounds of allocate_until_full with 20000 jobs in 32 workers
      =&amp;gt; Duration 263.5ms
      =&amp;gt; Mean variance from optimal allocation: 0.006%
      =&amp;gt; Max variance from optimal allocation: 0.048%
Ran 30 rounds of recursive_allocation with 20000 jobs in 32 workers
      =&amp;gt; Duration 342.9ms
      =&amp;gt; Mean variance from optimal allocation: 0.039%
      =&amp;gt; Max variance from optimal allocation: 0.056%
Ran 30 rounds of optimized_recursive_allocation with 20000 jobs in 32 workers
      =&amp;gt; Duration 350.6ms
      =&amp;gt; Mean variance from optimal allocation: 0.0%
      =&amp;gt; Max variance from optimal allocation: 0.001%
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;TLDR:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;As expected, smaller input sizes correspond overall with higher variance from the optimal outcome.&lt;/li&gt;
&lt;li&gt;The worst observed performance in any single experiment was 11.2% deviation from optimal, in the medium scale test of the &amp;ldquo;allocate from ends&amp;rdquo; strategy. Mean deviation for that strategy at that scale was lower, at 3.9%.&lt;/li&gt;
&lt;li&gt;At large volumes, all these algorithms have well under 1% error. So our expectation that we would converge on the optimal (continuous) division case at large scale seems to be borne out.&lt;/li&gt;
&lt;li&gt;The optimized recursive strategy was the most successful across the board. Given a large input size of 20,000 jobs, it had &lt;em&gt;at worst&lt;/em&gt; 0.001% deviation from the optimal solution over 30 experimental trials.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id="takeaways"&gt;Takeaways&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;I would never try to implement an optimized allocation algorithm at work without having some quiet time to do research first. I kind of dislike that someone made me try to do it as a live-coding exercise.&lt;/li&gt;
&lt;li&gt;Bin packing theory seems like an interesting thing to look into further. Probably a good thing to catch up on, given that I never got a CS degree.&lt;/li&gt;
&lt;li&gt;In practice, at work, we just delegate this problem (CI job allocation) to a SaaS solution (Knapsack Pro). We don&amp;rsquo;t try to roll our own solutions here.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id="coda"&gt;Coda&lt;/h2&gt;

&lt;p&gt;The experimental code was as follows.&lt;/p&gt;

&lt;pre&gt;&lt;code class="ruby"&gt;
############################
# Experiment - a class that tests a given allocation strategy for a given number of rounds,
# using a specified number of buckets and a specified volume of randomly generated tests
# After the experiment is done, it outputs its results to stdout.
############################
class Experiment
  def initialize(strategy, rounds:, buckets:, tests:)
    total_variance = 0
    max_variance = 0
    start = Time.now

    rounds.times do
      test_mean, test_max = run_test(strategy, buckets, sample(tests))
      total_variance += test_mean
      max_variance = test_max if test_max &amp;gt; max_variance
    end
    duration = (Time.now - start)*1000 # ms
    puts &amp;quot;Ran #{rounds} rounds of #{strategy} with #{tests} tests in #{buckets} buckets&amp;quot;
    puts &amp;quot;      =&amp;gt; Duration #{duration.round(1)}ms&amp;quot;
    puts &amp;quot;      =&amp;gt; Mean variance from optimal allocation: #{(total_variance/rounds).round(3)}%&amp;quot;
    puts &amp;quot;      =&amp;gt; Max variance from optimal allocation: #{max_variance.round(3)}%&amp;quot;
  end

  def run_test(strategy, n, test_files)
    sorted_tests = test_files.to_a.sort_by(&amp;amp;:last)

    # optimally, we should have (sum(total_test_time)/n) in each bucket
    total_time = sorted_tests.map(&amp;amp;:last).inject(&amp;amp;:+)
    target = total_time.to_f / n

    buckets = (1..n).map { Bucket.new(target) }
    full_buckets = 0

    puts &amp;quot;Allocating #{test_files.count} tests into #{n} buckets, optimal bucket size: #{target}&amp;quot; if ENV[&amp;#39;VERBOSE&amp;#39;]

    buckets = Allocator.new.send strategy, sorted_tests, buckets, target

    # Report results
    puts buckets.map(&amp;amp;:status) if ENV[&amp;#39;VERBOSE&amp;#39;]

    # Compute the mean and max absolute variance across all buckets
    mean_variance = buckets.map {|b| b.variance.abs}.sum / buckets.size
    max_variance = buckets.map {|b| b.variance.abs}.max
    puts &amp;quot;  ... mean variance #{mean_variance}, max variance #{max_variance}&amp;quot; if ENV[&amp;#39;VERBOSE&amp;#39;]
    # Return mean variance
    [mean_variance, max_variance]
  end

  # Generates a random test data set
  def sample(size)
    (1..size).map { |i| [&amp;quot;test-#{i}.rb&amp;quot;.to_sym, Integer(rand * 250) + 1]}.to_h
  end
end

############################
# Allocator - a class implementing several possible algorithms for allocating input
#   items to buckets.
# The inputs to these methods are called `sorted_tests` because the original spec was for
#   each item to represent a test file with an integer duration in seconds, and we expect
#   the Experiment class to start out by sorting the inputs before invoking these methods.
############################
class Allocator
  # Allocate sorted tests into buckets, pulling from the ends of the array at each iteration
  def allocate_from_ends(sorted_tests, buckets, target)
    current_bucket = 0
    while sorted_tests.count &amp;gt; 0
      current = buckets[current_bucket % buckets.size]
      current.add sorted_tests.pop
      current.add(sorted_tests.shift) if sorted_tests.size &amp;gt; 0 # we might run out of tests
      current_bucket += 1
    end
    buckets
  end

  # Allocate sorted tests into buckets, removing buckets from rotation once they reach capacity
  def allocate_until_full(sorted_tests, buckets, target)
    current_bucket = 0
    full_buckets = 0
    while sorted_tests.count &amp;gt; 0 &amp;amp;&amp;amp; full_buckets &amp;lt; buckets.count
      current = buckets[current_bucket % buckets.size]
      if current.total &amp;gt; target # don&amp;#39;t add to already full buckets
        if !current.full
          current.full = true
          full_buckets += 1

          puts &amp;quot; ... just filled up a bucket: #{current.status}&amp;quot; if ENV[&amp;#39;VERBOSE&amp;#39;]
        end
      else
        current.add sorted_tests.pop
      end
      current_bucket += 1
    end
    buckets
  end

  # Repeatedly divide the list into two as-close-to-equal-halves as we can
  def recursive_allocation(sorted_tests, buckets, target)
    recursion_levels = Math.log(buckets.count, 2)
    raise &amp;quot;Must pass number of buckets that are a power of 2&amp;quot; unless recursion_levels % 1 == 0

    # Don&amp;#39;t reuse the buckets we were given, we&amp;#39;ll regenerate them in the recursion process
    buckets.clear
    buckets.concat recursive_list_division(sorted_tests, recursion_levels.to_i, target, optimize: false).flatten
  end

  # Same as recursive_allocation, but also make an effort to balance out imbalances between left and right halves
  def optimized_recursive_allocation(sorted_tests, buckets, target)
    recursion_levels = Math.log(buckets.count, 2)
    raise &amp;quot;Must pass number of buckets that are a power of 2&amp;quot; unless recursion_levels % 1 == 0

    # Don&amp;#39;t reuse the buckets we were given, we&amp;#39;ll regenerate them in the recursion process
    buckets.clear
    buckets.concat recursive_list_division(sorted_tests, recursion_levels.to_i, target, optimize: true).flatten
  end

  # Actually does the work for the recursive strategy
  def recursive_list_division(items, levels_remaining, target, optimize: false)
    left = Bucket.new(target)
    right = Bucket.new(target)

    # Allocate the next item to whichever list is currently emptier
    while items.size &amp;gt; 0
      if left.total &amp;lt;= right.total
        left.add items.pop
      else
        right.add items.pop
      end
    end
    puts &amp;quot;Balanced two lists: #{left.total} vs #{right.total}&amp;quot; if ENV[&amp;#39;VERBOSE&amp;#39;]

    # Try to balance the two lists if optimization is enabled
    if optimize
      25.times do
        left, right = recursive_list_balance(left, right)
      end
    end

    if levels_remaining &amp;gt; 1
      [ recursive_list_division(left.items, levels_remaining - 1, target, optimize: optimize),
        recursive_list_division(right.items, levels_remaining - 1, target, optimize: optimize)]
    else
      [left, right]
    end
  end

  # Attempts to balance out gaps between two unequal lists
  def recursive_list_balance(left, right)
    # Compute the gap between left and right -- that&amp;#39;s what we hope to improve
    difference = (left.total - right.total).abs
    # Don&amp;#39;t bother doing anything if the two lists happen to be identical
    return [left, right] if difference == 0

    # Now search for an element in the larger list
    # that is about half the size of the difference between lists:
    larger = left.total &amp;gt; right.total ? left.items : right.items
    candidate = nil
    for i in (0..larger.size - 1) do
      if larger[i][1] &amp;gt; difference / 2
        candidate = i
        break
      end
    end
    return [left, right] if candidate.nil?

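    # Either the candidate item or the item preceding it should be the best item to move:
    # pick whichever is closer to half the gap (this assumes the list is sorted ascending)
    half_gap = difference / 2.0
    if candidate &amp;gt; 0 &amp;amp;&amp;amp; ((larger[candidate][1] - half_gap).abs &amp;gt; (larger[candidate - 1][1] - half_gap).abs)
      candidate -= 1
    end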

    # Let&amp;#39;s only do the swap if it actually improves the balance between buckets
    if larger[candidate][1] &amp;lt; difference
      puts &amp;quot;Balancing lists: moving item of size #{larger[candidate][1]} to balance a gap of size #{difference}&amp;quot; if ENV[&amp;#39;VERBOSE&amp;#39;]
      if left.total &amp;gt; right.total
        right.add left.remove(candidate)
      else
        left.add right.remove(candidate)
      end
    else
      puts &amp;quot;Best candidate was #{larger[candidate][1]} to balance a gap of size #{difference}, skipping&amp;quot; if ENV[&amp;#39;VERBOSE&amp;#39;]
    end
    [left, right]
  end
end

############################
# Bucket - a utility class that contains a bucket of items and calculates their total size.
#  Initialize it with a target size, and then it can calculate the
#    difference between its actual total and the optimal target result size.
#    Let&amp;#39;s call this difference the `variance`.
#  It implements two list operations, `add` and `remove`.
#  It has a flag called `full` that allocation algorithms might use
#    to keep track of buckets that have exceeded their optimal capacity.
############################
class Bucket
  attr_accessor :total
  attr_accessor :full
  attr_accessor :target
  attr_accessor :items

  def initialize(target=0)
    @total = 0
    @items = []
    @target = target
    @full = false
  end

  def add(item)
    @total += item[1]
    @items &amp;lt;&amp;lt; item
  end

  def remove(item_index)
    raise &amp;quot;can&amp;#39;t remove invalid item with index #{item_index}, total items #{@items.size}&amp;quot; if @items[item_index].nil?
    item = @items.delete_at(item_index)
    @total -= item[1]
    item
  end

  def status
    &amp;quot;#{variance}% variance, #{@items.size} items, #{@total} total, #{Integer(@total - @target)} over target&amp;quot;
  end

  # how far off we are from the target bucket size, in percent
  def variance
    (((@total - @target).to_f / @target) * 100).round(4)
  end
end



############################
# Imperative script code:
############################

# little wrapper function to kick off each set of experiments
def run_experiments(label, rounds:, buckets:, tests:)
  puts &amp;quot;===================&amp;quot;
  puts &amp;quot;#{label}: #{buckets} buckets, #{tests} tests&amp;quot;
  Experiment.new :allocate_from_ends, rounds: rounds, buckets: buckets, tests: tests
  Experiment.new :allocate_until_full, rounds: rounds, buckets: buckets, tests: tests
  Experiment.new :recursive_allocation, rounds: rounds, buckets: buckets, tests: tests
  Experiment.new :optimized_recursive_allocation, rounds: rounds, buckets: buckets, tests: tests
end

run_experiments &amp;#39;Two buckets&amp;#39;, rounds: 150, buckets: 2, tests: 40
run_experiments &amp;#39;Small scale&amp;#39;, rounds: 150, buckets: 4, tests: 40
run_experiments &amp;#39;Medium scale&amp;#39;, rounds: 150, buckets: 16, tests: 250
run_experiments &amp;#39;Large scale&amp;#39;, rounds: 30, buckets: 32, tests: 20000
# The large scale one is run for fewer rounds because it gets slow
&lt;/code&gt;&lt;/pre&gt;
</content>
  </entry>
  <entry>
    <title>Message in a bottle (in the comments)</title>
    <link rel="alternate" href="/tech/2025/04/bottle.html"/>
    <id>/tech/2025/04/bottle.html</id>
    <published>2025-04-24T20:12:00+00:00</published>
    <updated>2025-04-24T20:12:00+00:00</updated>
    <author>
      <name>Eli Thorkelson</name>
    </author>
    <summary type="html">&lt;p&gt;I had one of those trivial chores today — adding a new queue to our &lt;a href="https://sidekiq.org"&gt;Sidekiq&lt;/a&gt; worker configuration.&lt;/p&gt;

&lt;p&gt;It seems like one of those unthinking, mechanical tasks. Just find the list that defines the queues, add a new item, save, commit, send for code review...&lt;/p&gt;</summary>
    <content type="html">&lt;p&gt;I had one of those trivial chores today — adding a new queue to our &lt;a href="https://sidekiq.org"&gt;Sidekiq&lt;/a&gt; worker configuration.&lt;/p&gt;

&lt;p&gt;It seems like one of those unthinking, mechanical tasks. Just find the list that defines the queues, add a new item, save, commit, send for code review.&lt;/p&gt;

&lt;p&gt;I had already committed the code change, but not yet sent it for review, when I noticed a code comment.&lt;/p&gt;

&lt;p&gt;Among other things, I was changing some &lt;a href="https://github.com/kubernetes-sigs/kustomize/blob/master/examples/chart.md"&gt;helm kustomization files&lt;/a&gt; that pertain to Sidekiq setup. The file looked something like this (not a real code sample, just to give you the idea):&lt;/p&gt;

&lt;pre&gt;&lt;code class="yaml"&gt;sidekiq:
  deployment:
    keda:
      enable: true
  deployments:
    default:
      config:
        queues:
          # This list must be kept in sync with the autoscaling config below
          - fast
          - slow
          - medium
      spec:
        maxReplicaCount: 3
        triggers:
          - type: prometheus
            metadata:
              serverAddress: http://&amp;lt;some-prometheus-cluster&amp;gt;:9090
              threshold: &amp;quot;150&amp;quot;
              query: &amp;#39;sidekiq_queue_size{queue=~&amp;quot;(fast|slow|medium)&amp;quot;}&amp;#39;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The comment says: &lt;code&gt;This list must be kept in sync with the autoscaling config below&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The idea is, if you change which queues are handled by a pool of Sidekiq workers, you want to edit the autoscaling configuration at the same time. Otherwise, you are going to see weird system behavior when your new queue gets a long backlog and autoscaling doesn&amp;rsquo;t notice&amp;hellip;&lt;/p&gt;

&lt;p&gt;Nothing forces you to keep the two parts of the configuration in sync, except your own reading skills.&lt;/p&gt;

&lt;p&gt;And I almost missed it. Which would have been embarrassing.&lt;/p&gt;

&lt;p&gt;But here&amp;rsquo;s the funny thing: &lt;em&gt;I wrote that comment&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;It was sometime last year, when I last worked on this area of our system.&lt;/p&gt;

&lt;p&gt;I&amp;rsquo;m sure I must have thought to myself: &lt;em&gt;It would be so easy to edit the queue configuration without updating the KEDA configuration. I&amp;rsquo;ll just leave a comment to point out this footgun.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Nine months later — the comment worked perfectly, and saved me from myself.&lt;/p&gt;

&lt;p&gt;You can&amp;rsquo;t rely on anyone reading the comments (not even yourself, apparently). But still.&lt;/p&gt;
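
&lt;p&gt;If one wanted a guard that does not depend on anyone reading the comments, a small check could compare the queue list with the queues named in the autoscaling query. The sketch below is hypothetical: &lt;code&gt;queues_in_sync?&lt;/code&gt; is an invented helper, and it assumes the query uses the same &lt;code&gt;queue=~"(a|b|c)"&lt;/code&gt; form as the sample above.&lt;/p&gt;

&lt;pre&gt;&lt;code class="ruby"&gt;
# Hypothetical check: do the configured queues match the queues
# named in the autoscaling query's queue=~"(...)" matcher?
def queues_in_sync?(queues, query)
  match = query.match(/queue=~"\(([^)]*)\)"/)
  return false if match.nil? # no recognizable queue matcher in the query
  match[1].split("|").sort == queues.sort
end

query = 'sidekiq_queue_size{queue=~"(fast|slow|medium)"}'
puts queues_in_sync?(%w[fast slow medium], query)      # prints true
puts queues_in_sync?(%w[fast slow medium bulk], query) # prints false
&lt;/code&gt;&lt;/pre&gt;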
</content>
  </entry>
  <entry>
    <title>Isolation vs coupling, or the problem with silos</title>
    <link rel="alternate" href="/tech/2025/04/separation-vs-connection.html"/>
    <id>/tech/2025/04/separation-vs-connection.html</id>
    <published>2025-04-13T10:18:00+00:00</published>
    <updated>2025-04-13T10:18:00+00:00</updated>
    <author>
      <name>Eli Thorkelson</name>
    </author>
    <summary type="html">&lt;p&gt;In large technical systems, there’s a pervasive tension between isolation and coupling. What I mean by this is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;There are architectural advantages to having strong walls between things.&lt;/li&gt;
&lt;li&gt;However, there are also architectural advantages to having tight...&lt;/li&gt;
&lt;/ul&gt;</summary>
    <content type="html">&lt;p&gt;In large technical systems, there&amp;rsquo;s a pervasive tension between isolation and coupling. What I mean by this is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;There are architectural advantages to having strong walls between things.&lt;/li&gt;
&lt;li&gt;However, there are also architectural advantages to having tight coupling between things.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These two things often push in opposite directions, making system design awkward, or at least delicate.&lt;/p&gt;

&lt;p&gt;Let&amp;rsquo;s consider a few obvious examples.&lt;/p&gt;

&lt;h2 id="tenant-silos-or-horizontal-boundaries"&gt;Tenant silos, or horizontal boundaries&lt;/h2&gt;

&lt;p&gt;Suppose we are designing a multi-tenanted system. We might need to go in one of two contrary directions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Put every tenant in its own technical silo, keep everything highly separate. Use separate databases, separate workers, separate infrastructure for each tenant.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;or:&lt;/em&gt; Build multi-tenanted systems that cross logical boundaries and work with data across silos. Share resources (which might reduce infrastructure costs).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Isolating things into silos can be good for security, for privacy, and possibly for operational flexibility. Maybe we want to allocate more resources for certain tenants. Maybe we want to define totally separate user accounts for every tenant.&lt;/p&gt;

&lt;p&gt;But isolation can also be inefficient and operationally frustrating — suppose we want to let all the tenants share certain resources, but now this is impossible.&lt;/p&gt;

&lt;p&gt;Or suppose we want to know &amp;ldquo;across all tenants, how many records of type X do we have?&amp;rdquo; It sucks when the answer is &amp;ldquo;sorry, we don&amp;rsquo;t even have tooling to find that out right now.&amp;rdquo;&lt;/p&gt;

&lt;p&gt;Or consider the costs of releasing updates - do you want deployment to be O(1) or O(N) for &lt;em&gt;N&lt;/em&gt; tenants? Do you want to run database migrations 500 times or just once?&lt;/p&gt;

&lt;p&gt;Or consider the problem of users with access to multiple tenants — do you want them to have to authenticate one time or &lt;em&gt;N&lt;/em&gt; times to get access to &lt;em&gt;N&lt;/em&gt; tenants?&lt;/p&gt;

&lt;p&gt;Of course, you can always try to hack your way through this problem space, and get some of the benefits of both approaches. You can build approach 1 (premised on isolation) and then add some hacks — ahem, I mean &lt;em&gt;extensions&lt;/em&gt; — that get you some of the benefits of approach 2 (premised on coupling). Or vice versa.&lt;/p&gt;

&lt;h2 id="web-stack-components-or-vertical-boundaries"&gt;Web stack components, or vertical boundaries&lt;/h2&gt;

&lt;p&gt;Here&amp;rsquo;s a second example. Suppose you have a web stack that is organized into different layers like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;External load balancer&lt;/li&gt;
&lt;li&gt;NGINX&lt;/li&gt;
&lt;li&gt;Ruby application server&lt;/li&gt;
&lt;li&gt;Rack middleware stack&lt;/li&gt;
&lt;li&gt;Ruby on Rails application&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Requests normally start at the first layer and, assuming everything is OK, then get passed down to the last layer. Your application layer then handles the request and sends a response back through all the other layers to the client.&lt;/p&gt;

&lt;p&gt;This scenario &lt;em&gt;also&lt;/em&gt; has a design space with a tradeoff between isolation and coupling, though not exactly the same kind as the horizontal tenant sharding case.&lt;/p&gt;

&lt;p&gt;Essentially: Everything is cleaner and simpler when different layers of this stack are indifferent to each other. Everything is swappable. There are no hard dependencies. You want to change load balancers? Go ahead, it doesn&amp;rsquo;t affect anything in the application server.&lt;/p&gt;

&lt;p&gt;But. When the architectural isolation between layers is too absolute, certain use cases become impossible.&lt;/p&gt;

&lt;p&gt;A use case we&amp;rsquo;ve looked at recently involves performance logging. Our default observability stack runs at the application level. It only starts once a request has already passed through all the other layers.&lt;/p&gt;

&lt;p&gt;This means it has blind spots.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If a request never makes it to the application, it is invisible.&lt;/li&gt;
&lt;li&gt;If the application hits a request timeout, the observability tools crash with it, and you don&amp;rsquo;t get logs of what happened.&lt;/li&gt;
&lt;li&gt;If you want to know how long a request took, you can only measure starting when the observability tools kick in. What if you want to measure the time starting at level 1 all the way down to level 5?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To fix these problems, you have to build observability tooling that crosses levels of the web stack.&lt;/p&gt;

&lt;p&gt;It&amp;rsquo;s great - now you have more information.&lt;/p&gt;

&lt;p&gt;But you also have more dependencies, more tight coupling, more complex (often ad hoc) contracts, and thus more brittleness.&lt;/p&gt;

&lt;p&gt;Suppose we started setting a custom request header at the load balancer, to measure total request time.&lt;/p&gt;

&lt;p&gt;Now if you switch to a new load balancer, you have to remember to migrate your custom header, or the custom observability tooling will break. You have a new constraint. You&amp;rsquo;ve moved away from total isolation towards greater coupling between components. At your peril.&lt;/p&gt;
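
&lt;p&gt;To make that contract concrete, here is a minimal sketch of the application side as a Rack middleware. The class name &lt;code&gt;RequestTimer&lt;/code&gt;, the request header &lt;code&gt;X-Request-Start&lt;/code&gt; with a value like &lt;code&gt;t=1712000000.250&lt;/code&gt; (epoch seconds), and the response header &lt;code&gt;x-total-time-ms&lt;/code&gt; are all assumptions for illustration, not our real setup.&lt;/p&gt;

&lt;pre&gt;&lt;code class="ruby"&gt;# Hypothetical Rack middleware: measures time since the load balancer stamped the request.
# Assumes the load balancer sets a header like "X-Request-Start: t=1712000000.250".
class RequestTimer
  HEADER = 'HTTP_X_REQUEST_START' # Rack exposes request headers with an HTTP_ prefix

  # The clock is injectable so the timing logic can be exercised without real time passing.
  def initialize(app, clock: proc { Time.now.to_f })
    @app = app
    @clock = clock
  end

  def call(env)
    started_at = parse_start(env[HEADER])
    status, headers, body = @app.call(env)
    # If the header is missing (say, after a load balancer migration),
    # skip the metric instead of failing the request.
    if started_at
      elapsed_ms = ((@clock.call - started_at) * 1000).round
      headers['x-total-time-ms'] = elapsed_ms.to_s
    end
    [status, headers, body]
  end

  private

  # Returns the start time as a Float, or nil when the header is absent or malformed.
  def parse_start(raw)
    return nil if raw.nil?

    Float(raw.delete_prefix('t='))
  rescue ArgumentError
    nil
  end
end
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Note that the sketch degrades quietly when the header disappears. That keeps requests working, but it also means the broken contract only shows up as missing data, which is its own kind of peril.&lt;/p&gt;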

&lt;p&gt;Every way you go, there are perils.&lt;/p&gt;

&lt;h2 id="discussion"&gt;Discussion&lt;/h2&gt;

&lt;p&gt;Needless to say, architectural purism is rarely optimal. You go in one direction and then tack back in the other. You want to satisfy competing constraints to the extent you can.&lt;/p&gt;

&lt;p&gt;But the fact that we often end up in some middle zone of this design space doesn&amp;rsquo;t mean that there aren&amp;rsquo;t competing pressures in opposite directions. In this sense, large technical systems rarely reach a stable equilibrium in their own design space. Rather, they exhibit a moving set of tradeoffs that can shift suddenly under your feet.&lt;/p&gt;

&lt;p&gt;This problem space is closely related to the better-known problem of &amp;ldquo;standardization vs nonstandardization (autonomy),&amp;rdquo; which also defines a major tradeoff space. But if you think closely about it, the two are orthogonal.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;                 autonomy
                    ^
                    |
isolation  &amp;lt;==           ==&amp;gt; coupling
                    |
                    v
              standardization
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;All your teams can align on a model of tenant silos - that&amp;rsquo;s standardizing on isolation. Or some of your teams can build global systems and some can build tenant systems - that&amp;rsquo;s more autonomous, but it&amp;rsquo;s orthogonal to whether systems are more or less connected to each other.&lt;/p&gt;

&lt;h2 id="coda-leaky-abstractions"&gt;Coda: Leaky abstractions&lt;/h2&gt;

&lt;p&gt;I keep thinking this is related to the problem of leaky abstractions, though I haven&amp;rsquo;t quite found the words to talk about this. Every time we design for &amp;ldquo;isolation,&amp;rdquo; we are always also designing interfaces for coupling across our silos, our layers. These interfaces somehow are never good enough, and then we are tempted to realign the architecture, or at least the interfaces, to plug the gaps in our system.&lt;/p&gt;

&lt;p&gt;In any case, I constantly see people building isolated &amp;ldquo;tenanted silos&amp;rdquo; and then, a few years later, going back towards tightly coupled &amp;ldquo;global services,&amp;rdquo; or vice versa. Like we&amp;rsquo;re all caught in some giant pendulum of unstable requirements.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>The pitfalls of things that seem easy</title>
    <link rel="alternate" href="/tech/2025/03/pitfalls.html"/>
    <id>/tech/2025/03/pitfalls.html</id>
    <published>2025-03-14T13:32:00+00:00</published>
    <updated>2025-03-14T13:32:00+00:00</updated>
    <author>
      <name>Eli Thorkelson</name>
    </author>
    <summary type="html">&lt;p&gt;Sometimes you want to do something easy and it does not go well.&lt;/p&gt;

&lt;h2 id="it-looked-like-an-easy-hike-an-allegory"&gt;It looked like an easy hike: An allegory&lt;/h2&gt;

&lt;p&gt;Once upon a time, I was on vacation and I wanted to hike over to these hills in the middle distance.&lt;/p&gt;

&lt;p&gt;I wasn’t familiar with the desert landscape...&lt;/p&gt;</summary>
    <content type="html">&lt;p&gt;Sometimes you want to do something easy and it does not go well.&lt;/p&gt;

&lt;h2 id="it-looked-like-an-easy-hike-an-allegory"&gt;It looked like an easy hike: An allegory&lt;/h2&gt;

&lt;p&gt;Once upon a time, I was on vacation and I wanted to hike over to these hills in the middle distance.&lt;/p&gt;

&lt;p&gt;I wasn&amp;rsquo;t familiar with the desert landscape, but I&amp;rsquo;ve hiked in a lot of other places. This hike didn&amp;rsquo;t look that hard.&lt;/p&gt;

&lt;p&gt;&lt;img src="/images/tech/mountains.jpg" alt="some mountains in the distance"&gt;&lt;/p&gt;

&lt;p&gt;You can see that there are some canyons that block the way, but I thought they looked small. I thought you could just go around them. I thought they had gentle slopes that wouldn&amp;rsquo;t be difficult to climb.&lt;/p&gt;

&lt;p&gt;So I set out.&lt;/p&gt;

&lt;p&gt;And I found out that once you were in the canyons, they became a maze that wasn&amp;rsquo;t easy to cross.&lt;/p&gt;

&lt;p&gt;&lt;img src="/images/tech/valley.jpg" alt="brushy valley"&gt;&lt;/p&gt;

&lt;p&gt;And I found out that the vegetation was huge and thorny.&lt;/p&gt;

&lt;p&gt;&lt;img src="/images/tech/cactus.jpg" alt="huge cacti"&gt;&lt;/p&gt;

&lt;p&gt;And I found out that the &amp;ldquo;gentle slopes&amp;rdquo; were an ordeal with many obstacles.&lt;/p&gt;

&lt;p&gt;&lt;img src="/images/tech/hillside.jpg" alt="steep hillside"&gt;&lt;/p&gt;

&lt;p&gt;I turned around pretty quickly because I realized I was outmatched by the landscape.&lt;/p&gt;

&lt;h2 id="software-is-full-of-this"&gt;Software is full of this&lt;/h2&gt;

&lt;p&gt;I think every software developer has had this kind of experience (metaphorically speaking).&lt;/p&gt;

&lt;p&gt;There&amp;rsquo;s somewhere you&amp;rsquo;d like to get; you can easily see the destination; it doesn&amp;rsquo;t look that far away.&lt;/p&gt;

&lt;p&gt;And you weren&amp;rsquo;t familiar with the landscape in detail.&lt;/p&gt;

&lt;p&gt;And you underestimated the pitfalls that were waiting for you.&lt;/p&gt;

&lt;p&gt;And&amp;hellip; even if you kept going, it was vastly harder than you thought, and you had to solve puzzles that didn&amp;rsquo;t seem related to the original goal. Getting untangled from cactus spines, for example. Incidental complexities, or at least &lt;em&gt;seemingly&lt;/em&gt; incidental complexities.&lt;/p&gt;

&lt;p&gt;One can think here of the &lt;a href="https://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect"&gt;Dunning-Kruger effect&lt;/a&gt;, which can get anyone, even people who are pretty competent overall. It&amp;rsquo;s easy to overestimate one&amp;rsquo;s capabilities.&lt;/p&gt;

&lt;p&gt;But I prefer to think of it as being less about competence than about familiarity with an environment.&lt;/p&gt;

&lt;p&gt;It&amp;rsquo;s just hard to estimate what&amp;rsquo;s feasible in an unfamiliar environment.&lt;/p&gt;

&lt;p&gt;Even in a familiar environment, things that &lt;em&gt;ought&lt;/em&gt; to be easy can just be very hard.&lt;/p&gt;

&lt;p&gt;This gets me to the original thought that inspired this post:&lt;/p&gt;

&lt;p&gt;It&amp;rsquo;s funny how simple, easy tasks can get blocked by other tasks that are dramatically harder.&lt;/p&gt;

&lt;p&gt;You want to clean up some simple taxonomy of environment values, like I did the other day, &lt;a href="/tech/2025/02/configuration-management.html"&gt;but first you have to clean up a small mountain of other things&lt;/a&gt; you didn&amp;rsquo;t really want to touch.&lt;/p&gt;

&lt;p&gt;There&amp;rsquo;s an art to knowing when to bother.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>On actually reading error messages</title>
    <link rel="alternate" href="/tech/2025/03/reading-stacktraces.html"/>
    <id>/tech/2025/03/reading-stacktraces.html</id>
    <published>2025-03-14T13:01:00+00:00</published>
    <updated>2025-03-14T13:01:00+00:00</updated>
    <author>
      <name>Eli Thorkelson</name>
    </author>
    <summary type="html">&lt;p&gt;Not all error messages are beautiful and concise.&lt;/p&gt;

&lt;p&gt;Frequently we see error messages in the form of a long stacktrace.&lt;/p&gt;

&lt;pre&gt;&lt;code class="ruby"&gt;./ruby/3.3.0/net/http.rb:1603:in `initialize': Failed to open TCP connection
to nonexistenthostname.local:80 (getaddrinfo: nodename...&lt;/code&gt;&lt;/pre&gt;</summary>
    <content type="html">&lt;p&gt;Not all error messages are beautiful and concise.&lt;/p&gt;

&lt;p&gt;Frequently we see error messages in the form of a long stacktrace.&lt;/p&gt;

&lt;pre&gt;&lt;code class="ruby"&gt;./ruby/3.3.0/net/http.rb:1603:in `initialize&amp;#39;: Failed to open TCP connection
to nonexistenthostname.local:80 (getaddrinfo: nodename nor servname provided,
or not known) (Socket::ResolutionError)
    from ./ruby/3.3.0/net/http.rb:1603:in `open&amp;#39;
    from ./ruby/3.3.0/net/http.rb:1603:in `block in connect&amp;#39;
    from ./ruby/3.3.0/timeout.rb:186:in `block in timeout&amp;#39;
    from ./ruby/3.3.0/timeout.rb:193:in `timeout&amp;#39;
    from ./ruby/3.3.0/net/http.rb:1601:in `connect&amp;#39;
    from ./ruby/3.3.0/net/http.rb:1580:in `do_start&amp;#39;
    from ./ruby/3.3.0/net/http.rb:1569:in `start&amp;#39;
    from ./ruby/3.3.0/open-uri.rb:334:in `open_http&amp;#39;
    from ./ruby/3.3.0/open-uri.rb:770:in `buffer_open&amp;#39;
    from ./ruby/3.3.0/open-uri.rb:220:in `block in open_loop&amp;#39;
    from ./ruby/3.3.0/open-uri.rb:218:in `catch&amp;#39;
    from ./ruby/3.3.0/open-uri.rb:218:in `open_loop&amp;#39;
    from ./ruby/3.3.0/open-uri.rb:158:in `open_uri&amp;#39;
    from ./ruby/3.3.0/open-uri.rb:750:in `open&amp;#39;
    from ./ruby/3.3.0/open-uri.rb:29:in `open&amp;#39;
    from (irb):7:in `&amp;lt;main&amp;gt;&amp;#39;
    ... 4 levels...
./ruby/3.3.0/net/http.rb:1603:in `initialize&amp;#39;: getaddrinfo: nodename nor
servname provided, or not known (Socket::ResolutionError)
    from ./ruby/3.3.0/net/http.rb:1603:in `open&amp;#39;
    from ./ruby/3.3.0/net/http.rb:1603:in `block in connect&amp;#39;
    from ./ruby/3.3.0/timeout.rb:186:in `block in timeout&amp;#39;
    from ./ruby/3.3.0/timeout.rb:193:in `timeout&amp;#39;
    from ./ruby/3.3.0/net/http.rb:1601:in `connect&amp;#39;
    from ./ruby/3.3.0/net/http.rb:1580:in `do_start&amp;#39;
    from ./ruby/3.3.0/net/http.rb:1569:in `start&amp;#39;
    from ./ruby/3.3.0/open-uri.rb:334:in `open_http&amp;#39;
    from ./ruby/3.3.0/open-uri.rb:770:in `buffer_open&amp;#39;
    from ./ruby/3.3.0/open-uri.rb:220:in `block in open_loop&amp;#39;
    from ./ruby/3.3.0/open-uri.rb:218:in `catch&amp;#39;
    from ./ruby/3.3.0/open-uri.rb:218:in `open_loop&amp;#39;
    from ./ruby/3.3.0/open-uri.rb:158:in `open_uri&amp;#39;
    from ./ruby/3.3.0/open-uri.rb:750:in `open&amp;#39;
    from ./ruby/3.3.0/open-uri.rb:29:in `open&amp;#39;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;To be clear, this is just a DNS lookup error.&lt;/p&gt;

&lt;p&gt;I sent a test request to a nonexistent hostname and got this stacktrace.&lt;/p&gt;
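
&lt;p&gt;One way to see this is to drop the &lt;code&gt;from&lt;/code&gt; frame lines and read what is left. A minimal sketch; the helper name &lt;code&gt;error_summary&lt;/code&gt; is hypothetical, and the sample is a shortened copy of the trace above.&lt;/p&gt;

&lt;pre&gt;&lt;code class="ruby"&gt;# Hypothetical helper: keep the message lines of a Ruby stacktrace, drop the frames.
def error_summary(trace)
  trace.lines
       .map { |line| line.strip }
       .reject { |line| line.empty? || line.start_with?('from ', '...') }
       .join(' ')
end

# A shortened copy of the stacktrace shown above.
trace = [
  "./ruby/3.3.0/net/http.rb:1603:in `initialize': Failed to open TCP connection",
  "to nonexistenthostname.local:80 (getaddrinfo: nodename nor servname provided,",
  "or not known) (Socket::ResolutionError)",
  "    from ./ruby/3.3.0/net/http.rb:1603:in `open'",
  "    from ./ruby/3.3.0/net/http.rb:1603:in `block in connect'"
].join("\n")

puts error_summary(trace)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;What remains is one sentence: a TCP connection could not be opened because the hostname did not resolve.&lt;/p&gt;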

&lt;p&gt;But here&amp;rsquo;s the thing:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A surprising number of professional technology people will look at this, see that there&amp;rsquo;s a long stacktrace, and just abandon all hope.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;They don&amp;rsquo;t recognize this long blob of text.&lt;/p&gt;

&lt;p&gt;It&amp;rsquo;s unpleasant looking.&lt;/p&gt;

&lt;p&gt;So they decide in advance they probably can&amp;rsquo;t understand it.&lt;/p&gt;

&lt;p&gt;There are, indeed, errors that you won&amp;rsquo;t understand, even when you read every line of the stacktrace. No one understands everything. Technical systems are complicated. Understanding them is an investment that you can&amp;rsquo;t always make.&lt;/p&gt;

&lt;p&gt;But if you don&amp;rsquo;t actually read the errors — you might give up prematurely.&lt;/p&gt;

&lt;p&gt;No, to be precise, you &lt;em&gt;will&lt;/em&gt; give up prematurely.&lt;/p&gt;

&lt;p&gt;To be clear, I&amp;rsquo;m not against anyone asking for help. Groups are often smarter than individuals, or at least much more knowledgeable. It&amp;rsquo;s not bad to check if someone else already knows the answer.&lt;/p&gt;

&lt;p&gt;But it would be nice if people read the errors.&lt;/p&gt;

&lt;p&gt;Maybe even before asking.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>The invisible layers of the stack, or why whitespace broke authentication</title>
    <link rel="alternate" href="/tech/2025/03/the-invisible-layers.html"/>
    <id>/tech/2025/03/the-invisible-layers.html</id>
    <published>2025-03-13T20:47:00+00:00</published>
    <updated>2025-03-13T20:47:00+00:00</updated>
    <author>
      <name>Eli Thorkelson</name>
    </author>
    <summary type="html">&lt;script type="module"&gt;
  import mermaid from 'https://cdn.jsdelivr.net/npm/mermaid@11/dist/mermaid.esm.min.mjs';
  mermaid.initialize({ startOnLoad: true, theme: 'forest' });
&lt;/script&gt;

&lt;p&gt;Someone messaged me the other day asking for assistance.&lt;/p&gt;

&lt;p&gt;Their question was intriguing: “Can...&lt;/p&gt;</summary>
    <content type="html">&lt;script type="module"&gt;
  import mermaid from 'https://cdn.jsdelivr.net/npm/mermaid@11/dist/mermaid.esm.min.mjs';
  mermaid.initialize({ startOnLoad: true, theme: 'forest' });
&lt;/script&gt;

&lt;p&gt;Someone messaged me the other day asking for assistance.&lt;/p&gt;

&lt;p&gt;Their question was intriguing: &amp;ldquo;Can you help figure out why integration requests to our API have started failing, &lt;em&gt;with an authentication error&lt;/em&gt;, when extra whitespace is added to the incoming JSON payload?&amp;rdquo;&lt;/p&gt;

&lt;h2 id="the-mysterious-case-of-whitespace-that-breaks-api-authentication"&gt;The mysterious case of whitespace that&amp;hellip; breaks API authentication&lt;/h2&gt;

&lt;p&gt;The breaking requests looked something like this:&lt;/p&gt;

&lt;pre&gt;&lt;code class="json"&gt;{
  &amp;quot;data&amp;quot; : {
    &amp;quot;type&amp;quot; : &amp;quot;object&amp;quot;,
    &amp;quot;name&amp;quot; : &amp;quot;roses&amp;quot;,
    &amp;quot;color&amp;quot; : &amp;quot;red&amp;quot;
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Our systems responded with HTTP 403 when given this payload, even though the requests were properly authenticated.&lt;/p&gt;

&lt;p&gt;Meanwhile, the very same requests would succeed if formatted like this:&lt;/p&gt;

&lt;pre&gt;&lt;code class="json"&gt;{
  &amp;quot;data&amp;quot;: {
    &amp;quot;type&amp;quot;: &amp;quot;object&amp;quot;,
    &amp;quot;name&amp;quot;: &amp;quot;roses&amp;quot;,
    &amp;quot;color&amp;quot;: &amp;quot;red&amp;quot;
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The only difference is the presence of extra whitespace in between the JSON keys and the subsequent colon.&lt;/p&gt;

&lt;p&gt;I&amp;rsquo;m used to seeing JSON object keys formatted with no whitespace in between &lt;code&gt;&amp;quot;key&amp;quot;&lt;/code&gt; and the subsequent colon &lt;code&gt;:&lt;/code&gt;, like &lt;code&gt;&amp;quot;key&amp;quot;: &amp;quot;value&amp;quot;&lt;/code&gt;. I double checked the JSON spec in &lt;a href="https://datatracker.ietf.org/doc/html/rfc7159#section-2"&gt;RFC 7159 section 2&lt;/a&gt; and found that both ways are perfectly valid JSON:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Insignificant whitespace is allowed before or after any of the six structural characters.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;(The &amp;ldquo;structural characters&amp;rdquo; include &lt;code&gt;name-separator  = ws %x3A ws  ; : colon&lt;/code&gt;.)&lt;/p&gt;
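
&lt;p&gt;A quick way to confirm this in practice is to parse both formattings and compare the results. This sketch uses Ruby&amp;rsquo;s standard &lt;code&gt;json&lt;/code&gt; library; the payload is the toy example from above, not a real request.&lt;/p&gt;

&lt;pre&gt;&lt;code class="ruby"&gt;require 'json'

# The same payload, with and without whitespace before each colon.
spaced  = '{ "data" : { "type" : "object", "name" : "roses", "color" : "red" } }'
compact = '{"data":{"type":"object","name":"roses","color":"red"}}'

# Both parse to the same Ruby hash, so a spec-compliant parser treats them as equivalent.
same = JSON.parse(spaced) == JSON.parse(compact)
puts same
&lt;/code&gt;&lt;/pre&gt;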

&lt;h2 id="where-does-the-problem-occur"&gt;Where does the problem occur?&lt;/h2&gt;

&lt;p&gt;I checked to see if we had changed our application&amp;rsquo;s JSON parsing library. I checked if anything had changed on the sending side. In both cases, nothing had recently changed.&lt;/p&gt;

&lt;p&gt;In our configuration, our public API endpoints are normally handled by &lt;a href="https://aws.amazon.com/api-gateway/"&gt;AWS API Gateway&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;API Gateway has a bit of internal complexity but roughly we use it like this:&lt;/p&gt;

&lt;pre class="mermaid"&gt;
sequenceDiagram
    participant c as Client
    participant gw as AWS API Gateway
    participant app as Application Server

    c-&gt;&gt;gw: Incoming request
    gw-&gt;&gt;gw: Checks authentication
    Note right of gw: Could have auth/parsing errors?
    gw-&gt;&gt;app: Sends allowed, authenticated requests
    Note right of app: Could have auth/parsing errors?
    app--&gt;&gt;gw: Returns response
    gw--&gt;&gt;c: Returns response
&lt;/pre&gt;

&lt;p&gt;We don&amp;rsquo;t expect API Gateway to do much more than handle authentication and a bit of URL path routing for us. We expect it to pass all authenticated inbound payloads to our back end servers for processing.&lt;/p&gt;

&lt;p&gt;However, strangely, the whitespace problem was only reproducible when requests were sent through AWS API Gateway. If we sent test requests straight to our application servers, bypassing the API Gateway, then the whitespace issue vanished.&lt;/p&gt;

&lt;p&gt;This helps to localize the problem.&lt;/p&gt;

&lt;h2 id="what-does-api-gateway-really-do"&gt;What does API Gateway really do?&lt;/h2&gt;

&lt;p&gt;I checked whether API Gateway does any kind of processing on the request payloads. I was curious if it attempted to validate JSON syntax and if it somehow considered the extra whitespace a syntax error.&lt;/p&gt;

&lt;p&gt;No, and no.&lt;/p&gt;

&lt;p&gt;And even if it did validate request payloads, one would not expect it to emit an auth error (&lt;a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Status/403"&gt;HTTP 403 Forbidden&lt;/a&gt;, in particular).&lt;/p&gt;

&lt;p&gt;So I started looking for the API Gateway logs to see what it thought was happening.&lt;/p&gt;

&lt;h2 id="its-the-invisible-layers-of-the-stack"&gt;It&amp;rsquo;s the invisible layers of the stack&lt;/h2&gt;

&lt;p&gt;The API Gateway access logs promptly revealed the problem: the broken requests were all being blocked by AWS WAF. That&amp;rsquo;s the &lt;a href="https://aws.amazon.com/waf/"&gt;web application firewall&lt;/a&gt;, a service that inspects incoming requests and tries to block malicious traffic.&lt;/p&gt;

&lt;p&gt;The API Gateway logs themselves didn&amp;rsquo;t reveal &lt;em&gt;why&lt;/em&gt; these requests were blocked by the WAF, but this was already enough to work with. We chatted with the infrastructure person who maintains WAF configuration, and found that some new filtering rules had recently been deployed. These rules had erroneously blocked some legit traffic since deployment. The QA team initially discovered the problem in our staging environment, and escalated to engineering.&lt;/p&gt;

&lt;p&gt;Around this point, I realized my high-level mental model of the system was incomplete.&lt;/p&gt;

&lt;p&gt;An expanded model would look like this:&lt;/p&gt;

&lt;pre class="mermaid"&gt;
sequenceDiagram
    participant c as Client
    participant gw as AWS API Gateway
    participant waf as AWS WAF
    participant app as Application Server

    c-&gt;&gt;gw: Incoming request
    gw-&gt;&gt;gw: Checks authentication
    Note right of gw: Auth/parsing errors?
    gw-&gt;&gt;waf: Checks requests against WAF rule set
    Note right of waf: What happens here?
    waf--&gt;&gt;gw: Responds blocked or allowed
    gw-&gt;&gt;app: Sends allowed, authenticated requests
    Note right of app: Auth/parsing errors?
    app--&gt;&gt;gw: Returns response
    gw--&gt;&gt;c: Returns response
&lt;/pre&gt;

&lt;p&gt;In other words, WAF constitutes another major failure point in the request flow, one that I hadn&amp;rsquo;t given much thought to. It&amp;rsquo;s always the invisible layers in the stack that get you. There&amp;rsquo;s just so much happening in complex technical systems, and it&amp;rsquo;s easy to neglect the layers that usually work silently, without problems.&lt;/p&gt;

&lt;h2 id="but-why-would-waf-block-json-because-of-extra-whitespace"&gt;But why would WAF block JSON because of extra whitespace&lt;/h2&gt;

&lt;p&gt;I don&amp;rsquo;t have the whole story here as I don&amp;rsquo;t have access to WAF rule definitions. But I learned that the request was being blocked by a rule set that tried to block known Windows shell attacks. I presume something about &lt;code&gt;&amp;quot;type&amp;quot; :&lt;/code&gt; triggered some regex.&lt;/p&gt;

&lt;p&gt;Our back end servers don&amp;rsquo;t run Windows, but these rules were enabled anyway, causing the issue.&lt;/p&gt;

&lt;p&gt;As &lt;a href="https://repost.aws/knowledge-center/waf-block-respond-403-forbidden-error"&gt;AWS docs explain&lt;/a&gt;, WAF will respond with HTTP 403 when a request is blocked. Unfortunately this status is also sometimes emitted by our downstream application servers, so it doesn&amp;rsquo;t help us localize the issue. But at least now I know to look for this case if we see 403s.&lt;/p&gt;

&lt;h2 id="fixing-blind-spots"&gt;Fixing blind spots&lt;/h2&gt;

&lt;p&gt;We&amp;rsquo;ll probably deploy better reporting to try to prevent this scenario in the future. We could have caught it faster if we had alerts on the right things in our logs.&lt;/p&gt;

&lt;p&gt;It&amp;rsquo;s still very unintuitive that anyone could ever get an auth error from adding extra whitespace in a JSON payload. But things make more sense when you remember that every request is screened against a mostly-opaque set of security rulesets that probably don&amp;rsquo;t even consider the inbound request format.&lt;/p&gt;

&lt;p&gt;The thing people like about AWS WAF is that it is supposed to mitigate security risks without much work on the application developer&amp;rsquo;s part. The problem is, this requires that it should work invisibly like a black box without ever failing. Then if it doesn&amp;rsquo;t work properly - you still end up having to partly understand it anyway.&lt;/p&gt;

&lt;p&gt;At the very least — now we remember it&amp;rsquo;s there, and we know that this failure mode is possible.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>A rat’s nest of configuration management</title>
    <link rel="alternate" href="/tech/2025/02/configuration-management.html"/>
    <id>/tech/2025/02/configuration-management.html</id>
    <published>2025-02-20T11:49:00+00:00</published>
    <updated>2025-02-20T11:49:00+00:00</updated>
    <author>
      <name>Eli Thorkelson</name>
    </author>
    <summary type="html">&lt;p&gt;I decided to clean up our environment type taxonomy a little bit at work. This involved a little journey into configuration management.&lt;/p&gt;

&lt;h2 id="configuration-strategies"&gt;Configuration strategies&lt;/h2&gt;

&lt;p&gt;Obviously, there isn’t just one way to configure your software. A few patterns I’ve noticed...&lt;/p&gt;</summary>
    <content type="html">&lt;p&gt;I decided to clean up our environment type taxonomy a little bit at work. This involved a little journey into configuration management.&lt;/p&gt;

&lt;h2 id="configuration-strategies"&gt;Configuration strategies&lt;/h2&gt;

&lt;p&gt;Obviously, there isn&amp;rsquo;t just one way to configure your software. A few patterns I&amp;rsquo;ve noticed lately:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hardcoded constants in the codebase&lt;/li&gt;
&lt;li&gt;Conditional logic in the codebase that dynamically generates configuration values&lt;/li&gt;
&lt;li&gt;YAML configuration files with global scope&lt;/li&gt;
&lt;li&gt;YAML configuration files scoped by environment type&lt;/li&gt;
&lt;li&gt;Direct access to environment variables from application code&lt;/li&gt;
&lt;li&gt;Dynamic configuration loading from external configuration services.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At a high level, we can distinguish between static configuration and dynamic configuration. (This might be more of a spectrum than a binary distinction; for example, dynamic configuration at initialization time is logically distinct from dynamic configuration that can change at runtime.)&lt;/p&gt;

&lt;p&gt;I think it&amp;rsquo;s pretty reasonable to use static configuration methods like code constants for values you are certain will never change dynamically. But beware: if you write a code constant that says &lt;code&gt;TIMEOUT_SECONDS = 3&lt;/code&gt; and one day the production system urgently needs &lt;code&gt;TIMEOUT_SECONDS = 10&lt;/code&gt;, you will wish you had provided a dynamic configuration hook for this value. At the same time, too much configurability is also bad: it inevitably leads to inconsistency across runtime contexts that you don&amp;rsquo;t want.&lt;/p&gt;
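
&lt;p&gt;A minimal sketch of such a hook, assuming an env var named &lt;code&gt;TIMEOUT_SECONDS&lt;/code&gt; (a name chosen here for illustration): the constant stays in code as the default, and the environment can override it.&lt;/p&gt;

&lt;pre&gt;&lt;code class="ruby"&gt;# The static value stays in code as the default.
DEFAULT_TIMEOUT_SECONDS = 3

# Hypothetical hook: an env var named TIMEOUT_SECONDS overrides the default.
# Integer() raises on garbage like "ten", so a bad override fails loudly.
def timeout_seconds(env = ENV)
  Integer(env.fetch('TIMEOUT_SECONDS', DEFAULT_TIMEOUT_SECONDS))
end

# A plain hash stands in for the production environment here.
puts timeout_seconds({ 'TIMEOUT_SECONDS' => '10' }) # prints 10
&lt;/code&gt;&lt;/pre&gt;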

&lt;p&gt;Well, anyway, here I&amp;rsquo;m just interested in configuration that&amp;rsquo;s based on environment type, because what I wanted to do was clean up our environment types.&lt;/p&gt;

&lt;h2 id="rails-configuration"&gt;Rails configuration&lt;/h2&gt;

&lt;p&gt;Our applications tend to have complicated settings files organized by &lt;a href="https://guides.rubyonrails.org/configuring.html#rails-environment-settings"&gt;RAILS_ENV&lt;/a&gt;, a minimalist environment taxonomy provided by Rails which implicitly provides a set of allowed environment types (&lt;code&gt;development&lt;/code&gt;, &lt;code&gt;test&lt;/code&gt;, &lt;code&gt;staging&lt;/code&gt;, &lt;code&gt;production&lt;/code&gt;, and anything else you want).&lt;/p&gt;

&lt;p&gt;According to the default &amp;ldquo;Rails way,&amp;rdquo; you can configure your application by setting environment-specific settings in an environment-specific configuration file located at &lt;code&gt;config/environments/&amp;lt;ENVIRONMENT_NAME&amp;gt;.rb&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In our case, we have a large number of business logic flags and settings that don&amp;rsquo;t fit neatly into that approach. We handle this case using Rails&amp;rsquo; &lt;a href="https://api.rubyonrails.org/classes/Rails/Application.html#method-i-config_for"&gt;config_for&lt;/a&gt;, which provides support for nicely loading a YAML file organized (by default) by &lt;code&gt;RAILS_ENV&lt;/code&gt;. The YAML file in question can also evaluate ERB templates, meaning that you can put arbitrary Ruby logic into it.&lt;/p&gt;

&lt;p&gt;In practice, mainly we use ERB to evaluate environment variables. Naturally, it ends up with a tangled mess, something like this:&lt;/p&gt;

&lt;pre&gt;&lt;code class="yaml"&gt;# settings.yml
shared:
  allow_user_registration: true
  send_mail_from_domain: example.com
  cleanup_old_records: &amp;lt;%= ENV.fetch(&amp;quot;CLEANUP_OLD_RECORDS&amp;quot;, &amp;quot;false&amp;quot;) == &amp;quot;true&amp;quot; %&amp;gt;
development:
  allow_user_registration: false
  cleanup_old_records: &amp;lt;%= ENV.fetch(&amp;quot;CLEANUP_OLD_RECORDS&amp;quot;, &amp;quot;false&amp;quot;) == &amp;quot;true&amp;quot; %&amp;gt;
  user_rate_limit: 25
staging: &amp;amp;staging
  allow_user_registration: &amp;lt;%= ENV.fetch(&amp;quot;ALLOW_USER_REGISTRATION&amp;quot;, &amp;quot;true&amp;quot;) == &amp;quot;true&amp;quot; %&amp;gt;
  cleanup_old_records: &amp;lt;%= ENV.fetch(&amp;quot;CLEANUP_OLD_RECORDS&amp;quot;, &amp;quot;true&amp;quot;) == &amp;quot;true&amp;quot; %&amp;gt;
  user_rate_limit: &amp;lt;%= ENV[&amp;quot;SECONDARY_USER_RATE_LIMIT&amp;quot;] || 10 %&amp;gt;
qa:
  &amp;lt;&amp;lt;: *staging
  allow_user_registration: &amp;lt;%= ENV.fetch(&amp;quot;ALLOW_USER_REGISTRATION&amp;quot;, &amp;quot;true&amp;quot;) == &amp;quot;true&amp;quot; %&amp;gt;
  cleanup_old_records: &amp;lt;%= ENV[&amp;quot;CLEANUP_POLICY&amp;quot;] == &amp;quot;strict&amp;quot; &amp;amp;&amp;amp;
    ENV.fetch(&amp;quot;CLEANUP_OLD_RECORDS&amp;quot;, &amp;quot;false&amp;quot;) == &amp;quot;true&amp;quot; %&amp;gt;
  user_rate_limit: &amp;lt;%= ENV[&amp;quot;USER_RATE_LIMIT&amp;quot;] || 5 %&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now suppose you have multiple Rails services built at different times using different configuration structures. And you would like to clean them up and make them consistent. Imagine that one of your services uses &lt;code&gt;RAILS_ENV=&amp;quot;qa&amp;quot;&lt;/code&gt; in the QA environment, but all the others use &lt;code&gt;RAILS_ENV=&amp;quot;staging&amp;quot;&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This can be cleaned up parsimoniously by removing &lt;code&gt;RAILS_ENV=qa&lt;/code&gt; from the one service that is not like the others.&lt;/p&gt;

&lt;p&gt;This, however, means that you have to read all the configuration currently under &lt;code&gt;qa&lt;/code&gt; and move it into the &lt;code&gt;staging&lt;/code&gt; configuration context. A project which, though I&amp;rsquo;m no Hercules, reminds me vaguely of &lt;a href="https://en.wikipedia.org/wiki/Labours_of_Hercules#Fifth:_Augean_stables"&gt;cleaning the Augean stables&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id="fixing-inconsistencies-one-by-one"&gt;Fixing inconsistencies one by one&lt;/h2&gt;

&lt;p&gt;Let&amp;rsquo;s consider the cases here one by one.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;allow_user_registration&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Both &lt;code&gt;qa&lt;/code&gt; and &lt;code&gt;staging&lt;/code&gt; are set to the same environment variable &lt;code&gt;ALLOW_USER_REGISTRATION&lt;/code&gt; with the same default. In this case, it should be fine to switch to &lt;code&gt;staging&lt;/code&gt; without doing anything.&lt;/p&gt;

&lt;p&gt;But it still took a minute to read both lines of code, compare them, and make sure that they&amp;rsquo;re identical. &amp;ldquo;Manual toil,&amp;rdquo; we call this at work. (This specific case could be handled by an LLM, I think, but you would still have to verify the results, which would take as long as doing it manually, because errors are impermissible.)&lt;/p&gt;

&lt;p&gt;&lt;code&gt;cleanup_old_records&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;qa&lt;/code&gt; config checks two different conditions based on two different env vars (&lt;code&gt;CLEANUP_POLICY&lt;/code&gt;, &lt;code&gt;CLEANUP_OLD_RECORDS&lt;/code&gt;). Meanwhile, &lt;code&gt;staging&lt;/code&gt; just checks one env var, &lt;code&gt;CLEANUP_OLD_RECORDS&lt;/code&gt;, with a different default from &lt;code&gt;qa&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;What you have to do in this case to consolidate them is to discover what the use case is here, check what the current env vars are actually set to in the &lt;code&gt;qa&lt;/code&gt; environment, and then, if possible, consolidate on only using &lt;code&gt;CLEANUP_OLD_RECORDS&lt;/code&gt;, so that you can fall back to the &lt;code&gt;staging&lt;/code&gt; config safely. You might have to alter the env vars themselves to make this work (we usually set env vars in code, using Helm charts, so this is fine).&lt;/p&gt;

&lt;p&gt;This is becoming a research project, because you have to understand the use case and check env vars from a different system. I&amp;rsquo;m not convinced an LLM can do this yet, because it requires digging around elsewhere in a large technical system.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;user_rate_limit&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Both &lt;code&gt;qa&lt;/code&gt; and &lt;code&gt;staging&lt;/code&gt; read the same environment variable &lt;code&gt;SECONDARY_USER_RATE_LIMIT&lt;/code&gt;, but with different defaults. In this case you have to ensure that the &lt;code&gt;SECONDARY_USER_RATE_LIMIT&lt;/code&gt; env var is populated in &lt;code&gt;qa&lt;/code&gt; and has the correct value (you can set it to &lt;code&gt;10&lt;/code&gt; if it is not already set).&lt;/p&gt;

&lt;p&gt;This isn&amp;rsquo;t really complicated but, again, manual toil.&lt;/p&gt;

&lt;p&gt;Obviously these aren&amp;rsquo;t real examples, but imagine doing this project again and again across a large set of configuration points, and you can imagine the experience.&lt;/p&gt;
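
&lt;p&gt;Some of the comparing can be mechanized. Here is a minimal sketch, assuming each context has already been rendered into a plain Ruby hash; the &lt;code&gt;diff_settings&lt;/code&gt; helper and the sample values are hypothetical, not taken from a real service. It only tells you &lt;em&gt;which&lt;/em&gt; keys disagree for one particular set of env vars; deciding what to do about them is still the research project.&lt;/p&gt;

&lt;pre&gt;&lt;code class="ruby"&gt;# Hypothetical helper: report every key whose value differs between two
# configuration contexts (for example, qa and staging).
def diff_settings(left, right)
  (left.keys | right.keys).sort.each_with_object({}) do |key, diff|
    diff[key] = [left[key], right[key]] unless left[key] == right[key]
  end
end

# Made-up sample values, as they might look with no env vars set.
QA_SETTINGS = { allow_user_registration: true, user_rate_limit: 10 }
STAGING_SETTINGS = { allow_user_registration: true, user_rate_limit: 5 }

# Only the keys that disagree are reported, each with its [qa, staging] pair.
p diff_settings(QA_SETTINGS, STAGING_SETTINGS)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;With these sample values, only &lt;code&gt;user_rate_limit&lt;/code&gt; is reported, which at least narrows down where to look.&lt;/p&gt;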

&lt;h2 id="virtual-patch-panels"&gt;Virtual patch panels&lt;/h2&gt;

&lt;p&gt;It occurred to me that what we were building in &lt;code&gt;settings.yml&lt;/code&gt; was the virtual equivalent of a patch panel.&lt;/p&gt;

&lt;p&gt;&lt;a title="Don DeBold from San Jose, CA, USA, CC BY 2.0 &amp;lt;https://creativecommons.org/licenses/by/2.0&amp;gt;, via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:Analog_Computer_patch_panel_(5900042053).jpg"&gt;&lt;img width="512" alt="Analog Computer patch panel (5900042053)" src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/e4/Analog_Computer_patch_panel_%285900042053%29.jpg/512px-Analog_Computer_patch_panel_%285900042053%29.jpg?20190718083348"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The function of a patch panel is to allow you to route arbitrary outputs to arbitrary inputs. It is a sort of abstraction layer that allows for flexibility and configurability, while (hopefully) keeping the mess contained to a single zone.&lt;/p&gt;

&lt;p&gt;In this case, our virtual patch panel takes inputs from the env vars or from a set of predefined values and then routes them into a &lt;code&gt;settings&lt;/code&gt; table exposed to application developers. And that can be more or less of a huge mess. It takes discipline to keep it from becoming a huge mess.&lt;/p&gt;

&lt;p&gt;We&amp;rsquo;d like to keep this fundamentally messy and arbitrary zone as clean and understandable as possible.&lt;/p&gt;

&lt;h2 id="best-practices"&gt;Best practices&lt;/h2&gt;

&lt;p&gt;I started thinking about things to &lt;em&gt;not do&lt;/em&gt; in a configuration layer.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Don&amp;rsquo;t duplicate configuration code (see &lt;code&gt;allow_user_registration&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Prefer flat configuration systems; try not to build inheritance trees of configuration when possible.&lt;/li&gt;
&lt;li&gt;Prefer consistency across contexts. If you need a similar configuration system in two separate services, make it as identical as you can. Standardization reduces cognitive load on maintainers and, so, reduces the risk of configuration errors.&lt;/li&gt;
&lt;li&gt;Don&amp;rsquo;t allow direct environment variable access from the codebase (as that is skipping what the &amp;ldquo;patch panel&amp;rdquo; is for). Keep all env var access centralized in one site, so it is instantly apparent to maintainers what env vars are required by the program. If this is impossible, at least minimize the number of places that access env vars.&lt;/li&gt;
&lt;li&gt;Don&amp;rsquo;t write to the env vars as if they were a convenient set of mutable global variables. (Exceptions may be allowable in test code, maybe, sometimes. Ugh.)&lt;/li&gt;
&lt;/ul&gt;
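
&lt;p&gt;The centralization rule can be sketched in a few lines of Ruby. This is an illustration, not our actual code: the &lt;code&gt;AppSettings&lt;/code&gt; module name, the parsing rules, and the defaults are all assumptions.&lt;/p&gt;

&lt;pre&gt;&lt;code class="ruby"&gt;# Hypothetical single point of env var access. Everything the program reads
# from the environment is listed here and nowhere else.
module AppSettings
  # The environment is passed in (defaulting to ENV) so that callers, and
  # tests in particular, can supply a plain hash instead of mutating ENV.
  def self.load(env = ENV)
    {
      allow_user_registration: env.fetch(&amp;quot;ALLOW_USER_REGISTRATION&amp;quot;, &amp;quot;false&amp;quot;) == &amp;quot;true&amp;quot;,
      user_rate_limit: Integer(env.fetch(&amp;quot;SECONDARY_USER_RATE_LIMIT&amp;quot;, &amp;quot;10&amp;quot;)),
    }.freeze
  end
end

settings = AppSettings.load
puts settings[:user_rate_limit]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Because the environment is an argument, test code can call &lt;code&gt;AppSettings.load&lt;/code&gt; with an ordinary hash and never has to write to the real env vars.&lt;/p&gt;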

&lt;h2 id="coda"&gt;Coda&lt;/h2&gt;

&lt;p&gt;Recently we&amp;rsquo;ve started switching to a feature flag SaaS solution which costs a lot and nobody really loves. It has heavy process overhead from non-engineering staff, a confusing UX, limited licenses because of having high costs per seat, and generally all the modern luxuries we&amp;rsquo;ve come to expect.&lt;/p&gt;

&lt;p&gt;It is fine for its specific use case (runtime evaluation of boolean flags) and it&amp;rsquo;s good for managing release-specific flags, but it&amp;rsquo;s not a complete replacement for &lt;code&gt;settings.yml&lt;/code&gt;. So now, of course, we heavily use both systems.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;#winning?&lt;/code&gt;&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Coordinating Unicorn worker processes with flock</title>
    <link rel="alternate" href="/tech/2025/01/flock-coordination.html"/>
    <id>/tech/2025/01/flock-coordination.html</id>
    <published>2025-01-31T16:31:00+00:00</published>
    <updated>2025-01-31T16:31:00+00:00</updated>
    <author>
      <name>Eli Thorkelson</name>
    </author>
    <summary type="html">&lt;p&gt;Some of our current software runs on Unicorn, which if you aren’t the target audience for this post, is a process-based Ruby webserver that has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://yhbt.net/unicorn/README.html"&gt;a very old-school website&lt;/a&gt;,&lt;/li&gt;
&lt;li&gt;a &lt;a href="https://github.com/defunkt/unicorn/"&gt;GitHub mirror&lt;/a&gt;,&lt;/li&gt;
&lt;li&gt;and an architecture based on a master process that forks...&lt;/li&gt;
&lt;/ul&gt;</summary>
    <content type="html">&lt;p&gt;Some of our current software runs on Unicorn, which if you aren&amp;rsquo;t the target audience for this post, is a process-based Ruby webserver that has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://yhbt.net/unicorn/README.html"&gt;a very old-school website&lt;/a&gt;,&lt;/li&gt;
&lt;li&gt;a &lt;a href="https://github.com/defunkt/unicorn/"&gt;GitHub mirror&lt;/a&gt;,&lt;/li&gt;
&lt;li&gt;and an architecture based on a master process that forks a number of child worker processes to handle requests.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We got interested lately in having exactly &lt;em&gt;one&lt;/em&gt; of a set of Unicorn workers spawn a background thread that would report some periodic healthcheck data. The idea was that the healthcheck results would be identical for all workers, so we only needed to report the data once per Unicorn master process. But we didn&amp;rsquo;t want to run a reporting thread on the master process, since forking a multithreaded process is discouraged. (See for example Thorsten Ball&amp;rsquo;s &lt;a href="https://thorstenball.com/blog/2014/10/13/why-threads-cant-fork/"&gt;Why Threads Can&amp;rsquo;t Fork&lt;/a&gt;, rachelbythebay&amp;rsquo;s &lt;a href="https://rachelbythebay.com/w/2011/06/07/forked/"&gt;Don&amp;rsquo;t mix threads and forks&lt;/a&gt;, or more recently byroot&amp;rsquo;s &lt;a href="https://byroot.github.io/ruby/performance/2025/01/25/why-does-everyone-hate-fork.html"&gt;Why does everyone hate fork?&lt;/a&gt;.)&lt;/p&gt;

&lt;p&gt;As the &lt;a href="https://man7.org/linux/man-pages/man2/fork.2.html"&gt;fork(2) manpage&lt;/a&gt; explains, when you &lt;code&gt;fork&lt;/code&gt;, only the thread that called &lt;code&gt;fork&lt;/code&gt; exists in the child process; all the other threads simply disappear and are never resumed there:&lt;/p&gt;

&lt;pre&gt;&lt;code class="txt"&gt;- The child process is created with a single thread—the one that
called fork().  The entire virtual address space of the parent
is replicated in the child, including the states of mutexes,
condition variables, and other pthreads objects; the use of
pthread_atfork(3) may be helpful for dealing with problems
that this can cause.

- After a fork() in a multithreaded program, the child can
safely call only async-signal-safe functions (see
signal-safety(7)) until such time as it calls execve(2).
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Stopping all background threads when forking might be what you want, in some cases, but it can leave resources dangling, connections unreleased, and so on, depending on what was happening in the other threads at the time.&lt;/p&gt;

&lt;p&gt;Anyway - if we don&amp;rsquo;t want to report healthcheck data from the &lt;code&gt;master&lt;/code&gt; process, and we want to report it from only one of &lt;em&gt;n&lt;/em&gt; worker processes, then this raises an interesting interprocess coordination problem.&lt;/p&gt;

&lt;p&gt;How can you guarantee that out of a pool of &lt;em&gt;n&lt;/em&gt; workers, exactly one will run a given observability task at any given time? And how can you guarantee that if one worker dies, another will automatically start running the observability task?&lt;/p&gt;

&lt;p&gt;It kind of reminds me of Zookeeper - a cluster coordination problem - except that in this case, we aren&amp;rsquo;t trying to coordinate processes across a whole cluster; we are only trying to coordinate processes within a particular container.&lt;/p&gt;

&lt;h2 id="naive-approach"&gt;Naive approach&lt;/h2&gt;

&lt;p&gt;The first thing that occurred to me was this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;At boot time, each child process will check for the existence of a file at a standard path (let&amp;rsquo;s say &lt;code&gt;/tmp/coordination.pid&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;If &lt;code&gt;/tmp/coordination.pid&lt;/code&gt; is not found, then create it, and write the current pid to it. Whichever process does this first is volunteering to run the healthcheck task.&lt;/li&gt;
&lt;li&gt;If &lt;code&gt;/tmp/coordination.pid&lt;/code&gt; is already present, then check if a process with that pid is running.

&lt;ul&gt;
&lt;li&gt;If so, then sleep for a while and then check again.&lt;/li&gt;
&lt;li&gt;If not, then proceed from step 2 as if the file were not found.&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Problems with this approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;There is some chance of a race condition in between steps 1 and 2, wherein two processes simultaneously find that &lt;code&gt;/tmp/coordination.pid&lt;/code&gt; is absent and then each try to write their pids to the same path. The chances of this could perhaps be mitigated by waiting for a random interval before attempting step 1.&lt;/li&gt;
&lt;li&gt;For the numerous workers that are sleeping, it&amp;rsquo;s inefficient that they have to wake up every so often to recheck step 3. This imposes a pointless polling cost.&lt;/li&gt;
&lt;/ul&gt;
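
&lt;p&gt;For concreteness, the naive steps might be sketched like this. It illustrates the approach I abandoned and is not something we ran; the helper names are made up, and the sleep-and-recheck loop is left out. The comment marks where the race lives.&lt;/p&gt;

&lt;pre&gt;&lt;code class="ruby"&gt;COORDINATION_PATH = &amp;quot;/tmp/coordination.pid&amp;quot;

# Hypothetical helper: is there a live process with this pid?
def process_alive?(pid)
  return false unless pid.positive? # also rejects an empty or garbled pid file
  Process.kill(0, pid) # signal 0 only checks for existence
  true
rescue Errno::ESRCH
  false # no such process
rescue Errno::EPERM
  true # the process exists but belongs to another user
end

# Returns true if this process has just volunteered for the healthcheck task,
# false if a live process already owns the pid file.
def claim_healthcheck_role(path = COORDINATION_PATH)
  if File.exist?(path)
    owner = File.read(path).to_i
    return false if process_alive?(owner)
  end
  # Race window: another worker can pass the same check before this write.
  File.write(path, Process.pid.to_s)
  true
end
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;A worker that gets &lt;code&gt;false&lt;/code&gt; back would have to sleep and call &lt;code&gt;claim_healthcheck_role&lt;/code&gt; again later, which is the polling cost described above.&lt;/p&gt;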

&lt;p&gt;My colleague &lt;a href="https://dmytro.sh/"&gt;Dmytro&lt;/a&gt; suggested that we use &lt;code&gt;flock&lt;/code&gt; instead, which essentially delegates the whole coordination problem to the operating system and solves both of these problems.&lt;/p&gt;

&lt;p&gt;I had never heard of it before.&lt;/p&gt;

&lt;h2 id="flock-2"&gt;Flock(2)&lt;/h2&gt;

&lt;p&gt;I found &lt;code&gt;flock&lt;/code&gt; hard to learn about. There are manpages (&lt;a href="https://man7.org/linux/man-pages/man2/flock.2.html"&gt;flock(2)&lt;/a&gt;) and &lt;a href="https://news.ycombinator.com/item?id=35881520"&gt;Hacker News discussions&lt;/a&gt;, but they don&amp;rsquo;t cover the set of &lt;em&gt;use cases&lt;/em&gt; for file locking very clearly. I think the core use case is &amp;ldquo;several processes want to write to the same shared file and need to cooperate with each other.&amp;rdquo;&lt;/p&gt;

&lt;p&gt;In any case, it is a system call that comes with some caveats. The first two I found:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The relationship between file descriptors and file locks is slightly confusing to me in the context of forks.&lt;/li&gt;
&lt;li&gt;Per &lt;a href="https://utcc.utoronto.ca/~cks/space/blog/linux/FlockFcntlAndNFS"&gt;this discussion&lt;/a&gt;, flock is handled poorly over NFS, although I don&amp;rsquo;t think that edge case is very relevant to our Kubernetes cluster. Fortunately, the edge case pertains to how flock is handled differently for processes running on the NFS server itself than for NFS clients, so even if our ops team started to run NFS for our web workers without telling me, the edge case would not affect us.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In any event, &lt;code&gt;flock&lt;/code&gt; can nicely be used to coordinate only-once semantics among a set of worker processes. The way it works for our use case is this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Open a file at a standard path.&lt;/li&gt;
&lt;li&gt;Attempt to acquire an exclusive lock (&lt;code&gt;LOCK_EX&lt;/code&gt;) on the file. Use the blocking form of &lt;code&gt;flock&lt;/code&gt; that just blocks the caller until the lock can be acquired.&lt;/li&gt;
&lt;li&gt;If you acquire the lock, then you can go ahead and run your instrumentation task, or whatever only-once activity you want to conduct.&lt;/li&gt;
&lt;li&gt;When the process that is currently holding the lock eventually exits or is killed, the operating system will automatically wake up the next process in line and give it the lock. You never have to poll, in this approach.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id="ruby-implementation"&gt;Ruby implementation&lt;/h2&gt;

&lt;p&gt;Ruby provides a standard (though platform-dependent) interface to &lt;code&gt;flock&lt;/code&gt;, available at &lt;a href="https://docs.ruby-lang.org/en/3.4/File.html#method-i-flock"&gt;File#flock&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;One can write an implementation roughly like this in a &lt;a href="https://github.com/defunkt/unicorn/blob/master/examples/unicorn.conf.rb"&gt;Unicorn configuration file&lt;/a&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code class="ruby"&gt;TMP_FILE_PATH = &amp;quot;/tmp/coordination.pid&amp;quot;

after_fork do |server, worker|
  Thread.new do
    File.open(TMP_FILE_PATH, File::RDWR | File::CREAT, 0644) do |f|
      f.flock(File::LOCK_EX) # blocks until the lock can be acquired; held until f is closed

      # now run whatever background task you want here, such as reporting system health.
    end
  end
end
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;So far, this has worked quite well for us, and it seems likely to be much more robust than any DIY solution I could have come up with.&lt;/p&gt;
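
&lt;p&gt;If you want to poke at the locking semantics outside of Unicorn, here is a small sketch. It assumes Linux-style &lt;code&gt;flock&lt;/code&gt; behavior, where every &lt;code&gt;open&lt;/code&gt; call gets its own lock-holding file description, even within a single process; the helper names and the temporary path are made up. It uses &lt;code&gt;LOCK_NB&lt;/code&gt; so that a refused lock returns immediately instead of blocking.&lt;/p&gt;

&lt;pre&gt;&lt;code class="ruby"&gt;require &amp;quot;tmpdir&amp;quot;

# Returns true if the exclusive lock was acquired, false if it is already
# held through another open file description.
def try_exclusive_lock(file)
  file.flock(File::LOCK_EX | File::LOCK_NB) == 0
end

# Opens the same file twice and records what each lock attempt returns.
def lock_demo
  Dir.mktmpdir do |dir|
    path = File.join(dir, &amp;quot;coordination.lock&amp;quot;)
    first = File.open(path, File::RDWR | File::CREAT, 0o644)
    second = File.open(path, File::RDWR | File::CREAT, 0o644)

    results = []
    results.push(try_exclusive_lock(first))  # the first handle takes the lock
    results.push(try_exclusive_lock(second)) # the second handle is refused
    first.close                              # closing the holder releases the lock
    results.push(try_exclusive_lock(second)) # now the second handle can take it
    second.close
    results
  end
end

p lock_demo
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The second handle should be refused until the first one is closed, which is the same handoff that happens when a lock-holding worker exits.&lt;/p&gt;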

&lt;h2 id="further-reading"&gt;Further reading&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/torvalds/linux/blob/master/fs/locks.c"&gt;kernel source &amp;gt; fs/locks.c&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/torvalds/linux/commit/fd7732e033e30b3a586923b57e338c859e17858a"&gt;interesting kernel commit that introduces a tree of dependent requests for a given file lock&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;(Standard disclaimer: I am absolutely not an expert on the Linux kernel, although I do enjoy trying to read the source code from time to time.)&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>How to instrument DNS lookups in Ruby</title>
    <link rel="alternate" href="/tech/2024/12/instrument-dns-lookups.html"/>
    <id>/tech/2024/12/instrument-dns-lookups.html</id>
    <published>2024-12-19T06:29:00+00:00</published>
    <updated>2024-12-19T06:29:00+00:00</updated>
    <author>
      <name>Eli Thorkelson</name>
    </author>
    <summary type="html">&lt;p&gt;One of the things we did at work, after our strange &lt;a href="/tech/2024/09/a-confusing-issue.html"&gt;performance issue in September&lt;/a&gt;, was to add some instrumentation to DNS lookups in Ruby.&lt;/p&gt;

&lt;p&gt;As I mentioned last time, it isn’t easy to get performance data directly from &lt;code&gt;getaddrinfo&lt;/code&gt; in Linux, which is...&lt;/p&gt;</summary>
    <content type="html">&lt;p&gt;One of the things we did at work, after our strange &lt;a href="/tech/2024/09/a-confusing-issue.html"&gt;performance issue in September&lt;/a&gt;, was to add some instrumentation to DNS lookups in Ruby.&lt;/p&gt;

&lt;p&gt;As I mentioned last time, it isn&amp;rsquo;t easy to get performance data directly from &lt;code&gt;getaddrinfo&lt;/code&gt; in Linux, which is the &lt;a href="https://man7.org/linux/man-pages/man3/getaddrinfo.3.html"&gt;libc function that does DNS lookups&lt;/a&gt;. You would have to have a local DNS proxy service running and do the logging from there, which isn&amp;rsquo;t the recommended configuration for our Kubernetes-based system, as far as I understand.&lt;/p&gt;

&lt;p&gt;Indirectly, this also means that you can&amp;rsquo;t easily instrument DNS lookups from C extensions in Ruby, if they call &lt;code&gt;getaddrinfo&lt;/code&gt; directly. It probably is not worthwhile to fork every library you use and add your own custom instrumentation code.&lt;/p&gt;

&lt;p&gt;However, some libraries with C extensions still use Ruby for the DNS lookups, as Ruby 3.1 &lt;a href="https://bugs.ruby-lang.org/issues/17370"&gt;provided helpful concurrency handling&lt;/a&gt; for the address lookup using &lt;a href="https://blog.saeloun.com/2022/03/01/ruby-fibers-101/"&gt;Fibers&lt;/a&gt;. Fortunately, this happens to include our database adapter, &lt;a href="https://github.com/ged/ruby-pg/blob/c6d1047b419e87e57aaec076d9338e80f9bd269a/lib/pg/connection.rb#L841"&gt;pg&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This means that we can add helpful instrumentation directly in Ruby. It&amp;rsquo;s easy to measure all DNS lookups that use &lt;a href="https://ruby-doc.org/3.3.6/exts/socket/Addrinfo.html#method-c-getaddrinfo"&gt;Addrinfo.getaddrinfo&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id="sampling-methodology"&gt;Sampling methodology&lt;/h2&gt;

&lt;p&gt;Methodologically speaking, if you are trying to instrument your DNS lookup calls, it doesn&amp;rsquo;t necessarily matter whether you capture every request. Even if you &lt;em&gt;can&lt;/em&gt; manage to do that, you might not &lt;em&gt;want&lt;/em&gt; to, as instrumentation has a runtime (and resource) cost. Particularly if you have a high volume of DNS lookups, you might want to use a random sampling approach, and only instrument X percent of DNS requests.&lt;/p&gt;

&lt;p&gt;The approach I came up with was this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Instrument a small percentage of all DNS lookups (like 2%).&lt;/li&gt;
&lt;li&gt;Also, instrument all DNS lookups that are slower than a certain threshold (say 100ms, if you are expecting fast local DNS resolution).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you record those two data streams separately, then you can get a good sense of aggregate performance, &lt;em&gt;plus&lt;/em&gt; good coverage of the worst case scenario.&lt;/p&gt;
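
&lt;p&gt;One way to keep the two streams separate, sketched with made-up names and the example numbers from the list above (2% and 100ms):&lt;/p&gt;

&lt;pre&gt;&lt;code class="ruby"&gt;# Hypothetical helper: decide which data streams a single lookup belongs to.
# duration is in seconds. roll is a random number in [0, 1); accepting it as
# an argument keeps the function deterministic when you want to test it.
def dns_streams_for(duration, roll: Kernel.rand, sampling_rate: 0.02, slow_threshold: 0.1)
  streams = []
  streams.push(:sampled) if roll &amp;lt; sampling_rate
  streams.push(:slow) if duration &amp;gt; slow_threshold
  streams
end

# A slow lookup that also happens to fall into the random sample
# lands in both streams.
p dns_streams_for(0.25, roll: 0.001)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Reporting each returned stream under its own metric name keeps the random sample unbiased by the slow outliers.&lt;/p&gt;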

&lt;h2 id="implementation"&gt;Implementation&lt;/h2&gt;

&lt;p&gt;Monkey patching the Ruby standard library is something to approach cautiously. It has some risk of breaking shared functionality and of adding extra request latency (if the instrumentation layer takes any time).&lt;/p&gt;

&lt;pre&gt;&lt;code class="ruby"&gt;require &amp;quot;socket&amp;quot;

module DNSInstrumentation
  SAMPLING_RATE = 0.01 # fraction of lookups to sample at random (1%)
  SLOW_QUERY_REPORTING_THRESHOLD = 0.15 # seconds; slower lookups are always reported

  def getaddrinfo(...)
    start_at = Process.clock_gettime(Process::CLOCK_MONOTONIC)

    result = super

    duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_at

    if duration &amp;gt; SLOW_QUERY_REPORTING_THRESHOLD || Kernel.rand &amp;lt; SAMPLING_RATE
      instrument(:dns_query, duration)
    end

    result
  end
end

# Apply the patch at boot time. (dns_instrumentation_enabled? stands in for
# whatever toggle your application uses.)
Addrinfo.singleton_class.prepend(DNSInstrumentation) if dns_instrumentation_enabled?
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;(Side note: It drives me slightly crazy that we have to resolve our database server&amp;rsquo;s IP address for every request, since for tenant partitioning reasons we have to establish a new database connection on every request and &lt;code&gt;pg&lt;/code&gt; will look up the database server&amp;rsquo;s IP from its hostname every single time&amp;hellip; There are ways around this, like pinning tenants to specific web workers or adding application-level DNS caching, but we aren&amp;rsquo;t currently looking into them.)&lt;/p&gt;

&lt;p&gt;Suppose your &lt;code&gt;instrument&lt;/code&gt; method just prints to stdout:&lt;/p&gt;

&lt;pre&gt;&lt;code class="ruby"&gt;def instrument(type, duration)
  puts &amp;quot;DNS query took: #{duration}&amp;quot;
end
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Then you can see results like this:&lt;/p&gt;

&lt;pre&gt;&lt;code class="ruby"&gt;irb(main):008&amp;gt; Addrinfo.getaddrinfo(&amp;quot;www.decasia.org&amp;quot;, 80)
DNS query took: 0.048221999779343605
=&amp;gt;
[#&amp;lt;Addrinfo: [2606:4700:3031::6815:51e0]:80 UDP (www.decasia.org)&amp;gt;,
 #&amp;lt;Addrinfo: [2606:4700:3031::6815:51e0]:80 TCP (www.decasia.org)&amp;gt;,
 #&amp;lt;Addrinfo: [2606:4700:3037::ac43:a564]:80 UDP (www.decasia.org)&amp;gt;,
 #&amp;lt;Addrinfo: [2606:4700:3037::ac43:a564]:80 TCP (www.decasia.org)&amp;gt;,
 #&amp;lt;Addrinfo: 172.67.165.100:80 UDP (www.decasia.org)&amp;gt;,
 #&amp;lt;Addrinfo: 172.67.165.100:80 TCP (www.decasia.org)&amp;gt;,
 #&amp;lt;Addrinfo: 104.21.81.224:80 UDP (www.decasia.org)&amp;gt;,
 #&amp;lt;Addrinfo: 104.21.81.224:80 TCP (www.decasia.org)&amp;gt;]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;(48ms is slow, but the second time I look up the same hostname, it takes only 1.7ms, presumably because macOS, unlike Linux, does run a local DNS cache by default.)&lt;/p&gt;

&lt;h2 id="commentary"&gt;Commentary&lt;/h2&gt;

&lt;p&gt;There&amp;rsquo;s nothing very technically remarkable about this instrumentation strategy, except that I didn&amp;rsquo;t realize we could do this until we did it.&lt;/p&gt;

&lt;p&gt;Like a lot of things in software - you aren&amp;rsquo;t sure in advance if something is possible, and then when you need to know it, you just have to figure it out at the time.&lt;/p&gt;

&lt;p&gt;On-demand knowledge, one could call it.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>True BASIC/4d/Star Pattern/01</title>
    <link rel="alternate" href="/tech/2024/12/truebasic.html"/>
    <id>/tech/2024/12/truebasic.html</id>
    <published>2024-12-04T20:39:00+00:00</published>
    <updated>2024-12-04T20:39:00+00:00</updated>
    <author>
      <name>Eli Thorkelson</name>
    </author>
    <summary type="html">&lt;p&gt;I went looking for the source code from my first ever C program (a screen saver). I couldn’t find it, but instead I found the homework exercises from my first ever computer programming class in high school. They are all written in a BASIC flavor called...&lt;/p&gt;</summary>
    <content type="html">&lt;p&gt;I went looking for the source code from my first ever C program (a screen saver). I couldn&amp;rsquo;t find it, but instead I found the homework exercises from my first ever computer programming class in high school. They are all written in a BASIC flavor called &lt;a href="https://en.wikipedia.org/wiki/True_BASIC"&gt;True BASIC&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This class was taught by an aging math teacher who liked playing bridge and Go with his students after school - he was kind of a classic late-20th-century US nerd figure, with glasses and maybe, like, a pocket protector. I guess he&amp;rsquo;s probably not still around anymore, which is sad to realize. He always liked me and encouraged me to keep learning about computers. I didn&amp;rsquo;t really understand that I was lucky to get that kind of encouragement.&lt;/p&gt;

&lt;p&gt;I remember this class being way too easy for me at the time. The most complicated thing in it was a quicksort implementation.&lt;/p&gt;

&lt;p&gt;A more typical exercise was this:&lt;/p&gt;

&lt;pre&gt;&lt;code class="basic"&gt;REM Program: True BASIC/4d/Star Pattern/01         5/5
REM Purpose: To print a pattern of stars:
REM                *
REM                **
REM                ***
REM                ****
REM              for 40 rows and then go in reverse:
REM                40 *s
REM                39 *s
REM Author: Eli Thorkelson
REM Date: 14 Mar 1997
REM These are the opening statements
PRINT &amp;quot;                        StarPattern 1.0&amp;quot;
PRINT &amp;quot;                       By Eli Thorkelson&amp;quot;
PRINT &amp;quot;&amp;quot;
PRINT &amp;quot;&amp;quot;
PRINT &amp;quot;                    Press any key when ready&amp;quot;
GET KEY useless                   !This variable is useless
REM Beginning of the first FOR-NEXT loop
REM           This loop prints the first half of the *s
FOR step1 = 1 to 40
    PRINT &amp;quot;&amp;quot;                      ! This moves to the next line
    FOR step2 = 1 to step1
        PRINT &amp;quot;*&amp;quot;;
    NEXT step2
NEXT step1
REM Beginning of the second FOR-NEXT loop
REM           This loop prints the second half of the *s
FOR step1 = 40 to 1 step -1
    PRINT &amp;quot;&amp;quot;                      ! Moves to the next line
    FOR step2 = 1 to step1
        PRINT &amp;quot;*&amp;quot;;
    NEXT step2
NEXT step1
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;I haven&amp;rsquo;t looked at this language in such a long time. I like the nested loop implementations. It&amp;rsquo;s both clumsy and oddly easy to read.&lt;/p&gt;

&lt;p&gt;My favorite line, clearly, is &lt;code&gt;GET KEY useless !This variable is useless&lt;/code&gt;, although it should really have read &lt;code&gt;!This comment is useless&lt;/code&gt; since the variable name is already self-documenting. Perhaps &lt;code&gt;unused&lt;/code&gt; would have been a better name than &lt;code&gt;useless&lt;/code&gt;, though.&lt;/p&gt;

&lt;p&gt;I tried to figure out how to execute this program. I managed to get some of it to run on &lt;a href="https://qbjs.org/"&gt;qbjs.org&lt;/a&gt;. I had to shorten it to ten rows instead of 40, and remove the confirmation step.&lt;/p&gt;

&lt;pre&gt;&lt;code class="txt"&gt;                        StarPattern 1.0
                       By Eli Thorkelson


*
**
***
****
*****
******
*******
********
*********
**********
**********
*********
********
*******
******
*****
****
***
**
*
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;I honestly have no idea why this was even a homework exercise. I guess it assesses whether we could write nested loops. Or maybe our teacher just liked seeing stars.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>DIY unit tests</title>
    <link rel="alternate" href="/tech/2024/11/diy-unit-tests.html"/>
    <id>/tech/2024/11/diy-unit-tests.html</id>
    <published>2024-11-20T19:34:00+00:00</published>
    <updated>2024-11-20T19:34:00+00:00</updated>
    <author>
      <name>Eli Thorkelson</name>
    </author>
    <summary type="html">&lt;p&gt;Earlier this year, I got &lt;a href="https://decasia.org/tech/2024/01/the-perils-of-ipv6.html"&gt;rid of my virtual private server&lt;/a&gt; and migrated the content to a static hosting system. In the process, I broke an old web project that had a submission form.&lt;/p&gt;

&lt;p&gt;The submission form used to be handled by a small Go application...&lt;/p&gt;</summary>
    <content type="html">&lt;p&gt;Earlier this year, I got &lt;a href="https://decasia.org/tech/2024/01/the-perils-of-ipv6.html"&gt;rid of my virtual private server&lt;/a&gt; and migrated the content to a static hosting system. In the process, I broke an old web project that had a submission form.&lt;/p&gt;

&lt;p&gt;The submission form used to be handled by a small Go application that accepted form input, validated it, persisted it to a SQLite database, and then wrote new JSON data for the public website. Obviously, that broke when I got rid of the server where the application was hosted.&lt;/p&gt;

&lt;p&gt;I decided to rewrite the form handler as an AWS Lambda function that ingested JSON and updated some S3 content. I hadn&amp;rsquo;t ever written a Lambda function before, and I decided I should try to deploy it with Terraform, which I also haven&amp;rsquo;t touched, so there was a slightly painful learning curve. It took an evening to get it all working again.&lt;/p&gt;

&lt;p&gt;You can write Lambda functions in Ruby, Golang, Java, and Python, among others, but I decided to write mine in Node, since it seems like the quintessential &amp;ldquo;serverless&amp;rdquo; language.&lt;/p&gt;

&lt;p&gt;I decided I had to have unit tests for my lambda function. But as far as I know, Node doesn&amp;rsquo;t come with a built-in test framework.* People generally use third party test frameworks for Node apps.&lt;/p&gt;

&lt;p&gt;And the problem is, I didn&amp;rsquo;t really feel like installing a new dependency just to write some unit tests for one single lambda function. Dependencies are debts. Better to avoid them if you don&amp;rsquo;t really need them.&lt;/p&gt;

&lt;p&gt;So I wrote my own tiny test framework.&lt;/p&gt;

&lt;h2 id="a-diy-test-framework"&gt;A DIY test framework&lt;/h2&gt;

&lt;p&gt;It is structured as follows.&lt;/p&gt;

&lt;p&gt;It does this at the top of the file:&lt;/p&gt;

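&lt;pre&gt;&lt;code class="js"&gt;import {inspect} from &amp;#39;node:util&amp;#39;; // Used by the assertions to format failure output
import {testables} from &amp;#39;./index.mjs&amp;#39;; // A map of functions to be tested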

const testSuite = []; // A list of tests - a &amp;quot;suite&amp;quot; if you will

const test = (functionUnderTest, implementation) =&amp;gt; {
  testSuite.push(
    {
      name: functionUnderTest, // the name of the function you want to test
      f: testables[functionUnderTest],
      implementation: implementation // test implementation
    }
  );
};
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;It then provides you two simple assertions.&lt;/p&gt;

&lt;pre&gt;&lt;code class="js"&gt;
const assertEq = (value, expect, eq = (a, b) =&amp;gt; { return a == b }) =&amp;gt; {
  if (!eq(value, expect)) {
    console.error(`  ❌ ${inspect(value)} != ${inspect(expect)}`);
    process.exit(1);
  } else {
    console.info(`  ✅ ${value} == ${expect}`);
  }
}

const assertJsonEq = (value, expect) =&amp;gt; {
  return assertEq(
    value,
    expect,
    (a, b) =&amp;gt; { return JSON.stringify(a) == JSON.stringify(b) }
  );
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;These will &lt;code&gt;exit(1)&lt;/code&gt; and print an error whenever the test fails. I didn&amp;rsquo;t need anything more than this.&lt;/p&gt;

&lt;p&gt;Finally, at the bottom of the file, there&amp;rsquo;s a test runner.&lt;/p&gt;

&lt;pre&gt;&lt;code class="js"&gt;const main = async () =&amp;gt; {
  for (let t of testSuite) {
    console.info(`Testing ${t.name}...`);
    await t.implementation(t.f);
  }

  console.info(&amp;#39;\n🦄 Test suite successful 🦄&amp;#39;);
};

main();
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If you invoke this with &lt;code&gt;node test.mjs&lt;/code&gt;, it runs all the tests in order, exiting with an error as soon as any test fails.&lt;/p&gt;

&lt;p&gt;An example test declaration looks like this:&lt;/p&gt;

&lt;pre&gt;&lt;code class="js"&gt;test(&amp;#39;beep&amp;#39;, (f) =&amp;gt; {
  assertEq(f(1), &amp;quot;beep&amp;quot;);
  assertEq(f(2), &amp;quot;boop&amp;quot;);
});
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This just asserts that &lt;code&gt;f(1)&lt;/code&gt; equals &lt;code&gt;&amp;quot;beep&amp;quot;&lt;/code&gt;, and &lt;code&gt;f(2)&lt;/code&gt; equals &lt;code&gt;&amp;quot;boop&amp;quot;&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The test implementation accepts one argument &lt;code&gt;f&lt;/code&gt;, which will be set to the function called &lt;code&gt;beep&lt;/code&gt; from the original module. We retrieve the &lt;code&gt;beep()&lt;/code&gt; function from the &lt;code&gt;testables&lt;/code&gt; module export that we imported at the top. (This all presumes you are using ES6 modules or &lt;a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Modules"&gt;whatever they&amp;rsquo;re called now&lt;/a&gt;.)&lt;/p&gt;

&lt;h2 id="discussion"&gt;Discussion&lt;/h2&gt;

&lt;p&gt;I was pretty happy with this amount of test tooling for the tiny personal project I was working on. For my purpose, it did an excellent job - in particular, it got the lambda function ready for deployment with relatively high confidence that it would work. And it did deploy with minimal issues (it had a few IAM issues, and one function signature issue that the unit tests missed).&lt;/p&gt;

&lt;p&gt;I&amp;rsquo;m not saying this test framework is good enough for a large project. It always fails fast, which isn&amp;rsquo;t necessarily what you want. It doesn&amp;rsquo;t provide multiple output formats, flexible test naming, complex matchers, test randomization, timing data, mocks and stubs, memoized test data setup, or much else.&lt;/p&gt;

&lt;p&gt;However, by sacrificing a lot of what you usually get from a test framework, it also gains a few things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It has a very clean, readable DSL (in my opinion).&lt;/li&gt;
&lt;li&gt;It has zero dependencies.&lt;/li&gt;
&lt;li&gt;It is very quick to run.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;But most importantly, it reminds me that &lt;strong&gt;you don&amp;rsquo;t always need a framework&lt;/strong&gt;. The concepts here are so straightforward that for a simple use case, you can write your own assertions and your own runner, and it basically just works.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note 1: Node does come with an &lt;a href="https://nodejs.org/docs/latest-v16.x/api/assert.html"&gt;assertion framework in the standard library&lt;/a&gt;, even if not a full test framework. I would probably use their assertions next time, instead of the DIY assertions.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note 2: In terms of dependencies, I did end up adding one AWS mocking library, &lt;a href="https://github.com/m-radzikowski/aws-sdk-client-mock"&gt;aws-sdk-client-mock&lt;/a&gt;, to be able to test functions that used the AWS SDK. But this isn&amp;rsquo;t essential to the basic DIY test framework here.&lt;/em&gt;&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Reading programming languages you don't understand</title>
    <link rel="alternate" href="/tech/2024/11/reading-unknown-languages.html"/>
    <id>/tech/2024/11/reading-unknown-languages.html</id>
    <published>2024-11-10T07:45:00+00:00</published>
    <updated>2024-11-10T07:45:00+00:00</updated>
    <author>
      <name>Eli Thorkelson</name>
    </author>
    <summary type="html">&lt;p&gt;Sometimes you want to read code you can’t understand. Sometimes you want to read &lt;em&gt;languages&lt;/em&gt; you’ve never learned.&lt;/p&gt;

&lt;p&gt;That seems like it should be impossible. No one can speak every language. There are limits to understanding.&lt;/p&gt;

&lt;p&gt;The analogy with human languages...&lt;/p&gt;</summary>
    <content type="html">&lt;p&gt;Sometimes you want to read code you can&amp;rsquo;t understand. Sometimes you want to read &lt;em&gt;languages&lt;/em&gt; you&amp;rsquo;ve never learned.&lt;/p&gt;

&lt;p&gt;That seems like it should be impossible. No one can speak every language. There are limits to understanding.&lt;/p&gt;

&lt;p&gt;The analogy with human languages isn&amp;rsquo;t encouraging. If you try to understand a human language you&amp;rsquo;ve never learned, it&amp;rsquo;s often utterly unintelligible. And acquiring a new human language can be a very slow process, often taking years.&lt;/p&gt;

&lt;p&gt;I&amp;rsquo;m probably not going to spend years learning a new programming language just to read a random codebase. Nevertheless, there are lots of technical systems I want to understand at work, and even when I don&amp;rsquo;t have much other context, I can usually find their git repositories. So I open up their source code to see what I find, even if it&amp;rsquo;s not in a programming language I know.&lt;/p&gt;

&lt;p&gt;I&amp;rsquo;ve tried this out recently with Elixir, Kotlin, and Scala. And what surprises me is that, contrary to my intuitions that &lt;em&gt;this should not work&lt;/em&gt;, I often learn something from reading.&lt;/p&gt;

&lt;h2 id="example-where-does-this-configuration-value-come-from"&gt;Example: Where does this configuration value come from?&lt;/h2&gt;

&lt;p&gt;Here&amp;rsquo;s a real life example.&lt;/p&gt;

&lt;p&gt;We&amp;rsquo;d like to update a configuration value in some other system. I&amp;rsquo;d like to know how the configuration value is set in the first place, so we can update what needs updating.&lt;/p&gt;

&lt;p&gt;I would probably start by trying to find the calling site where the configuration is used. A controller class in a web service, perhaps. From there, I can look for the origins of whatever value I&amp;rsquo;m trying to trace. Then I can follow that object (variable, parameter, function, etc) back through the codebase to see if I can arrive at a point of origin.&lt;/p&gt;

&lt;p&gt;Last time I tried this exercise, I did find the right answer, even though the codebase was in Scala. (Then I checked my answer with an expert to confirm.)&lt;/p&gt;

&lt;p&gt;In a case like this, we know up front that there are only so many possible sources of initial state for a program: configuration files, external systems, constants, user input, and so on. Even if we don&amp;rsquo;t know the language, we have a sense of the problem space.&lt;/p&gt;

&lt;h2 id="concretely-how-do-you-read-unknown-languages"&gt;Concretely, how do you read unknown languages?&lt;/h2&gt;

&lt;p&gt;This isn&amp;rsquo;t a very satisfactory answer, but I think you just kind of&amp;hellip; do your best.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Scan source code&lt;/li&gt;
&lt;li&gt;Attempt to parse, liberally using your knowledge of other languages, ignoring unfamiliar symbols or syntax&lt;/li&gt;
&lt;li&gt;Trace execution flow via function names and variable name flow.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is a problematic reading strategy, and we know that up front. But understanding is a gradient, after all. You can get farther along, or you can get stuck quickly. And depending on the case, there can be greater or lesser degrees of interpretive uncertainty. You understand, but &lt;em&gt;with an asterisk&lt;/em&gt;; you acknowledge where you&amp;rsquo;re guessing. &amp;ldquo;I understand this bit *unless the &lt;code&gt;|&lt;/code&gt; operator does something totally unexpected.&amp;rdquo; That kind of thing.&lt;/p&gt;

&lt;p&gt;A lot of programming languages have similar syntax and semantics. And the program structure can be apparent, from naming and file organizations, even if you don&amp;rsquo;t understand exactly how the language works.&lt;/p&gt;

&lt;p&gt;(I suppose at this point in history, it would probably work well to ask an LLM to explain a codebase to me. I suppose I will try it. I doubt it will be honest with me about the parts it isn&amp;rsquo;t sure about, though.)&lt;/p&gt;

&lt;h2 id="why-not-just-ask-for-help"&gt;Why not just ask for help?&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;There&amp;rsquo;s nothing wrong with asking someone for help&lt;/em&gt;, needless to say. Often asking is necessary; often it&amp;rsquo;s much faster than any alternative.&lt;/p&gt;

&lt;p&gt;However, there are reasons not to always ask &lt;em&gt;first&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Asking can be slow. Implicitly, it can cost you something, in some annoying social-capital sense. It feels uncharitable to interrupt someone else to get answers you could trivially find yourself.&lt;/p&gt;

&lt;p&gt;So I ask questions all the time, but I don&amp;rsquo;t usually ask before trying to self-solve. If the answer becomes clear from my own research, that&amp;rsquo;s great. If not, I can ask for help with a little more context than I had at first.&lt;/p&gt;

&lt;p&gt;This gets us to the higher order reasons to try to do your own research first. Reading other teams&amp;rsquo; codebases can help to build relationships with them. If I understand something about their systems, I might be able to have better conversations, starting from a point of greater understanding. &amp;ldquo;I looked at ABC, and I think your system handles X by doing Y; is that right?&amp;rdquo; I think people usually respond better to questions like this, as they give a clearer point of departure.&lt;/p&gt;

&lt;p&gt;And down the road, it usually comes in handy to know something about other parts of the technical ecosystem I work in.&lt;/p&gt;

&lt;h2 id="the-illusion-of-not-understanding"&gt;The illusion of not understanding&lt;/h2&gt;

&lt;p&gt;Sometimes understanding is portrayed as more binary than it really is. &amp;ldquo;Either you understand something or you don&amp;rsquo;t.&amp;rdquo; But the truth is, partial understanding is often all we have.&lt;/p&gt;

&lt;p&gt;This is exacerbated in the case of language use, I think. There&amp;rsquo;s so much shaming around our understanding of languages. It&amp;rsquo;s risky to say you understand something, and then miss something that an expert would catch, and then get put down for it.&lt;/p&gt;

&lt;p&gt;It&amp;rsquo;s exacerbated in software, too, where the economy likes to put us in boxes. You have to list your programming languages on your CV and it&amp;rsquo;s presumed you can&amp;rsquo;t use anything else.&lt;/p&gt;

&lt;p&gt;Meanwhile, everything I&amp;rsquo;ve learned from reading unknown programming languages points away from these dogmas.&lt;/p&gt;

&lt;p&gt;Understanding is a gradient, and there are ways of making uncertainty work for us, rather than undermining us.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Golang, flow state, and Copilot</title>
    <link rel="alternate" href="/tech/2024/10/golang-flow-state.html"/>
    <id>/tech/2024/10/golang-flow-state.html</id>
    <published>2024-10-29T19:49:00+00:00</published>
    <updated>2024-10-29T19:49:00+00:00</updated>
    <author>
      <name>Eli Thorkelson</name>
    </author>
    <summary type="html">&lt;p&gt;Lately I started to get that excellent sense of &lt;em&gt;flow state&lt;/em&gt; while working on my Golang project.&lt;/p&gt;

&lt;p&gt;The flow state of programming is such a nice feeling, when it happens. I don’t get to experience it as much as I used to, because so much of my work currently...&lt;/p&gt;</summary>
    <content type="html">&lt;p&gt;Lately I started to get that excellent sense of &lt;em&gt;flow state&lt;/em&gt; while working on my Golang project.&lt;/p&gt;

&lt;p&gt;The flow state of programming is such a nice feeling, when it happens. I don&amp;rsquo;t get to experience it as much as I used to, because so much of my work currently consists of meetings, plans, Slack messages, and things that don&amp;rsquo;t involve long periods of focus.&lt;/p&gt;

&lt;p&gt;And when I first started writing Go for this project, I found it a little tedious. Go has a deliberately limited vocabulary and a lot of boilerplate error handling.&lt;/p&gt;

&lt;p&gt;But after a few solid days, I started to really enjoy writing Go. It&amp;rsquo;s not as elegant as Swift (the other strongly typed language I&amp;rsquo;ve played with lately). But it doesn&amp;rsquo;t have the &lt;a href="https://danielchasehooper.com/posts/why-swift-is-slow/"&gt;horrible type inference slowdowns&lt;/a&gt; either. It&amp;rsquo;s more like writing C, but with a much better standard library, fewer semicolons, and a better DX.&lt;/p&gt;

&lt;p&gt;What&amp;rsquo;s good about writing C, and also about writing Go, is this: &lt;em&gt;It&amp;rsquo;s all functions and structs.&lt;/em&gt; The building blocks are very uniform. If you can write a good function, with a sensible name, and compose it nicely with other functions, using sensible data structures, then the program almost starts to write itself. It has a kind of minimalist harmony.&lt;/p&gt;

&lt;p&gt;The flow is mostly this: a flow of writing functions, specifying data structures, extracting shared functionality, arranging a project into sensible packages, and testing.&lt;/p&gt;

&lt;p&gt;And quite often, if my Go program can compile, then the behavior is fundamentally correct out of the box. I do write tests too, and I test interactively. But it really improves flow state and productivity to know that, &lt;em&gt;if my code compiles&lt;/em&gt;, then there&amp;rsquo;s an excellent chance it&amp;rsquo;s already in working order.&lt;/p&gt;

&lt;p&gt;I started to ignore the boilerplate, too. I used to be annoyed by all the &lt;code&gt;if err != nil&lt;/code&gt; blocks, but lately I just &amp;hellip; don&amp;rsquo;t pay them much attention.&lt;/p&gt;

&lt;h2 id="notes-on-using-copilot"&gt;Notes on using Copilot&lt;/h2&gt;

&lt;p&gt;It&amp;rsquo;s the first time I&amp;rsquo;ve ever used Copilot with a new programming language.&lt;/p&gt;

&lt;p&gt;I&amp;rsquo;m more excited currently about the code autocompletion than about the Copilot chat. The chat has lackluster performance and a bad UX for applying changes. The autocompletion is &lt;em&gt;quite good&lt;/em&gt; and saves me a bunch of typing. It tends to be fast at noticing common patterns and wanting to apply them. It&amp;rsquo;s sometimes incorrect, of course, but I always re-read and edit the output as soon as I hit tab.&lt;/p&gt;

&lt;p&gt;Sometimes I use the Copilot chat to basically substitute for things I would Google. &amp;ldquo;How do you unmarshal a blob of YAML data into a slice of structs?&amp;rdquo; The result from Copilot is similar in quality to the result from StackOverflow. (Copilot is clearly less reliable than the excellent &lt;a href="https://gobyexample.com/"&gt;gobyexample.com&lt;/a&gt;, however.)&lt;/p&gt;

&lt;p&gt;Sometimes it improves flow state to use Copilot, because you don&amp;rsquo;t have to leave your text editor.&lt;/p&gt;

&lt;p&gt;Sometimes Copilot breaks flow state, though, by giving slow or unreliable answers. I&amp;rsquo;m torn about that part.&lt;/p&gt;

&lt;h2 id="things-that-break-flow-state"&gt;Things that break flow state&lt;/h2&gt;

&lt;p&gt;I did find a few strange things:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failed type inferences&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Go tooling in VSCode is mostly great, but there&amp;rsquo;s an edge case where the Go language server can fail to infer return types correctly. In the case I stumbled into, my code would still compile and run correctly, but in VSCode, some files were wrongly marked as broken, as having invalid types in their function signatures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Circular dependencies&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can&amp;rsquo;t have circular package dependencies, which makes sense. But what if package A depends on package B and C, and then package C wants to use a struct that&amp;rsquo;s defined in package B?&lt;/p&gt;

&lt;p&gt;I understand it&amp;rsquo;s idiomatic to use interfaces in such circumstances, and then the interface definition can just be imported from some base package that has no dependencies. This avoids any circularities.&lt;/p&gt;

&lt;p&gt;I was, however, surprised that Go&amp;rsquo;s interfaces don&amp;rsquo;t cover fields; they only cover methods. That feels a bit redundant if all you need is to define a few accessors.&lt;/p&gt;
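
&lt;p&gt;As a minimal sketch of the accessor approach, squeezed into a single file: the names &lt;code&gt;Named&lt;/code&gt;, &lt;code&gt;Recipe&lt;/code&gt;, and &lt;code&gt;describe&lt;/code&gt; are hypothetical stand-ins for an interface in a dependency-free base package, a struct in package B, and a consumer in package C.&lt;/p&gt;

&lt;pre&gt;&lt;code class="go"&gt;package main

import "fmt"

// Named is a hypothetical interface that could live in a base package
// with no dependencies of its own.
type Named interface {
      Name() string
}

// Recipe stands in for a struct defined in "package B".
type Recipe struct {
      name string
}

// Name is the accessor method. The interface cannot expose the field
// itself, so a method has to do it.
func (r Recipe) Name() string { return r.name }

// describe stands in for code in "package C". It depends only on Named,
// never on Recipe.
func describe(n Named) string {
      return "item: " + n.Name()
}

func main() {
      fmt.Println(describe(Recipe{name: "orzo"})) // prints "item: orzo"
}
&lt;/code&gt;&lt;/pre&gt;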

&lt;p&gt;This brings me to a related issue with interfaces.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Inconsistent calling semantics for interfaces&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Suppose you have an interface for a &lt;code&gt;Point&lt;/code&gt; and you have a struct that implements it:&lt;/p&gt;

&lt;pre&gt;&lt;code class="go"&gt;type Point struct {
      X int
      Y int
}

func (p Point) s() string {
      return fmt.Sprintf(&amp;quot;%d/%d&amp;quot;, p.X, p.Y)
}

type PointInterface interface {
      s() string
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;em&gt;This&lt;/em&gt; works:&lt;/p&gt;

&lt;pre&gt;&lt;code class="go"&gt;func printPoint(p PointInterface) {
      fmt.Println(p.s())
}

func main() {
      p := Point{5, 2}
      printPoint(p)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;em&gt;This&lt;/em&gt; does &lt;strong&gt;not&lt;/strong&gt; work:&lt;/p&gt;

&lt;pre&gt;&lt;code class="go"&gt;func printList(list []PointInterface) {
      for i := range list {
            fmt.Println(list[i].s())
      }
}

func main() {
      plist := []Point{{10, 20}, {20, 30}}

      printList(plist)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The error reads: &lt;code&gt;cannot use plist (variable of type []Point) as []PointInterface value in argument to printList&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Basically, you can pass a &lt;code&gt;Point&lt;/code&gt; into a function that expects a &lt;code&gt;PointInterface&lt;/code&gt;. But you can&amp;rsquo;t pass &lt;code&gt;[]Point&lt;/code&gt; to a function that expects a &lt;code&gt;[]PointInterface&lt;/code&gt;. Instead, you have to do a conversion step to manually transform a slice of structs into a slice of interfaces.&lt;/p&gt;
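
&lt;p&gt;The manual conversion might look something like the sketch below. It repeats the types from above so that it compiles on its own; the helper name &lt;code&gt;toInterfaces&lt;/code&gt; is my own invention, not anything from the standard library.&lt;/p&gt;

&lt;pre&gt;&lt;code class="go"&gt;package main

import "fmt"

type Point struct {
      X int
      Y int
}

func (p Point) s() string {
      return fmt.Sprintf("%d/%d", p.X, p.Y)
}

type PointInterface interface {
      s() string
}

// toInterfaces (hypothetical helper) copies each Point into a new
// slice of interface values, one element at a time.
func toInterfaces(points []Point) []PointInterface {
      out := make([]PointInterface, len(points))
      for i, p := range points {
            out[i] = p // each single assignment is converted implicitly
      }
      return out
}

func printList(list []PointInterface) {
      for _, p := range list {
            fmt.Println(p.s())
      }
}

func main() {
      plist := []Point{{10, 20}, {20, 30}}
      printList(toInterfaces(plist)) // prints "10/20" then "20/30"
}
&lt;/code&gt;&lt;/pre&gt;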

&lt;p&gt;Per &lt;a href="https://stackoverflow.com/a/12994852"&gt;Stack Overflow&lt;/a&gt;, the explanation seems to be this: It&amp;rsquo;s an O(N) operation to convert a slice of structs into a slice of interfaces that the struct implements. And Go doesn&amp;rsquo;t want to do that automatically. But converting just one at a time is O(1) so the compiler will handle it silently.&lt;/p&gt;

&lt;p&gt;I found this unintuitive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not my favorite ever package manager&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There&amp;rsquo;s nothing seriously wrong with &lt;code&gt;go mod&lt;/code&gt;, but I do not love it, either. I was confused by the way Go handles multiple modules inside a single project. I don&amp;rsquo;t really need multiple modules, as dividing my project into packages is plenty of structure for my use case; but I found it less than luminous.&lt;/p&gt;

&lt;p&gt;A lot better than C: I&amp;rsquo;ll say that much.&lt;/p&gt;

&lt;h2 id="in-sum"&gt;In sum&lt;/h2&gt;

&lt;p&gt;It&amp;rsquo;s such a nice experience to seriously sit down with a new language at work. I wouldn&amp;rsquo;t use Go for everything. And the boilerplate is slightly exhausting. But it clearly has a sweet spot, and it seems very maintainable once you write it.&lt;/p&gt;

&lt;p&gt;These are good things.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Mental models of the command line</title>
    <link rel="alternate" href="/tech/2024/10/bad-at-command-line.html"/>
    <id>/tech/2024/10/bad-at-command-line.html</id>
    <published>2024-10-07T08:45:00+00:00</published>
    <updated>2024-10-07T08:45:00+00:00</updated>
    <author>
      <name>Eli Thorkelson</name>
    </author>
    <summary type="html">&lt;p&gt;When I first encountered the Linux command line, I was pretty bad at it.&lt;/p&gt;

&lt;p&gt;I thought of it as a purely imperative interface.&lt;/p&gt;

&lt;p&gt;You run &lt;code&gt;xyz &amp;lt;enter&amp;gt;&lt;/code&gt; and you see some output. Then you can run &lt;code&gt;xyz&lt;/code&gt; again if you want, or a different command if you prefer.&lt;/p&gt;</summary>
    <content type="html">&lt;p&gt;When I first encountered the Linux command line, I was pretty bad at it.&lt;/p&gt;

&lt;p&gt;I thought of it as a purely imperative interface.&lt;/p&gt;

&lt;p&gt;You run &lt;code&gt;xyz &amp;lt;enter&amp;gt;&lt;/code&gt; and you see some output. Then you can run &lt;code&gt;xyz&lt;/code&gt; again if you want, or a different command if you prefer.&lt;/p&gt;

&lt;p&gt;It&amp;rsquo;s just the &amp;ldquo;push a button, get a result&amp;rdquo; model of execution. It&amp;rsquo;s not wrong — every process does return an exit code indicating success or failure — but it&amp;rsquo;s a bit limiting.&lt;/p&gt;

&lt;p&gt;The problem with the imperative model is: you can only push buttons. If you want to do something for which there&amp;rsquo;s no button, you&amp;rsquo;re stuck. It&amp;rsquo;s not a very extensible approach.&lt;/p&gt;

&lt;p&gt;I&amp;rsquo;ve learned a few things since then, but I still feel like I&amp;rsquo;m bad at the command line.&lt;/p&gt;

&lt;h2 id="map-and-filter"&gt;Map and filter&lt;/h2&gt;

&lt;p&gt;Eventually I learned Python and discovered the &lt;code&gt;map&lt;/code&gt; and &lt;code&gt;filter&lt;/code&gt; operators.&lt;/p&gt;

&lt;p&gt;Sometime afterwards, it started to sink in that the command line is also a playground for functional transformations of an input. (Conventionally, the inputs are line delimited plaintext or maybe CSV/TSV, but with tools like &lt;code&gt;jq&lt;/code&gt; it&amp;rsquo;s easier to transform other structured formats too. I&amp;rsquo;m constantly reaching for &lt;code&gt;curl | jq&lt;/code&gt;.)&lt;/p&gt;

&lt;p&gt;I find it helpful to think of something like &lt;code&gt;grep&lt;/code&gt; as a tool that implements &lt;code&gt;filter()&lt;/code&gt;, and something like &lt;code&gt;cut&lt;/code&gt; as something that implements &lt;code&gt;map()&lt;/code&gt;. That way, you can see the general patterns beneath the arcane Unix tool names. For example, you can &lt;code&gt;grep &amp;quot;WARN&amp;quot; log.txt | cut -d &amp;#39;:&amp;#39; -f2&lt;/code&gt; to effectively filter and then map the bits of a log line that you want to read.&lt;/p&gt;
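
&lt;p&gt;To make the analogy concrete, here is a rough sketch of that same pipeline written as explicit filter and map steps in Go. The function names and the sample log lines are made up for illustration, and &lt;code&gt;secondField&lt;/code&gt; only approximates what &lt;code&gt;cut&lt;/code&gt; does.&lt;/p&gt;

&lt;pre&gt;&lt;code class="go"&gt;package main

import (
      "fmt"
      "strings"
)

// filter keeps the lines for which keep returns true (the grep role).
func filter(lines []string, keep func(string) bool) []string {
      var out []string
      for _, line := range lines {
            if keep(line) {
                  out = append(out, line)
            }
      }
      return out
}

// mapLines applies transform to every line (the cut role).
func mapLines(lines []string, transform func(string) string) []string {
      out := make([]string, 0, len(lines))
      for _, line := range lines {
            out = append(out, transform(line))
      }
      return out
}

// secondField roughly mimics cut -d ':' -f2 for a single line.
func secondField(line string) string {
      _, rest, found := strings.Cut(line, ":")
      if !found {
            return line // like cut, pass through lines with no delimiter
      }
      field, _, _ := strings.Cut(rest, ":")
      return field
}

func main() {
      // Invented sample data standing in for log.txt.
      log := []string{
            "WARN:disk almost full:node-1",
            "INFO:started:node-1",
            "WARN:slow query:node-2",
      }
      isWarning := func(l string) bool { return strings.Contains(l, "WARN") }
      for _, msg := range mapLines(filter(log, isWarning), secondField) {
            fmt.Println(msg) // prints "disk almost full" then "slow query"
      }
}
&lt;/code&gt;&lt;/pre&gt;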

&lt;p&gt;I&amp;rsquo;ll mention &lt;code&gt;xargs&lt;/code&gt; (or &lt;code&gt;find -exec&lt;/code&gt;) as essential functional programming tools too. Often you want to &lt;em&gt;do&lt;/em&gt; something with the results of your functional pipelines.&lt;/p&gt;

&lt;h2 id="process-stacks-and-trees"&gt;Process stacks and trees&lt;/h2&gt;

&lt;p&gt;The pipe &lt;code&gt;|&lt;/code&gt; that we use for data transformations above is obviously a way of chaining processes together in linear execution order. It&amp;rsquo;s a bit like &lt;code&gt;.then()&lt;/code&gt; in JavaScript promises. But you can also think of the shell as a stack of processes that you can navigate up and down.&lt;/p&gt;

&lt;p&gt;I remember learning really early on about &lt;code&gt;&amp;amp;&lt;/code&gt; to run something in the background. I can&amp;rsquo;t say that I use &lt;code&gt;&amp;amp;&lt;/code&gt; much anymore. Remember when it was common to have an account on some big Unix server? The kind where you could leave some processes running in the background, log off, and come back later? Those days are gone. Now it&amp;rsquo;s all ephemeral containers and replaceable VMs.&lt;/p&gt;

&lt;p&gt;Anyway, the process stack got more interactive when I found out about using &lt;code&gt;^Z&lt;/code&gt; to background an interactive process, and then &lt;code&gt;fg&lt;/code&gt; to restore it. For example:&lt;/p&gt;

&lt;pre&gt;&lt;code class="bash"&gt;$ bin/rails console
&amp;gt;     # do some rails console commands
^Z    # leaves the rails console running in the background

$     # back at the regular shell to run something there...
$ fg
&amp;gt;     # back to the rails console again...
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Meanwhile, it&amp;rsquo;s also sometimes helpful to picture the larger process space as an irregular tree, one whose branches are weakly linked together by dictionaries of environment variables that &lt;a href="https://decasia.org/tech/2023/03/where-do-env-vars-come-from.html"&gt;propagate from parent to child processes&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id="shell-vocabulary"&gt;Shell vocabulary&lt;/h2&gt;

&lt;p&gt;Those are a couple of conceptual approaches to the shell environment. There are functional pipelines; there are process stacks; there are process trees.&lt;/p&gt;

&lt;p&gt;But it&amp;rsquo;s not just about concepts either, right? It&amp;rsquo;s about learning the &amp;ldquo;vocabulary&amp;rdquo; or, dare I say, the &amp;ldquo;culture&amp;rdquo; of the shell environment. You can&amp;rsquo;t get too far at the command line without finding out about shell configuration and aliases, or why it&amp;rsquo;s annoying to use &lt;code&gt;sh&lt;/code&gt; for your shell, or how &lt;code&gt;$PATH&lt;/code&gt; works, or how Unix groups and file modes work. It&amp;rsquo;s hugely useful to know how redirection operators &lt;code&gt;&amp;lt;&lt;/code&gt; and &lt;code&gt;&amp;gt;&lt;/code&gt; work. And certain tools like &lt;code&gt;vim&lt;/code&gt; or &lt;code&gt;curl&lt;/code&gt; are in constant use &amp;ndash; it doesn&amp;rsquo;t always matter if there is an inner logic to them; you just have to get used to them to find your way around. (I actually like &lt;code&gt;vim&lt;/code&gt;.)&lt;/p&gt;

&lt;p&gt;The shell&amp;rsquo;s culture is kind of vast, and I always feel like I&amp;rsquo;m still learning more about it.&lt;/p&gt;

&lt;h2 id="things-im-still-bad-at"&gt;Things I&amp;rsquo;m still bad at&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Never got good at AWK.&lt;/li&gt;
&lt;li&gt;Never really learned &lt;code&gt;tmux&lt;/code&gt;. (For most use cases, split panes in a terminal emulator come close enough, or maybe splits in &lt;code&gt;vim&lt;/code&gt;.)&lt;/li&gt;
&lt;li&gt;I can kind of get by with Bash scripting. I have to constantly check the user&amp;rsquo;s guide: How do you use a hash in Bash, again? How do you remember all the arcane &lt;a href="https://tldp.org/LDP/abs/html/fto.html"&gt;test operators&lt;/a&gt;?&lt;/li&gt;
&lt;li&gt;Never learned Perl or Emacs (but I don&amp;rsquo;t really miss them).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It&amp;rsquo;s funny to be writing a command line tool and still feel like a beginner, in some ways.&lt;/p&gt;

&lt;p&gt;After a while, you&amp;rsquo;re not really a beginner anymore. But you&amp;rsquo;re experienced enough to start understanding your own limits, and so you still feel &amp;hellip; modest.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Command line tool design</title>
    <link rel="alternate" href="/tech/2024/10/command-line-design.html"/>
    <id>/tech/2024/10/command-line-design.html</id>
    <published>2024-10-01T08:01:00+00:00</published>
    <updated>2024-10-01T08:01:00+00:00</updated>
    <author>
      <name>Eli Thorkelson</name>
    </author>
    <summary type="html">&lt;p&gt;I promised another team at work I would rewrite their shell script as a command line dev tool in Golang.&lt;/p&gt;

&lt;p&gt;The old UX of the script was suboptimal. It ran a slow interactive shell script that made you go through an interactive menu tree every time,...&lt;/p&gt;</summary>
    <content type="html">&lt;p&gt;I promised another team at work I would rewrite their shell script as a command line dev tool in Golang.&lt;/p&gt;

&lt;p&gt;The old UX of the script was suboptimal. It ran a slow interactive shell script that made you go through an interactive menu tree every time, even when you already knew what you wanted to do. It had to be run inside Docker to handle a python dependency, making the initial startup even more tedious.&lt;/p&gt;

&lt;p&gt;It got me thinking: What makes a good command line tool?&lt;/p&gt;

&lt;p&gt;To be sure, we already have CLI tool frameworks (like &lt;a href="https://github.com/spf13/cobra"&gt;Cobra&lt;/a&gt;) that tell us some common best practices in this area:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Standard Linux-style argument parsing&lt;/li&gt;
&lt;li&gt;Short and long option support&lt;/li&gt;
&lt;li&gt;Subcommand support (like &lt;code&gt;git blame&lt;/code&gt;, &lt;code&gt;kubectl exec&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;(I&amp;rsquo;ll use &lt;a href="https://github.com/spf13/cobra"&gt;Cobra&lt;/a&gt; for extensibility, even though the Go standard lib already has a basic &lt;a href="https://pkg.go.dev/flag"&gt;flag&lt;/a&gt; package.)&lt;/p&gt;

&lt;p&gt;A few more things I thought of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multiple usage modes: It can let you specify the exact command you want, &lt;em&gt;or&lt;/em&gt; it can show your options if you don&amp;rsquo;t know. You can type &lt;code&gt;cook rice --color=brown&lt;/code&gt; or if you just type &lt;code&gt;cook&lt;/code&gt;, it will ask if you want to make rice, barley, orzo, or couscous.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Explain&lt;/code&gt; and/or &lt;code&gt;dry run&lt;/code&gt; mode: It should be able to tell you what it will do without doing it. It should be able to explain its own activity in case of doubts.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Verbose&lt;/code&gt; output: It should give you the choice between seeing the implementation details or just seeing the TL;DR of what happened.&lt;/li&gt;
&lt;li&gt;Pretty output, using &lt;a href="https://github.com/fatih/color"&gt;colors&lt;/a&gt; and/or emojis nicely to convey meaning. (It&amp;rsquo;s 2024 - our terminals support Unicode too.)&lt;/li&gt;
&lt;li&gt;Diagnostics. It should be able to check its own configuration and self-describe.&lt;/li&gt;
&lt;li&gt;Solid help pages. &lt;code&gt;tool --help&lt;/code&gt; should be comprehensive. &lt;code&gt;tool subcommand --help&lt;/code&gt; should work too.&lt;/li&gt;
&lt;li&gt;Credential integrations. If you have to enter creds for any operations, it should make this as easy as possible. (For instance, using &lt;a href="https://keepassxc.org/docs/KeePassXC_UserGuide#_auto_type"&gt;password manager integrations&lt;/a&gt; when possible.)&lt;/li&gt;
&lt;li&gt;Progress bars. When we are doing slow operations, display progress nicely.&lt;/li&gt;
&lt;li&gt;Plays well with other tools. It should provide either colorized or noncolorized output; it should provide human and machine readable output; it should fit organically into the shell environment.&lt;/li&gt;
&lt;li&gt;Minimal dependencies.&lt;/li&gt;
&lt;/ul&gt;
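
&lt;p&gt;As a hedged sketch of the first two wishlist items (multiple usage modes and a dry run), here is one way the imaginary &lt;code&gt;cook&lt;/code&gt; tool could be laid out. It uses only the standard library &lt;code&gt;flag&lt;/code&gt; package rather than Cobra, and the &lt;code&gt;plan&lt;/code&gt; function, the &lt;code&gt;--dry-run&lt;/code&gt; flag, and the default color are all assumptions for illustration.&lt;/p&gt;

&lt;pre&gt;&lt;code class="go"&gt;package main

import (
      "flag"
      "fmt"
      "os"
)

// plan describes what the hypothetical cook tool would do for the
// given arguments (everything after the program name).
func plan(args []string) (string, error) {
      // No subcommand given: show the options instead of failing.
      if len(args) == 0 {
            return "What would you like to cook? rice, barley, orzo, or couscous", nil
      }

      // The first argument is the subcommand; the rest are its flags.
      fs := flag.NewFlagSet(args[0], flag.ContinueOnError)
      color := fs.String("color", "white", "variety to cook")
      dryRun := fs.Bool("dry-run", false, "explain what would happen without doing it")
      if err := fs.Parse(args[1:]); err != nil {
            return "", err
      }

      action := fmt.Sprintf("cook %s %s", *color, args[0])
      if *dryRun {
            return "would " + action, nil
      }
      return "running: " + action, nil
}

func main() {
      msg, err := plan(os.Args[1:])
      if err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
      }
      fmt.Println(msg)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;With this layout, &lt;code&gt;cook rice --color=brown --dry-run&lt;/code&gt; would report &lt;code&gt;would cook brown rice&lt;/code&gt;, while a bare &lt;code&gt;cook&lt;/code&gt; lists the choices. Keeping the decision logic in a function that returns a string, separate from &lt;code&gt;main&lt;/code&gt;, also makes it easy to test.&lt;/p&gt;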

&lt;p&gt;A few things I don&amp;rsquo;t like in command line tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Inadequate &lt;code&gt;--help&lt;/code&gt; output.&lt;/li&gt;
&lt;li&gt;Non-GNU option parsing style (see: JVM tools like &lt;code&gt;keytool&lt;/code&gt;; see also: &lt;code&gt;openssl&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Pedantic, when that adds no special value.&lt;/li&gt;
&lt;li&gt;Inflexible or difficult to configure.&lt;/li&gt;
&lt;li&gt;Poorly organized codebase (for tools where you can consult the codebase).&lt;/li&gt;
&lt;li&gt;Difficult to install.&lt;/li&gt;
&lt;li&gt;Difficult to upgrade.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I can&amp;rsquo;t promise anyone that I can accomplish 100% of this, but I think it&amp;rsquo;s a good start at a wishlist.&lt;/p&gt;

&lt;aside class="blog-aside"&gt;

&lt;p class="aside-title"&gt;Fun Fact&lt;/p&gt;

&lt;p&gt;Ironically, the &lt;code&gt;go&lt;/code&gt; compiler tool does something I find really unhelpful:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
$ go mod --help
go mod --help: unknown command
Run 'go help mod' for usage.
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Come on, couldn&amp;rsquo;t you just translate &lt;code&gt;go mod --help&lt;/code&gt; into &lt;code&gt;go help mod&lt;/code&gt; for me?&lt;/p&gt;

&lt;/aside&gt;
</content>
  </entry>
  <entry>
    <title>Pointers in Ruby</title>
    <link rel="alternate" href="/tech/2024/09/pointers-in-ruby.html"/>
    <id>/tech/2024/09/pointers-in-ruby.html</id>
    <published>2024-09-26T12:11:00+00:00</published>
    <updated>2024-09-26T12:11:00+00:00</updated>
    <author>
      <name>Eli Thorkelson</name>
    </author>
    <summary type="html">&lt;p&gt;&lt;em&gt;No one:&lt;/em&gt; It’s so sad that there are no pointers in Ruby.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Me:&lt;/em&gt; Did you know that you can trivially implement pointers, though?&lt;/p&gt;

&lt;pre&gt;&lt;code class="ruby"&gt;class Obj &amp;lt; Struct.new(:name, :shortcode, keyword_init: true)
  def ~
    Pointer.new(object_id: object_id)
  end
end

class...&lt;/code&gt;&lt;/pre&gt;</summary>
    <content type="html">&lt;p&gt;&lt;em&gt;No one:&lt;/em&gt; It&amp;rsquo;s so sad that there are no pointers in Ruby.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Me:&lt;/em&gt; Did you know that you can trivially implement pointers, though?&lt;/p&gt;

&lt;pre&gt;&lt;code class="ruby"&gt;class Obj &amp;lt; Struct.new(:name, :shortcode, keyword_init: true)
  def ~
    Pointer.new(object_id: object_id)
  end
end

class Pointer &amp;lt; Struct.new(:object_id, keyword_init: true)
  def +@
    ObjectSpace._id2ref(object_id)
  end
end
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;You can then reference and dereference with relatively simple semantics: &lt;code&gt;~x&lt;/code&gt; creates a &lt;code&gt;Pointer&lt;/code&gt; from an &lt;code&gt;Obj&lt;/code&gt;, and &lt;code&gt;+y&lt;/code&gt; dereferences a &lt;code&gt;Pointer&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Here, &lt;code&gt;~x&lt;/code&gt; (tilde) is similar to C&amp;rsquo;s &lt;code&gt;&amp;amp;x&lt;/code&gt;, &amp;ldquo;get the address of &lt;code&gt;x&lt;/code&gt;&amp;rdquo;.&lt;/p&gt;

&lt;p&gt;Meanwhile &lt;code&gt;+y&lt;/code&gt; is similar to C&amp;rsquo;s &lt;code&gt;*y&lt;/code&gt;, &amp;ldquo;dereference the pointer stored at &lt;code&gt;y&lt;/code&gt;&amp;rdquo;.&lt;/p&gt;

&lt;h2 id="usage"&gt;Usage&lt;/h2&gt;

&lt;pre&gt;&lt;code class="ruby"&gt;irb(main):012&amp;gt; o=Obj.new(name: &amp;quot;A&amp;quot;)
=&amp;gt; #&amp;lt;struct Obj name=&amp;quot;A&amp;quot;, shortcode=nil&amp;gt;
irb(main):013&amp;gt; p = ~o
=&amp;gt; #&amp;lt;struct Pointer object_id=29020&amp;gt;
irb(main):014&amp;gt; +p
=&amp;gt; #&amp;lt;struct Obj name=&amp;quot;A&amp;quot;, shortcode=nil&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;h2 id="what-is-this"&gt;What&amp;hellip; is this&lt;/h2&gt;

&lt;p&gt;Well, they aren&amp;rsquo;t pointers in the C sense, exactly. But you can happily use the Ruby VM object IDs for the same concept.&lt;/p&gt;

&lt;p&gt;You just have to use &lt;code&gt;.object_id&lt;/code&gt; to fetch the Object ID, and &lt;a href="https://ruby-doc.org/3.0.4/ObjectSpace.html#method-c-_id2ref"&gt;ObjectSpace._id2ref&lt;/a&gt; to dereference. And overload some unary operators to provide … some kind of semantics.&lt;/p&gt;
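
&lt;p&gt;To make the mechanics concrete, here is a minimal sketch of that round trip without the operator overloading. The variable names are made up for illustration. One caveat: a &lt;code&gt;Pointer&lt;/code&gt; only stores an ID, so it does not keep its target alive, and &lt;code&gt;_id2ref&lt;/code&gt; raises a &lt;code&gt;RangeError&lt;/code&gt; if the object has already been garbage collected.&lt;/p&gt;

&lt;pre&gt;&lt;code class="ruby"&gt;greeting = "hello"

# Every Ruby object has an integer ID that is unique while the object is alive.
address = greeting.object_id

# _id2ref maps that ID back to the very same object (not a copy).
same = ObjectSpace._id2ref(address)

puts same.equal?(greeting) # equal? tests object identity
&lt;/code&gt;&lt;/pre&gt;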

&lt;p&gt;I wish I could implement this with &lt;code&gt;*&lt;/code&gt; and &lt;code&gt;&amp;amp;&lt;/code&gt;, for visual similarity to the C operators, but Ruby doesn&amp;rsquo;t let you define unary &lt;code&gt;*&lt;/code&gt; or &lt;code&gt;&amp;amp;&lt;/code&gt;: in prefix position they are reserved for splat and block-pass syntax.&lt;/p&gt;

&lt;h2 id="why"&gt;Why???&lt;/h2&gt;

&lt;p&gt;People say you can&amp;rsquo;t do pass by reference in Ruby; you have to do &amp;ldquo;pass by object reference&amp;rdquo; or &amp;ldquo;pass reference by value&amp;rdquo;.&lt;/p&gt;

&lt;p&gt;Discussion:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://robertheaton.com/2014/07/22/is-ruby-pass-by-reference-or-pass-by-value/"&gt;https://robertheaton.com/2014/07/22/is-ruby-pass-by-reference-or-pass-by-value/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://stackoverflow.com/questions/1872110/is-ruby-pass-by-reference-or-by-value"&gt;https://stackoverflow.com/questions/1872110/is-ruby-pass-by-reference-or-by-value&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Well, now you can do pass by reference in your Ruby method calls too, if you really want to :)&lt;/p&gt;

&lt;pre&gt;&lt;code class="ruby"&gt;def welcome(obj_pointer)
  puts &amp;quot;Hello, #{(+obj_pointer).name}&amp;quot;
end

o=Obj.new(name: &amp;quot;Ambassador Spock&amp;quot;)

welcome(~o)
# prints: Hello, Ambassador Spock
&lt;/code&gt;&lt;/pre&gt;

&lt;h2 id="disclaimer"&gt;Disclaimer&lt;/h2&gt;

&lt;p&gt;Please don&amp;rsquo;t try C-style pointer arithmetic with these pointers.&lt;/p&gt;

&lt;p&gt;Not ready for production usage yet.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Always document flows</title>
    <link rel="alternate" href="/tech/2024/09/document-flows.html"/>
    <id>/tech/2024/09/document-flows.html</id>
    <published>2024-09-25T20:29:00+00:00</published>
    <updated>2024-09-25T20:29:00+00:00</updated>
    <author>
      <name>Eli Thorkelson</name>
    </author>
    <summary type="html">&lt;script type="module"&gt;
  import mermaid from 'https://cdn.jsdelivr.net/npm/mermaid@11/dist/mermaid.esm.min.mjs';
  mermaid.initialize({ startOnLoad: true, theme: 'forest' });
&lt;/script&gt;

&lt;p&gt;When you are adding something new to a large technical system, always document &lt;em&gt;flows&lt;/em&gt;.&lt;/p&gt;

&lt;h2 id="what-is-a-flow"&gt;What is...&lt;/h2&gt;</summary>
    <content type="html">&lt;script type="module"&gt;
  import mermaid from 'https://cdn.jsdelivr.net/npm/mermaid@11/dist/mermaid.esm.min.mjs';
  mermaid.initialize({ startOnLoad: true, theme: 'forest' });
&lt;/script&gt;

&lt;p&gt;When you are adding something new to a large technical system, always document &lt;em&gt;flows&lt;/em&gt;.&lt;/p&gt;

&lt;h2 id="what-is-a-flow"&gt;What is a flow?&lt;/h2&gt;

&lt;p&gt;By a &amp;ldquo;flow,&amp;rdquo; I mean a flow of information, events, user actions or business logic from point A to point B.&lt;/p&gt;

&lt;p&gt;You start somewhere. You end somewhere. You might stop at a lot of places along the way.&lt;/p&gt;

&lt;p&gt;Suppose you are implementing flows that cross lots of different systems or components.&lt;/p&gt;

&lt;p&gt;Maybe each piece of the puzzle is straightforward, or even self-documenting. It&amp;rsquo;s great when you write a system component that&amp;rsquo;s easily understandable to future maintainers.&lt;/p&gt;

&lt;p&gt;But if you have five or fifteen components, and they all have to talk to each other in a certain way, you need to understand the overall flow too.&lt;/p&gt;

&lt;p&gt;If you don&amp;rsquo;t know what the flow is, it takes a lot of archaeology work to figure it out again.&lt;/p&gt;

&lt;p&gt;That&amp;rsquo;s why you should document your flows — before you start to forget how they work.&lt;/p&gt;

&lt;h2 id="multi-system-flows"&gt;Multi system flows&lt;/h2&gt;

&lt;p&gt;Here&amp;rsquo;s an imaginary flow that involves asking a document store for a document.&lt;/p&gt;

&lt;p&gt;Behind the scenes, the document store is asking an auth service if this request is duly authenticated and authorized.&lt;/p&gt;

&lt;p&gt;In turn, the auth service is asking a directory service for information about the requester. (Let&amp;rsquo;s suppose that in this imaginary system, our authorization policies are based on group memberships in a directory service.)&lt;/p&gt;

&lt;pre class="mermaid"&gt;
sequenceDiagram
    Requester-&gt;&gt;Document Store: Requests a document
    Document Store-&gt;&gt;Auth Service: Checks authentication
    Auth Service-&gt;&gt;Directory Service: Retrieves the requester's group memberships
    Auth Service-&gt;&gt;Document Store: Indicates if the request is valid
    Document Store-&gt;&gt;Requester: Returns a document or else 401 Unauthorized
&lt;/pre&gt;

&lt;p&gt;If I gave you the source code for all these services, even if it was the most readable source code in the world, it might take you some work to figure out the order of operations of this flow.&lt;/p&gt;

&lt;p&gt;But if I gave you the source code plus this picture, you would instantly see what the overall story was, and it would be much easier to put the pieces together.&lt;/p&gt;

&lt;h2 id="but-im-working-on-a-system-that-has-no-flow-diagrams"&gt;But I&amp;rsquo;m working on a system that has no flow diagrams&lt;/h2&gt;

&lt;p&gt;It sounds like I&amp;rsquo;m talking about some utopian world where people usually document things well.&lt;/p&gt;

&lt;p&gt;In reality, we&amp;rsquo;re often working on existing systems that don&amp;rsquo;t have great documentation. They were built in a hurry or heavily modified since the initial design.&lt;/p&gt;

&lt;p&gt;Now you have to add something new, or fix some bug, and you don&amp;rsquo;t know what the larger flow is.&lt;/p&gt;

&lt;p&gt;What you should do is this: &lt;em&gt;Go ahead and draw a flow diagram&lt;/em&gt;. Even if you are documenting something that has existed for years and has never been properly documented before. The best way to understand something that already exists is to write the docs that &lt;em&gt;should&lt;/em&gt; exist for it.&lt;/p&gt;

&lt;p&gt;By writing docs for existing things, I promise you will end up understanding them better than you thought you could.&lt;/p&gt;

&lt;p&gt;(I&amp;rsquo;m not the first person to say this. I saw a team at work where every new engineer has to write docs for something as part of onboarding. I&amp;rsquo;m in awe of that.)&lt;/p&gt;

&lt;h2 id="but-what-should-my-diagram-look-like"&gt;But what should my diagram look like?&lt;/h2&gt;

&lt;p&gt;I suspect this really isn&amp;rsquo;t too important, as long as there is something that captures the system in one view.&lt;/p&gt;

&lt;p&gt;I personally am a very cartographic person and I like pictures, graphics, and technical diagrams. I love digital whiteboards like &lt;a href="https://excalidraw.com/"&gt;Excalidraw&lt;/a&gt;. The diagrams in this post are from &lt;a href="https://mermaid.js.org/"&gt;Mermaid&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If it matters, I tend to prefer flow charts over sequence diagrams. Here&amp;rsquo;s a flow chart that covers the same things as the sequence diagram above:&lt;/p&gt;

&lt;pre class="mermaid"&gt;
flowchart TD
    Requester--&gt;|Requests a document|DS
    DS(Document Store)
    AS(Auth Service)
    Dir(Directory Service)
    DS--&gt;|Checks auth|AS
    AS--&gt;|Checks requester groups|Dir
    DS--&gt;|Returns doc or 401|Requester
&lt;/pre&gt;

&lt;p&gt;But if you aren&amp;rsquo;t into the pictures, you can document a flow in words too, or with ASCII art, or whatever. It would be OK to document a flow with a quick paragraph of &amp;ldquo;TLDR: Here&amp;rsquo;s how it all works.&amp;rdquo;&lt;/p&gt;

&lt;p&gt;Here&amp;rsquo;s the sequence diagram without Mermaid:&lt;/p&gt;

&lt;pre&gt;&lt;code class="txt"&gt;Requester -&amp;gt; Document Store: Requests a document
Document Store -&amp;gt; Auth Service: Checks authentication
Auth Service -&amp;gt; Directory Service: Retrieves the requester&amp;#39;s group memberships
Auth Service -&amp;gt; Document Store: Indicates if the request is valid
Document Store -&amp;gt; Requester: Returns a document or else 401 Unauthorized
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Still pretty readable, right?&lt;/p&gt;

&lt;p&gt;I really think it&amp;rsquo;s better not to bikeshed over documentation too much. That indirectly discourages people from doing documentation, in the long run, because often people aren&amp;rsquo;t in the mood for bikeshedding.&lt;/p&gt;

&lt;h2 id="dont-document-the-wrong-things"&gt;Don&amp;rsquo;t document the wrong things.&lt;/h2&gt;

&lt;p&gt;I&amp;rsquo;m not going to comment too much on technical documentation in general, but let me just say that it&amp;rsquo;s pointless to document things that are already obvious to a competent reader.&lt;/p&gt;

&lt;p&gt;The codebase can be self-documenting &lt;em&gt;up to a point&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Let it do that. If your codebase is the source of truth for your input validations, your database schemas, your runtime configuration, or your class hierarchy, then let it express that naturally. Don&amp;rsquo;t write separate documentation for things that are sufficiently self-documenting.&lt;/p&gt;

&lt;h2 id="dont-add-new-flows-when-you-dont-have-to"&gt;Don&amp;rsquo;t add new flows when you don&amp;rsquo;t have to&lt;/h2&gt;

&lt;p&gt;Flows are often cognitively expensive.&lt;/p&gt;

&lt;p&gt;This has a practical implication. If you have the choice between using an existing flow or adding a brand new one to a system, everything else being equal, &lt;em&gt;you should prefer the former&lt;/em&gt;. Prefer fewer flows to more flows, given the choice.&lt;/p&gt;

&lt;p&gt;This can significantly reduce cognitive load for future maintainers.&lt;/p&gt;

&lt;h2 id="in-conclusion"&gt;In conclusion&lt;/h2&gt;

&lt;p&gt;Document your flows.&lt;/p&gt;

&lt;p&gt;Your future self, or your future colleagues, will appreciate it.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>A confusing issue to debug, or the mysterious case of 5000ms server delays</title>
    <link rel="alternate" href="/tech/2024/09/a-confusing-issue.html"/>
    <id>/tech/2024/09/a-confusing-issue.html</id>
    <published>2024-09-05T21:17:00+00:00</published>
    <updated>2024-09-05T21:17:00+00:00</updated>
    <author>
      <name>Eli Thorkelson</name>
    </author>
    <summary type="html">&lt;p&gt;We had an interesting case for performance analysis this week at work. Our staging server started to display a strange pattern of intermittent slow responses.&lt;/p&gt;

&lt;p&gt;It was an unusual response pattern because it wasn’t localized to certain endpoints, &lt;em&gt;nor&lt;/em&gt;&lt;/p&gt;</summary>
    <content type="html">&lt;p&gt;We had an interesting case for performance analysis this week at work. Our staging server started to display a strange pattern of intermittent slow responses.&lt;/p&gt;

&lt;p&gt;It was an unusual response pattern because it wasn&amp;rsquo;t localized to certain endpoints, &lt;em&gt;nor&lt;/em&gt; was it a general system slowdown. Overall system resources like CPU usage and memory were all fine across all the different components of our system. However, slowness was scattered randomly across almost all our service endpoints.&lt;/p&gt;

&lt;p&gt;The problem manifested as a very particular 5-second delay, intermittently added to the server response time. So instead of having the usual response time graph — where most of the endpoints are fast and then there is a long, slightly random tail of slow responses — in this case we were looking at our regular response times, &lt;em&gt;mixed together&lt;/em&gt; with bursts of requests that took about 5 seconds.&lt;/p&gt;

&lt;p&gt;The response time logs looked about like this:&lt;/p&gt;

&lt;pre&gt;&lt;code class="txt"&gt;GET / 201ms
GET / 130ms
GET / 86ms
GET / 160ms
GET / 5102ms
GET / 5073ms
GET / 95ms
GET / 188ms
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;I&amp;rsquo;m just making up these exact numbers, but you can get the general pattern.&lt;/p&gt;

&lt;p&gt;The affected endpoints included everything &lt;em&gt;except&lt;/em&gt; the service health check endpoint. Upon investigation, the health check endpoint turned out to use a different base class from all the other endpoints. That was a useful finding, because it allowed us to narrow down the problem roughly to &amp;ldquo;functionality included in our standard controller base class.&amp;rdquo;&lt;/p&gt;

&lt;h2 id="early-investigation-steps"&gt;Early investigation steps&lt;/h2&gt;

&lt;p&gt;We checked out everything we could think of and couldn&amp;rsquo;t find a root cause. Not even a good clue. Lots of dead ends.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We found a flood of erroneous healthcheck requests from a load balancer going to the wrong endpoint. Devops team fixed it.&lt;/li&gt;
&lt;li&gt;We read the codebase for our controller base class and tried to find anything that connected to external services (database IO, redis IO, etc). Found nothing useful.&lt;/li&gt;
&lt;li&gt;We checked the transaction traces from our standard observability tooling, which automatically runs profiles against slow requests and sends them for analysis. In every case, our observability tooling did &lt;em&gt;not&lt;/em&gt; pinpoint the problem. Clearly, our observability tooling had some gaps.&lt;/li&gt;
&lt;li&gt;We scanned through the application logs and the error handling reports. Found nothing.&lt;/li&gt;
&lt;li&gt;I grepped the codebase and infrastructure config for anything with a 5 second timeout. Found a few possibilities, but none of them panned out.&lt;/li&gt;
&lt;li&gt;I checked the logs and found the date when the problem started. It was the same date as a new code deployment, so I checked what changes had been deployed in that release cycle. Unfortunately, there were ~1k commits in the deployment, so it wasn&amp;rsquo;t easy to read through it all. I checked for changes to the controller base class and to the overall service configuration. Found more dead ends.&lt;/li&gt;
&lt;li&gt;Our DBA checked for suspicious or slow database queries. Found some more dead ends. Reset the postgres stats collector to get more data. Still found nothing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So.&lt;/p&gt;

&lt;p&gt;What do you do when you&amp;rsquo;ve checked all the usual observability tools and found nothing useful?&lt;/p&gt;

&lt;h2 id="automated-reproduction-of-the-bug"&gt;Automated reproduction of the bug&lt;/h2&gt;

&lt;p&gt;After a day of getting nowhere, I wrote a quick concurrent load test that attempted to reproduce the problem.&lt;/p&gt;

&lt;p&gt;It looked roughly like this:&lt;/p&gt;

&lt;pre&gt;&lt;code class="ruby"&gt;#!/usr/bin/env ruby

require &amp;quot;net/http&amp;quot;

# Invoke with ruby test.rb &amp;quot;https://path.to.server/endpoint&amp;quot; 20 10
# where 20 is the number of threads and 10 is the number of requests per thread

url = URI(ARGV[0])
num_threads = ARGV[1].to_i
requests_per_thread = ARGV[2].to_i

results = []
mutex = Mutex.new

def send_request(url)
  start = Time.now
  print &amp;quot;.&amp;quot;
  code = Net::HTTP.get_response(url).code

  { start_time: start, status: code, duration: Time.now - start }
end

threads = num_threads.times.map do |n|
  Thread.new do
    requests_per_thread.times.each do
      result = send_request(url)
      mutex.synchronize { results &amp;lt;&amp;lt; result }
    end
  end
end

threads.each(&amp;amp;:join)

puts &amp;quot;\nSent #{num_threads * requests_per_thread} requests to #{url}&amp;quot;
puts &amp;quot;  #{num_threads} threads x #{requests_per_thread} requests per thread&amp;quot;

results.sort_by { _1[:start_time] }.each do |r|
  puts &amp;quot;%s: %.2f ms - HTTP %d&amp;quot; % [
    r[:start_time].strftime(&amp;quot;%H:%M:%S.%L&amp;quot;),
    r[:duration] * 1000,
    r[:status]
  ]
end
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;(Disclaimer - not meant to be production ready code, just a quick script to find a problem.)&lt;/p&gt;

&lt;p&gt;The script instantly reproduced the problem, and gave us some quantitative metrics about its extent, which was super useful.&lt;/p&gt;

&lt;p&gt;I ran it a bunch of times against different endpoints, with varying thread counts and requests per thread. I think the smallest useful test set was 50-75 requests (5 threads), and the largest batch was 4000 requests (200 threads).&lt;/p&gt;

&lt;p&gt;The two initial findings from the script were:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The more concurrent requests we sent, the more often the 5-second slowdown happened.&lt;/li&gt;
&lt;li&gt;The health check endpoint never had the problem.&lt;/li&gt;
&lt;/ul&gt;
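
&lt;p&gt;To put a number on &amp;ldquo;how often,&amp;rdquo; a small summary step could be bolted onto the script&amp;rsquo;s &lt;code&gt;results&lt;/code&gt; array. This is only a sketch: &lt;code&gt;summarize&lt;/code&gt; and the 4.5 second cutoff are hypothetical, not part of the original script, and the sample durations are the made-up numbers from the log excerpt earlier in this post.&lt;/p&gt;

&lt;pre&gt;&lt;code class="ruby"&gt;# Assumed cutoff (in seconds) for "this request hit the ~5 second delay".
SLOW_THRESHOLD = 4.5

# Counts how many results took at least `threshold` seconds.
def summarize(results, threshold: SLOW_THRESHOLD)
  slow = results.count { |r| (threshold..).cover?(r[:duration]) }
  share = results.empty? ? 0.0 : slow.fdiv(results.size)
  { total: results.size, slow: slow, slow_share: share }
end

# Same shape as the hashes returned by send_request (duration in seconds).
sample = [0.201, 0.130, 0.086, 0.160, 5.102, 5.073, 0.095, 0.188].map do |seconds|
  { duration: seconds }
end

summary = summarize(sample)
puts format("%d of %d requests were slow (%.1f%%)",
            summary[:slow], summary[:total], summary[:slow_share] * 100)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;For the sample data this reports 2 of 8 requests as slow. Comparing that share before and after a change is a quick way to answer &amp;ldquo;did this make things better or worse?&amp;rdquo;&lt;/p&gt;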

&lt;p&gt;Later on, as we tried different things, I also used the load test script to check &amp;ldquo;Did this change make things better or worse? Is the problem still happening?&amp;rdquo; We tried a few infrastructure changes, like increasing the number of web workers and changing some karpenter node scaling configuration. They didn&amp;rsquo;t help, or at least not much.&lt;/p&gt;

&lt;h2 id="profiling-your-code"&gt;Profiling your code&lt;/h2&gt;

&lt;p&gt;Unfortunately, being able to reproduce the problem still wasn&amp;rsquo;t enough to pinpoint the root cause. And the staging system&amp;rsquo;s users were eager for answers.&lt;/p&gt;

&lt;p&gt;I tried to reproduce the problem in a few other dev/test environments. I could &lt;em&gt;not&lt;/em&gt; reproduce it on any environment besides staging.&lt;/p&gt;

&lt;p&gt;Eventually, I started to want a more detailed profile of what gets executed in our controller base class (&lt;code&gt;ApplicationController&lt;/code&gt; in Rails terminology). I set up some profilers locally (&lt;em&gt;not&lt;/em&gt; on staging) and ran them against our server (in particular, I ran &lt;a href="https://ruby-prof.github.io/"&gt;ruby-prof&lt;/a&gt; and &lt;a href="https://github.com/tmm1/stackprof"&gt;stackprof&lt;/a&gt;). The local profiler data helped me get a sense of what normal operation looks like, but, of course, it didn&amp;rsquo;t find the root cause of the staging issue.&lt;/p&gt;

&lt;p&gt;It started to seem like the only thing that would help was to run the profiler &lt;em&gt;on&lt;/em&gt; the staging system. We consider our staging system to be production-like, so we don&amp;rsquo;t normally have those kinds of tools available. We don&amp;rsquo;t usually add debugging output to our staging deployments, either.&lt;/p&gt;

&lt;p&gt;I could have gone ahead with that route. I would probably have added &lt;code&gt;StackProf::Middleware&lt;/code&gt; to the Rack stack, redeployed, re-run the load test, and dug through the output.&lt;/p&gt;

&lt;p&gt;But instead I found something better.&lt;/p&gt;

&lt;h2 id="profiling-your-code-in-prod"&gt;Profiling your code in prod&lt;/h2&gt;

&lt;p&gt;I remembered that Julia Evans had &lt;a href="https://jvns.ca/categories/ruby-profiler/"&gt;worked on a Ruby profiler a few years ago&lt;/a&gt;. This led me to the fantastic tool she built, &lt;a href="https://rbspy.github.io/"&gt;rbspy&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;What&amp;rsquo;s great about rbspy is that you don&amp;rsquo;t have to change anything about your application itself. You just run rbspy, tell it the PID of your chosen process, and let it profile your service for a while. You re-run your load test to reproduce the problem while the profiler is running. Then you inspect the results.&lt;/p&gt;

&lt;p&gt;The first time I tried to run rbspy, it crashed the web worker container by running out of memory.&lt;/p&gt;

&lt;p&gt;But we increased system resources and I tried again.&lt;/p&gt;

&lt;h2 id="all-bugs-are-shallow-with-the-right-data"&gt;All bugs are shallow with the right data&lt;/h2&gt;

&lt;p&gt;The first rbspy profiler run spotted the problem instantly. (It outputs results into a huge SVG flamechart that you can click around.)&lt;/p&gt;

&lt;p&gt;The random slowness came from &lt;code&gt;getaddrinfo&lt;/code&gt;, the &lt;a href="https://man7.org/linux/man-pages/man3/getaddrinfo.3.html"&gt;libc function that does DNS lookups&lt;/a&gt;. The call site was inside our postgres database connection routine, which is in a library that isn&amp;rsquo;t covered by our standard instrumentation tooling.&lt;/p&gt;

&lt;p&gt;Some of the time, looking up the database hostname was taking ~5 seconds. Our application does a lot of per-request database connection switching (because reasons), so the problem could have affected almost any endpoint. Except the healthcheck.&lt;/p&gt;

&lt;p&gt;I told this to my colleague in devops and he remembered that they had recently changed the hosting setup for our local DNS service (coredns). They rolled back the change and the 5 second delays went away instantly.&lt;/p&gt;

&lt;h2 id="lessons-learned"&gt;Lessons learned&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;All observability tooling has gaps. It&amp;rsquo;s good to fix them, when possible.&lt;/li&gt;
&lt;li&gt;We don&amp;rsquo;t have good observability on how long DNS lookups are taking. Unfortunately, it seems to be nontrivial to instrument them on the DNS client side, because the DNS requests get sent out from getaddrinfo (I think) and there are no standard logs of getaddrinfo request performance. (Devops checked the coredns server metrics and they never showed any problems; but client-side metrics would have revealed the issue instantly.)&lt;/li&gt;
&lt;li&gt;It&amp;rsquo;s good to know the difference between &amp;ldquo;the whole system is under heavy load,&amp;rdquo; &amp;ldquo;a certain endpoint is slow,&amp;rdquo; and &amp;ldquo;a rare resource contention issue is affecting performance intermittently.&amp;rdquo; In this case, it became clearer and clearer that it had to be the last of these possibilities.&lt;/li&gt;
&lt;li&gt;For obscure issues, an automated reproduction script is great, both to help pinpoint the problem and to quickly check if your latest fix is working.&lt;/li&gt;
&lt;li&gt;Sometimes there&amp;rsquo;s no substitute for profiling in prod. (By prod, here, I mean staging. But the point is the same — you can&amp;rsquo;t always analyze problems in the development environment; you might have to analyze them where they occur.)&lt;/li&gt;
&lt;li&gt;Being able to profile a running server with rbspy, &lt;em&gt;without any code changes&lt;/em&gt;, is amazing.&lt;/li&gt;
&lt;li&gt;We still don&amp;rsquo;t know exactly why DNS lookups against coredns were intermittently slow in the first place. We&amp;rsquo;ll figure it out, though.&lt;/li&gt;
&lt;/ul&gt;
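
&lt;p&gt;On the client-side DNS point, a rough probe is possible from Ruby itself. The sketch below times a single lookup through &lt;code&gt;Addrinfo.getaddrinfo&lt;/code&gt;, which wraps the system resolver. The helper name &lt;code&gt;time_lookup&lt;/code&gt; is hypothetical, and this only measures a comparable lookup from the same host; it does not instrument the lookup that the postgres client library performs internally.&lt;/p&gt;

&lt;pre&gt;&lt;code class="ruby"&gt;require "socket"

# Hypothetical helper: resolves a hostname once and reports how long it took.
def time_lookup(hostname)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  addresses = Addrinfo.getaddrinfo(hostname, nil, nil, :STREAM).map { |info| info.ip_address }
  elapsed_ms = (Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000

  { hostname: hostname, addresses: addresses.uniq, elapsed_ms: elapsed_ms }
end

# "localhost" is a placeholder; swap in the database hostname.
result = time_lookup("localhost")
puts format("%s resolved in %.1f ms", result[:hostname], result[:elapsed_ms])
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Run in a loop from a web worker container while the load test is going, something like this might have shown the occasional multi-second lookup directly.&lt;/p&gt;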
</content>
  </entry>
  <entry>
    <title>Identity management in big places</title>
    <link rel="alternate" href="/tech/2024/08/identity-management.html"/>
    <id>/tech/2024/08/identity-management.html</id>
    <published>2024-08-18T12:09:00+00:00</published>
    <updated>2024-08-18T12:09:00+00:00</updated>
    <author>
      <name>Eli Thorkelson</name>
    </author>
    <summary type="html">&lt;p&gt;A thing I’ve noticed about working in a large organization is that you often have to make access requests.&lt;/p&gt;

&lt;p&gt;The access requests are something like: “I have a business reason to be able to use system X. Please grant me access.”&lt;/p&gt;

&lt;p&gt;Usually the end result...&lt;/p&gt;</summary>
    <content type="html">&lt;p&gt;A thing I&amp;rsquo;ve noticed about working in a large organization is that you often have to make access requests.&lt;/p&gt;

&lt;p&gt;The access requests are something like: &amp;ldquo;I have a business reason to be able to use system X. Please grant me access.&amp;rdquo;&lt;/p&gt;

&lt;p&gt;Usually the end result of this request is being added to a group. Sometimes it involves being given a role in some third party system, but most often it just gets you added to a new LDAP/Active Directory group. (Sometimes the LDAP/AD group is set up to imply that you should get a certain role in some third party system.)&lt;/p&gt;

&lt;p&gt;I started thinking: Shouldn&amp;rsquo;t the organization &lt;em&gt;know&lt;/em&gt; which tools I need, given the role I am in?&lt;/p&gt;

&lt;p&gt;A lot of the time, they don&amp;rsquo;t.&lt;/p&gt;

&lt;h2 id="groups-vs-roles"&gt;Groups vs roles&lt;/h2&gt;

&lt;p&gt;Let&amp;rsquo;s distinguish groups from roles.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A group is something that you &lt;em&gt;happen to belong to&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;A role describes &lt;em&gt;your local identity&lt;/em&gt; (as my dictionary puts it, a role is &amp;ldquo;the function assumed or part played in a particular situation.&amp;rdquo;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Groups are always multiple, and you belong to them contingently. At a given time, you can belong to groups A, B and C, and not to D or E, without any special contradiction. (Think of groups in the Unix sense: no systemic logic governs their assignment; you can be in &lt;em&gt;n&lt;/em&gt; different groups at once.)&lt;/p&gt;

&lt;p&gt;Meanwhile, roles are often mutually exclusive. You have Role A &lt;em&gt;or&lt;/em&gt; Role B, but not both simultaneously. You are an administrator &lt;em&gt;or&lt;/em&gt; an end user. You are playing the role of Romeo &lt;em&gt;or&lt;/em&gt; of Juliet, but not both simultaneously.&lt;/p&gt;

&lt;p&gt;(Admittedly, there can be roles that are &lt;em&gt;not&lt;/em&gt; exclusive, but I would argue that mutual exclusivity often does go along with the notion of a role. You are an individual contributor &lt;em&gt;or&lt;/em&gt; you are an engineering manager; if you try to be both, it doesn&amp;rsquo;t usually go well.)&lt;/p&gt;

&lt;p&gt;It&amp;rsquo;s usually easy, in a big organization, to know what someone&amp;rsquo;s role is. At least their &lt;em&gt;main&lt;/em&gt; role that goes on their human resources records.&lt;/p&gt;

&lt;p&gt;Why is this not the primary key for one&amp;rsquo;s system access needs?&lt;/p&gt;

&lt;h2 id="seeing-like-a-state"&gt;Seeing like a state&lt;/h2&gt;

&lt;p&gt;In the abstract, it makes sense to allocate permissions by role. We know what role you are in, so we should be able to give you access accordingly. (By &lt;em&gt;role&lt;/em&gt; in this sense, I mean &amp;ldquo;your primary job role.&amp;rdquo;)&lt;/p&gt;

&lt;p&gt;Somehow, in a big enough organization, this doesn&amp;rsquo;t work well at all.&lt;/p&gt;

&lt;p&gt;I&amp;rsquo;ve never actually read James C. Scott&amp;rsquo;s &lt;a href="https://en.wikipedia.org/wiki/Seeing_Like_a_State"&gt;Seeing Like a State&lt;/a&gt;. But from what I understand, it&amp;rsquo;s all about how top-down visibility never quite works out. Big organizations systematically can&amp;rsquo;t quite see everything that&amp;rsquo;s happening on the ground across all the different territories they govern. When top-down visibility is imposed, it erases local diversity, leaves things out, and makes things awkward or disastrous.&lt;/p&gt;

&lt;p&gt;From the perspective of centralized IT or HR systems, I don&amp;rsquo;t think anyone knows the exact tools that hundreds or thousands of different employees need to use. It&amp;rsquo;s too difficult to infer from someone&amp;rsquo;s role what they will need to access.&lt;/p&gt;

&lt;p&gt;Instead, we adjudicate this question using the &lt;em&gt;access request&lt;/em&gt;.&lt;/p&gt;

&lt;h2 id="access-requests-or-decentralized-access-control"&gt;Access requests, or decentralized access control&lt;/h2&gt;

&lt;p&gt;An access request: &amp;ldquo;I need access to system X because Y.&amp;rdquo; The &lt;em&gt;Y&lt;/em&gt; provides individual context, which is lacking in the top-down role-based model.&lt;/p&gt;

&lt;p&gt;Access requests are evaluated case by case. I&amp;rsquo;m a developer who needs access to production logs to analyze an error. I&amp;rsquo;m a manager who needs access to usage statistics to see how our product rollout is going. I&amp;rsquo;m an on-call engineer who needs access to some infrastructure settings to fix a problem. I want to review another team&amp;rsquo;s source code, which I can&amp;rsquo;t currently see.&lt;/p&gt;

&lt;p&gt;All these kinds of things end up in an access request. The access request is the court of appeals for all the exceptions to the rules, for all the unforeseen needs. It changes the unit of analysis from roles to group memberships.&lt;/p&gt;

&lt;p&gt;In a way, it&amp;rsquo;s great that it works like this. It allows individuals to figure out what they need, and the organization to accommodate. It makes big organizations more flexible, more decentralized.&lt;/p&gt;

&lt;p&gt;But it&amp;rsquo;s also a symptom of the opacity of big organizations. The people who run centralized IT systems don&amp;rsquo;t &lt;em&gt;know&lt;/em&gt; what you need. Why do two developers in the same team need different access permissions? There isn&amp;rsquo;t always a good philosophical answer to that. Everything is context.&lt;/p&gt;

&lt;h2 id="what-then"&gt;What then?&lt;/h2&gt;

&lt;p&gt;As you spend time in this kind of system, your access requests pile up. You can access more things. You belong to more LDAP groups.&lt;/p&gt;

&lt;p&gt;There usually isn&amp;rsquo;t any clear lifecycle for an access request. Once you have it, you have it. You usually aren&amp;rsquo;t asked, &amp;ldquo;Do you still need this?&amp;rdquo; You might end up retaining access to things you don&amp;rsquo;t still need. When you change roles, you might still keep some of your old permissions, because permissions weren&amp;rsquo;t tied to your role in the first place.&lt;/p&gt;

&lt;p&gt;From a security policy perspective, it&amp;rsquo;s bad to have group memberships that don&amp;rsquo;t necessarily expire. That adds unneeded risk.&lt;/p&gt;
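
&lt;p&gt;As a thought experiment, here is what a membership with a built-in expiry might look like. This is a sketch only: the &lt;code&gt;Membership&lt;/code&gt; struct, the user name, and the group names are all invented for illustration, and real directory services model this differently.&lt;/p&gt;

&lt;pre&gt;&lt;code class="ruby"&gt;# Hypothetical model: a group membership that carries its own expiry time.
Membership = Struct.new(:user, :group, :expires_at, keyword_init: true) do
  # A membership is active as long as its expiry is still in the future.
  def active?(now = Time.now)
    (expires_at - now).positive?
  end
end

DAY = 24 * 60 * 60 # seconds
now = Time.now

memberships = [
  Membership.new(user: "dev1", group: "prod-logs", expires_at: now + 90 * DAY),
  Membership.new(user: "dev1", group: "infra-settings", expires_at: now - DAY)
]

# Split into memberships to keep and memberships that are due for review.
active, expired = memberships.partition { |m| m.active?(now) }

puts "active:  #{active.map { |m| m.group }.join(", ")}"
puts "expired: #{expired.map { |m| m.group }.join(", ")}"
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The expired list is the interesting part: it is the natural place to ask &amp;ldquo;Do you still need this?&amp;rdquo;&lt;/p&gt;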

&lt;p&gt;All the access requests also take time, which you don&amp;rsquo;t notice, because it&amp;rsquo;s spread out over months or years.&lt;/p&gt;

&lt;p&gt;But you start to notice the inefficiency of this when someone new shows up and wants to use the tools you have. &amp;ldquo;Oh, you have the same role as me? But if you want access to the tools I use, you have to make 17 separate access requests, and we don&amp;rsquo;t even remember how to submit all the access requests&amp;hellip;&amp;rdquo;&lt;/p&gt;

&lt;h2 id="whats-the-moral-of-this"&gt;What&amp;rsquo;s the moral of this?&lt;/h2&gt;

&lt;p&gt;Access control is an area where we have to compromise between top-down visibility and local autonomy. (It rarely works to create top-down access control systems that are too rigid, because if people can&amp;rsquo;t do their jobs because they can&amp;rsquo;t get access to the right systems, they tend to look for workarounds.)&lt;/p&gt;

&lt;p&gt;When I started writing this, I imagined that permissions ideally should be top-down, and purely role based.&lt;/p&gt;

&lt;p&gt;Now that I&amp;rsquo;m at the end, I feel like I understand more clearly why we need autonomy and case-by-case evaluations.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>The perils of IPv6, or, why I got rid of my virtual private server</title>
    <link rel="alternate" href="/tech/2024/01/the-perils-of-ipv6.html"/>
    <id>/tech/2024/01/the-perils-of-ipv6.html</id>
    <published>2024-01-21T09:20:00+00:00</published>
    <updated>2024-01-21T09:20:00+00:00</updated>
    <author>
      <name>Eli Thorkelson</name>
    </author>
    <summary type="html">&lt;p&gt;I mentioned last year that &lt;a href="https://decasia.org/tech/2023/05/downsizing-servers.html"&gt;I “downsized” my web projects from DigitalOcean to super cheap EC2&lt;/a&gt;. Naturally, a few months later AWS announced that they would start charging a few dollars per month for a dedicated IPv4 address.&lt;/p&gt;

&lt;p&gt;I &lt;a href="https://news.ycombinator.com/item?id=36992387"&gt;asked around on Hacker...&lt;/a&gt;&lt;/p&gt;</summary>
    <content type="html">&lt;p&gt;I mentioned last year that &lt;a href="https://decasia.org/tech/2023/05/downsizing-servers.html"&gt;I &amp;ldquo;downsized&amp;rdquo; my web projects from DigitalOcean to super cheap EC2&lt;/a&gt;. Naturally, a few months later AWS announced that they would start charging a few dollars per month for a dedicated IPv4 address.&lt;/p&gt;

&lt;p&gt;I &lt;a href="https://news.ycombinator.com/item?id=36992387"&gt;asked around on Hacker News&lt;/a&gt; and decided that the simplest solution was to put the site on Cloudflare&amp;rsquo;s free proxy plan, and then update the web server to only use IPv6, thus saving on the IPv4 costs. The way it works is this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You import your DNS records into Cloudflare.&lt;/li&gt;
&lt;li&gt;Cloudflare inspects your A and/or AAAA records to figure out how to find your back end service.&lt;/li&gt;
&lt;li&gt;Cloudflare then answers with its own A and AAAA proxy addresses when asked to resolve your host.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key point is that you can have an upstream service that is only accessible via IPv6, and Cloudflare helpfully still proxies traffic to it through its set of IPv4 addresses, maintaining compatibility for legacy IPv4 clients.&lt;/p&gt;
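
&lt;p&gt;One way to see this from the outside is to ask the resolver which address families a hostname comes back with. Here&amp;rsquo;s a minimal Python sketch using only the standard library. The hostname is a placeholder and &lt;code&gt;resolve_by_family&lt;/code&gt; is a hypothetical helper name, not anything Cloudflare provides. For a proxied site, you&amp;rsquo;d expect both lists to show Cloudflare&amp;rsquo;s addresses rather than your server&amp;rsquo;s.&lt;/p&gt;

```python
import socket


def resolve_by_family(hostname, port=443):
    """Return the IPv4 and IPv6 addresses that hostname resolves to.

    The "A" and "AAAA" labels are shorthand: getaddrinfo may also
    consult local sources such as /etc/hosts, not only DNS.
    """
    results = {"A": [], "AAAA": []}
    for family, label in ((socket.AF_INET, "A"), (socket.AF_INET6, "AAAA")):
        try:
            infos = socket.getaddrinfo(
                hostname, port, family=family, type=socket.SOCK_STREAM
            )
        except socket.gaierror:
            continue  # nothing found for this address family
        # info[4] is the sockaddr tuple; its first item is the address.
        results[label] = sorted({info[4][0] for info in infos})
    return results


if __name__ == "__main__":
    # Placeholder hostname; substitute a site you have put behind the proxy.
    print(resolve_by_family("www.example.com"))
```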

&lt;p&gt;That all sounded pretty straightforward, but I still learned a few things from my upgrade experience:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I&amp;rsquo;m bad at network configuration.&lt;/li&gt;
&lt;li&gt;Cloudflare is kind of opaque.&lt;/li&gt;
&lt;li&gt;AWS makes it really hard to support IPv4 to IPv6 transitions for running EC2 instances.&lt;/li&gt;
&lt;li&gt;In the end, it&amp;rsquo;s not really worth it.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id="how-to-assign-an-ipv6-to-your-ec2-instance"&gt;How to assign an IPv6 address to your EC2 instance&lt;/h2&gt;

&lt;p&gt;There is a field in the EC2 instance properties that indicates your IPv6 address. Mine was blank.&lt;/p&gt;

&lt;p&gt;I hadn&amp;rsquo;t really understood that if you create an EC2 instance, it comes with an entire virtual private networking setup. There&amp;rsquo;s something so &lt;em&gt;easy&lt;/em&gt; about old-school IPv4 networking. Your server gets a public IP address, it can reach the whole public internet with zero complexity, you advertise your IP address in a DNS A record, and you&amp;rsquo;re done.&lt;/p&gt;

&lt;p&gt;IPv6 isn&amp;rsquo;t like that, because huge numbers of clients still just don&amp;rsquo;t support IPv6. And the AWS upgrade path from IPv4 to IPv6 is thorny.&lt;/p&gt;

&lt;p&gt;The upgrade path to enabling IPv6 is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Assign a /56 prefix to your VPC (virtual private cloud); that is the size of the Amazon-provided block. I guess an IPv6 address range isn&amp;rsquo;t allocated by default.&lt;/li&gt;
&lt;li&gt;Assign a /64 prefix to your subnet.&lt;/li&gt;
&lt;li&gt;Then you can enable IPv6 address assignment from your EC2 instance&amp;rsquo;s network interface settings.&lt;/li&gt;
&lt;/ul&gt;
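
&lt;p&gt;For a sense of scale, here&amp;rsquo;s a small Python sketch of the prefix arithmetic, assuming an Amazon-provided /56 for the VPC and a /64 for each subnet. The prefix below comes from the range reserved for documentation, so it&amp;rsquo;s a stand-in for whatever AWS would actually allocate.&lt;/p&gt;

```python
import ipaddress

# Stand-in prefix from 2001:db8::/32, which is reserved for documentation.
# AWS would hand you a real /56 from its own address space instead.
vpc_block = ipaddress.ip_network("2001:db8:1234:1a00::/56")

# Each subnet takes one /64 out of the VPC block.
subnets = list(vpc_block.subnets(new_prefix=64))

print(len(subnets))                       # 256
print(subnets[0])                         # 2001:db8:1234:1a00::/64
print(subnets[-1])                        # 2001:db8:1234:1aff::/64
print(subnets[0].num_addresses == 2**64)  # True
```

&lt;p&gt;So one VPC block has room for 256 subnets, and a single subnet holds far more addresses than one small web server will ever need.&lt;/p&gt;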

&lt;p&gt;In general, I got the impression from AWS&amp;rsquo;s own blog posts that they only support &amp;ldquo;dual stack&amp;rdquo; networking as a sort of awkward transitional path.&lt;/p&gt;

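&lt;p&gt;&lt;a href="https://aws.amazon.com/blogs/networking-and-content-delivery/dual-stack-ipv6-architectures-for-aws-and-hybrid-networks/"&gt;https://aws.amazon.com/blogs/networking-and-content-delivery/dual-stack-ipv6-architectures-for-aws-and-hybrid-networks/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://superuser.com/questions/1801957/how-to-turn-on-ipv6-on-amazon-ec2-instance"&gt;https://superuser.com/questions/1801957/how-to-turn-on-ipv6-on-amazon-ec2-instance&lt;/a&gt;&lt;/p&gt;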

&lt;h2 id="after-you-finish-getting-an-ipv6-address"&gt;After you finish getting an IPv6 address&lt;/h2&gt;

&lt;p&gt;You have to make sure that every service that listens on a network port is also listening on the IPv6 interface. In my setup, NGINX and sshd were only listening on IPv4.&lt;/p&gt;

&lt;p&gt;If you try removing the IPv4 address from your virtual server, and you didn&amp;rsquo;t figure out that sshd needs reconfiguring, then you can&amp;rsquo;t ssh to your instance anymore. (It&amp;rsquo;s not difficult to log in with the remote web console and fix it. It&amp;rsquo;s just annoying. It&amp;rsquo;s death by a thousand cuts, and by death I mean nuisance.)&lt;/p&gt;
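
&lt;p&gt;In hindsight, a quick check before removing the IPv4 address would have saved the trip to the web console. Here&amp;rsquo;s a minimal Python sketch of that kind of check: try a TCP connection over each address family and see which ones succeed. The helper name &lt;code&gt;can_connect&lt;/code&gt; is hypothetical, and the demo fakes an IPv4-only service on the loopback interface instead of touching a real server.&lt;/p&gt;

```python
import socket


def can_connect(host, port, family, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds over the
    given address family (socket.AF_INET or socket.AF_INET6)."""
    try:
        with socket.socket(family, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            sock.connect((host, port))
        return True
    except OSError:
        # Covers refused connections, timeouts, and missing IPv6 support.
        return False


if __name__ == "__main__":
    # Simulate a service that was only bound to the IPv4 loopback address.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]

    print("IPv4:", can_connect("127.0.0.1", port, socket.AF_INET))
    print("IPv6:", can_connect("::1", port, socket.AF_INET6))
    server.close()
```

&lt;p&gt;Against a real instance, you would pass its public IPv6 address and port 22 or 443. A failed check can also mean that a firewall or security group rule is blocking IPv6, not only that the service isn&amp;rsquo;t listening.&lt;/p&gt;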

&lt;p&gt;Anyway, I eventually got my webserver reconfigured to use only IPv6. If I were doing it again, I would just make a new instance that was IPv6 by default, and delete the old one.&lt;/p&gt;

&lt;p&gt;But that wasn&amp;rsquo;t the bad part.&lt;/p&gt;

&lt;p&gt;The bad part was: No matter what I did, I could not figure out how to delete all the IPv4 virtual private networking components from AWS. For something like 4 months, I kept paying the useless IPv4 fee, because I just could not figure out how to avoid it.&lt;/p&gt;

&lt;p&gt;I probably should have just asked my work colleagues in devops what to do. But it really &lt;em&gt;seemed&lt;/em&gt; like it should not be that hard.&lt;/p&gt;

&lt;h2 id="in-the-end"&gt;In the end&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;I started writing this post last January and only got around to publishing it in August. Oops.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;There&amp;rsquo;s a point of diminishing returns for personal projects. Eventually you can run out of spare energy for them.&lt;/p&gt;

&lt;p&gt;I deleted my EC2 instance and moved all the websites to static S3 buckets.&lt;/p&gt;

&lt;p&gt;There was only one dynamic web application left on my server. It was only for a niche &lt;a href="https://rip-my-academic-career.decasia.org"&gt;digital art project&lt;/a&gt;, so I just left it broken. It didn&amp;rsquo;t seem worth migrating it to use a serverless architecture (AWS Lambda).&lt;/p&gt;

&lt;p&gt;My AWS bills went down to like 25 cents per month. So in the end, it&amp;rsquo;s more than a 20x cost savings over the $6 DigitalOcean box I used to have.&lt;/p&gt;

&lt;p&gt;The downside is that I don&amp;rsquo;t have a virtual Linux box anymore. But at the same time, it&amp;rsquo;s a relief not to have to do maintenance or system updates anymore.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Software and softness</title>
    <link rel="alternate" href="/tech/2023/09/software-and-softness.html"/>
    <id>/tech/2023/09/software-and-softness.html</id>
    <published>2023-09-16T07:45:00+00:00</published>
    <updated>2023-09-16T07:45:00+00:00</updated>
    <author>
      <name>Eli Thorkelson</name>
    </author>
    <summary type="html">&lt;p&gt;It’s funny that we build “software” and yet, so much of the time, our technical communities do not particularly value &lt;em&gt;softness&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;How do we explain this?&lt;/p&gt;

&lt;h2 id="what-makes-software-soft"&gt;What makes software soft?&lt;/h2&gt;

&lt;p&gt;What makes software “soft”? It’s hard to find an adequate account of...&lt;/p&gt;</summary>
    <content type="html">&lt;p&gt;It&amp;rsquo;s funny that we build &amp;ldquo;software&amp;rdquo; and yet, so much of the time, our technical communities do not particularly value &lt;em&gt;softness&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;How do we explain this?&lt;/p&gt;

&lt;h2 id="what-makes-software-soft"&gt;What makes software soft?&lt;/h2&gt;

&lt;p&gt;What makes software &amp;ldquo;soft&amp;rdquo;? It&amp;rsquo;s hard to find an adequate account of this because language evolves gradually, following obscure histories. But the history of the word is somewhat illuminating.&lt;/p&gt;

&lt;p&gt;The term &lt;em&gt;software&lt;/em&gt; is defined in opposition to &lt;em&gt;hardware&lt;/em&gt;. &amp;ldquo;Hardware&amp;rdquo; is an old word that historically has not had too much philosophical baggage, as far as I can tell: in the sense of &amp;ldquo;small metal items,&amp;rdquo; the word entered the English language in the early modern period, when it meant &amp;ldquo;&lt;em&gt;ware&lt;/em&gt; (such as fittings, cutlery, tools, utensils, or parts of machines) made of &lt;em&gt;metal&lt;/em&gt;&amp;rdquo; (per &lt;a href="https://www.merriam-webster.com/dictionary/hardware#dictionary-entry-1"&gt;Merriam-Webster&lt;/a&gt;). The term dates as far back as 1419, while the related term &amp;ldquo;hardware store&amp;rdquo; entered circulation by 1789.&lt;/p&gt;

&lt;p&gt;The term &lt;em&gt;software&lt;/em&gt;, by contrast, is a decidedly 20th century invention. Some say it first appeared in print in 1958 in an article by statistician John W. Tukey. Tukey wrote:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Today the “software” comprising the carefully planned interpretive routines, compilers, and other aspects of automative programming are at least as important to the modern electronic calculator as its “hardware” of tubes, transistors, wires, tapes and the like. (&lt;a href="https://www.sciencenews.org/article/softwares-origin"&gt;src&lt;/a&gt;)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You can see the scare quotes suggesting that at the time, this terminology was a neologism, not standard usage.&lt;/p&gt;

&lt;p&gt;It&amp;rsquo;s unlikely that Tukey had coined the word. The engineer Paul Niquette, in an odd &lt;a href="http://www.niquette.com/books/softword/part2.htm"&gt;online autobiography&lt;/a&gt; (&lt;a href="https://web.archive.org/web/20190808133200/http://www.niquette.com/books/softword/part2.htm"&gt;archive&lt;/a&gt;), recounts having discovered the term &amp;ldquo;software&amp;rdquo; as early as 1953, while working on the early computer &lt;a href="https://en.wikipedia.org/wiki/SWAC_(computer)"&gt;SWAC&lt;/a&gt;. He says he had the following &amp;ldquo;epiphany&amp;rdquo;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I was thinking to myself that I wanted nothing to do with the SWAC &amp;ldquo;hardware&amp;rdquo; &amp;ndash; that the machine was the mindless means for executing my programs &amp;ndash; a necessary evil, mostly evil. It was about at that moment, I seized upon the consummate reality of what I was doing &amp;ndash; that what I was doing was sharply different from what [the hardware maintainer] Dr. Whitcomb was doing &amp;ndash; that what I was doing was writing on a coding sheet, not plugging jacks into sockets, not clipping leads onto terminal posts, not soldering wires, not bending relay contacts, not replacing vacuum tubes. What I was doing was writing on a coding sheet! The exclamation point was right there in my thought back then and in my memory now.&lt;/p&gt;

&lt;p&gt;It was October 1953 and I was experiencing an epiphany. Before my eyes, I saw my own markings carefully scrawled inside printed blocks on the coding sheet. They comprised numerical &amp;ldquo;words&amp;rdquo; &amp;ndash; the only vocabulary the computer could understand. My coded words were not anything like those other things &amp;ndash; those machine things, those &amp;ldquo;hardware&amp;rdquo; things. I could write down numerical words &amp;ndash; right or wrong &amp;ndash; and after they were punched into cards and fed into the reader, the SWAC would be commanded to perform my mandated operations in exactly the sequence I had written them &amp;ndash; right or wrong.&lt;/p&gt;

&lt;p&gt;The written codes &amp;ndash; my written codes &amp;ndash; had absolute power over Dr. Whitcomb&amp;rsquo;s &amp;ldquo;hardware.&amp;rdquo; Then too, I could erase what I had written down and write down something different, then punch a new card and insert it into the deck. The SWAC, slavishly obedient in its hardware ways, would then be commanded to do my work differently &amp;ndash; to do different work entirely, in fact. The writing on the coding sheet was changeable; it was decidedly not hardware. It was &amp;ndash; well, it was &amp;ldquo;soft-ware.&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This terminology was new in the 1950s, but it emerged from a longstanding current in intellectual history. The software/hardware divide is arguably just a new permutation of a long-standing &lt;a href="https://iep.utm.edu/dualism-and-mind/"&gt;dualism&lt;/a&gt; in European philosophy, according to which there is a radical difference between mind and matter, the ideal and the physical.&lt;/p&gt;

&lt;p&gt;You don&amp;rsquo;t have to read too much Plato to see how strongly rooted this kind of dualist view can be. And it usually comes with a preference for one side over the other: the ideal gets valorized, the merely physical gets put down. That&amp;rsquo;s what Niquette was doing when he declared that software had &amp;ldquo;absolute power&amp;rdquo; over the hardware, which was &amp;ldquo;slavishly obedient&amp;rdquo; to the software&amp;rsquo;s &amp;ldquo;commands&amp;rdquo;. Mind over matter.&lt;/p&gt;

&lt;p&gt;At the same time, in an interesting wrinkle, software was imagined as metaphorically soft because it was more &amp;ldquo;changeable&amp;rdquo; than the hardware. It&amp;rsquo;s easy to think new thoughts and write new code, while it&amp;rsquo;s hard to wire up new circuits: that&amp;rsquo;s the argument.&lt;/p&gt;

&lt;p&gt;Does it make any sense, though?&lt;/p&gt;

&lt;h2 id="two-kinds-of-hardness"&gt;Two kinds of &amp;ldquo;hardness&amp;rdquo;&lt;/h2&gt;

&lt;p&gt;Hardness means several quite different (orthogonal) things, two of which tend to get mashed together in talk about software vs hardware.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Hard as in &lt;em&gt;metal&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Hard as in &lt;em&gt;difficult&lt;/em&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;But these are not the same. At all.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Physical&lt;/em&gt; hardness is something you test in an engineering lab, &lt;a href="https://www.unsw.edu.au/science/our-schools/materials/engage-with-us/high-school-students-and-teachers/online-tutorials/materials-testing/hardness-testing"&gt;for example&lt;/a&gt; by applying a known amount of force to a surface and measuring the resulting indentation. Metals tend to be hard in this purely physical sense (though they aren&amp;rsquo;t always).&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Difficulty&lt;/em&gt; meanwhile has nothing to do with physical properties alone, and has everything to do with the &lt;em&gt;relationship&lt;/em&gt; between an acting subject and a practical task. What&amp;rsquo;s easy to one person can be difficult to someone else; difficulty is relative to your capacities to solve a given problem in the world.&lt;/p&gt;

&lt;p&gt;Dumb example: swimming can be very hard, experientially, even though nothing about the &lt;em&gt;water&lt;/em&gt; is physically hard.&lt;/p&gt;

&lt;p&gt;This suggests to me that it doesn&amp;rsquo;t make much sense to assume that &amp;ldquo;software,&amp;rdquo; by virtue of being a set of computer procedures and not a bag of bolts and wires, has any intrinsic softness in the sense of malleability, ease of change.&lt;/p&gt;

&lt;h2 id="software-isnt-easy-to-change-actually"&gt;Software isn&amp;rsquo;t easy to change, actually&lt;/h2&gt;

&lt;p&gt;I would go farther: it&amp;rsquo;s just false that software is easier to change than hardware. Niquette argued that &amp;ldquo;The writing on the coding sheet was changeable; it was decidedly not hardware.&amp;rdquo; But all this shows is that changing the code was easy &lt;em&gt;for him&lt;/em&gt;. Who&amp;rsquo;s to say what was easier for the hardware specialist Dr. Whitcomb across the room?&lt;/p&gt;

&lt;p&gt;In truth, even for software professionals, &amp;ldquo;software&amp;rdquo; often &lt;em&gt;isn&amp;rsquo;t&lt;/em&gt; very easy to change. On the contrary. Software problems can be absolutely intractable. Programs are complex systems that can seem to resist your efforts to alter them. All working software developers have had the experience of finding that a simple change is impossible to implement in the time available.&lt;/p&gt;

&lt;p&gt;Software can be hard.&lt;/p&gt;

&lt;p&gt;The people who talk about how easy it is to change the software&amp;hellip; tend to be people who just happen to be &lt;em&gt;good at software and not especially good at hardware&lt;/em&gt;. I&amp;rsquo;m not great at hardware - I&amp;rsquo;ve never built a computer from scratch (though that would be fun). I can splice a wire and build a thing or two with an Arduino and that&amp;rsquo;s about it. Hardware is hard &lt;em&gt;for me&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;But this is saying more about my own particular expertise than anything else.&lt;/p&gt;

&lt;p&gt;Meanwhile, I&amp;rsquo;ve met electricians who can wire all kinds of amazing hardware from scratch, and would never be able to write a computer program. Software would be super hard &lt;em&gt;for them&lt;/em&gt;. Nothing &amp;ldquo;easy to modify&amp;rdquo; about it.&lt;/p&gt;

&lt;h2 id="the-kind-of-hardness-that-software-people-like"&gt;The kind of hardness that software people like&lt;/h2&gt;

&lt;p&gt;As it turns out, software people seem to really &lt;em&gt;like&lt;/em&gt; hard problems. There&amp;rsquo;s even a prestige associated with working on clever algorithms, fancy architecture, large-scale systems and loads. There&amp;rsquo;s a corresponding disdain for &amp;ldquo;easy&amp;rdquo; problems: CRUD apps, small-scale projects, following existing patterns.&lt;/p&gt;

&lt;p&gt;Certain kinds of difficulty, of course, are preferred over others. Logical and computational difficulty are valued; organizational, political or emotional difficulty, often much less so. In this, we&amp;rsquo;re still a field that&amp;rsquo;s captive to the nerdy technocratic values of 1960s engineering culture.&lt;/p&gt;

&lt;p&gt;I just started reading Emily Chang&amp;rsquo;s &lt;a href="https://www.penguinrandomhouse.com/books/547571/brotopia-by-emily-chang/"&gt;Brotopia&lt;/a&gt;. It explains in detail how this nerdy culture was partly &lt;em&gt;invented&lt;/em&gt; by psychologists, to the great detriment of women in tech. I&amp;rsquo;ll have to write more about that sometime.&lt;/p&gt;

&lt;p&gt;In any case, there&amp;rsquo;s something more than a bit gendered about what&amp;rsquo;s valued in software culture and what&amp;rsquo;s not. For example, we don&amp;rsquo;t give as much respect to &amp;ldquo;soft skills&amp;rdquo; as we could, and these have a historical association with femininity. Skills such as reading the room, relationship-building, empathy, caretaking, and sociability are not highly valued by the nerdy side of programming culture.&lt;/p&gt;

&lt;p&gt;It&amp;rsquo;s sometimes assumed that these interpersonal skills are also &amp;ldquo;soft&amp;rdquo; in the sense of being easy and nontechnical. They can be cast as something that programmers can disdain. They probably shouldn&amp;rsquo;t be called &amp;ldquo;soft&amp;rdquo; in the first place; we can distinguish math skills from social skills without alleging that one is more technical, harder and more prestigious than the other. Skills are just skills; there is no clear hierarchy of them. And it seems to me that all the &amp;ldquo;soft skills&amp;rdquo; are good for everybody, of whatever gender, and are not necessarily easy to acquire either.&lt;/p&gt;

&lt;p&gt;There may be other kinds of softness to think about as well: I&amp;rsquo;m not sure how to think about this systematically. Maybe we should not be so quick to use softness and hardness as metaphors for things in the world in the first place.&lt;/p&gt;

&lt;p&gt;But I would also say that there is nothing shameful about softness. To the extent that it is used to describe valuable things in the world, I wish it were more highly valued in the software field.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>How to write your own Jira client and suffer slightly less</title>
    <link rel="alternate" href="/tech/2023/08/jira.html"/>
    <id>/tech/2023/08/jira.html</id>
    <published>2023-08-22T07:59:00+00:00</published>
    <updated>2023-08-22T07:59:00+00:00</updated>
    <author>
      <name>Eli Thorkelson</name>
    </author>
    <summary type="html">&lt;p&gt;So, Jira.&lt;/p&gt;

&lt;p&gt;I do not love it.&lt;/p&gt;

&lt;h2 id="preliminary-concessions"&gt;Preliminary concessions&lt;/h2&gt;

&lt;p&gt;Admittedly, Jira does some things well.&lt;/p&gt;

&lt;p&gt;It’s a good system of record. We don’t &lt;em&gt;delete&lt;/em&gt; anything from it, so you can check project histories from years ago. It tends to keep records of things we...&lt;/p&gt;</summary>
    <content type="html">&lt;p&gt;So, Jira.&lt;/p&gt;

&lt;p&gt;I do not love it.&lt;/p&gt;

&lt;h2 id="preliminary-concessions"&gt;Preliminary concessions&lt;/h2&gt;

&lt;p&gt;Admittedly, Jira does some things well.&lt;/p&gt;

&lt;p&gt;It&amp;rsquo;s a good system of record. We don&amp;rsquo;t &lt;em&gt;delete&lt;/em&gt; anything from it, so you can check project histories from years ago. It tends to keep records of things we didn&amp;rsquo;t get around to doing, so if we forgot to do X, Jira might remind us later. It&amp;rsquo;s OK at managing distributed organizational processes like multi-team approvals, or requests to other teams.&lt;/p&gt;

&lt;p&gt;Above all, it provides a kind of &lt;em&gt;visibility&lt;/em&gt; for management into the engineering process. We use it for reporting high-level project status to our chain of command, and for release management. It&amp;rsquo;s good at tracking &amp;ldquo;In what release will XYZ get released to customers?&amp;rdquo;&lt;/p&gt;

&lt;h2 id="why-i-dont-love-jira"&gt;Why I don&amp;rsquo;t love Jira&lt;/h2&gt;

&lt;p&gt;First, Jira has a really bad user interface, compared to something like Asana. I remember what good UIs feel like. They feel enjoyable, quick to navigate, discoverable, intuitive. Jira is configurable and extensible and highly integrated with other systems, but in my opinion, it has absolutely awful UX and UI. This means that a big chunk of my workday involves bad UX and bad UI.&lt;/p&gt;

&lt;p&gt;Worse, it is horrendously slow. It takes &lt;em&gt;forever&lt;/em&gt; (in our environment) to create a new Jira ticket (I should time it; sometimes it seems like it&amp;rsquo;s 20+ seconds). It&amp;rsquo;s long enough to get bored and want to change focus to something else. Making a new ticket once is bad; making 20 Jira tickets at once is an exceptionally tedious activity, of a kind that software developers should not have to suffer through.&lt;/p&gt;

&lt;p&gt;Finally, Jira is bad at some of the things you would &lt;em&gt;expect&lt;/em&gt; it to be really good at. For example, Jira is really bad at managing tasks and todos. It&amp;rsquo;s bad at managing workload. It&amp;rsquo;s even bad at managing &lt;em&gt;projects&lt;/em&gt;.&lt;/p&gt;

&lt;h2 id="jira-is-bad-at-tasks"&gt;Jira is bad at tasks&lt;/h2&gt;

&lt;p&gt;It&amp;rsquo;s too slow and tedious to put every single thing you need to do into Jira. As a result, most people don&amp;rsquo;t do that. Some people use Microsoft To Do instead, which is single-user task tracking software that actually works fairly well. It has a clean interface, and it&amp;rsquo;s a desktop app with desktop app-quality performance.&lt;/p&gt;

&lt;p&gt;A lot of people use todos in Google Docs or in Confluence. It&amp;rsquo;s a lot more lightweight – you can tag a @username and a checkbox will just auto-appear.&lt;/p&gt;

&lt;p&gt;Honestly, sometimes I just make todos using macOS&amp;rsquo;s Stickies or a plain text doc. I also love Slack&amp;rsquo;s reminders for lightweight todos.&lt;/p&gt;

&lt;p&gt;It&amp;rsquo;s not just individual workflow tasks that don&amp;rsquo;t fit nicely in Jira: it can be team workflow stuff too.&lt;/p&gt;

&lt;h2 id="jira-is-bad-at-workload-planning"&gt;Jira is bad at workload planning&lt;/h2&gt;

&lt;p&gt;We&amp;rsquo;re constantly trying to plan our workload from one sprint to the next. (Everything is divided into sprints, even though we don&amp;rsquo;t really use agile processes in most other ways.)&lt;/p&gt;

&lt;p&gt;You can assign your Jira tickets to a sprint and give them a workload estimate. I&amp;rsquo;ve found that Jira sprint plans are a mixed blessing. They&amp;rsquo;re too granular, for one thing. And yet there&amp;rsquo;s a lot that we never capture in a Jira ticket: the time cost of going to meetings, the time cost of &lt;em&gt;updating Jira&lt;/em&gt;. We end up getting pulled into new questions and discussions from one day to the next, in nonlinear ways that defy the plans again and again.&lt;/p&gt;

&lt;p&gt;But no one can realistically plan their sprints more than 1 or 2 at a time. So if you want to know &amp;ldquo;Who will be available to start a new project in 8 weeks,&amp;rdquo; Jira is useless.&lt;/p&gt;

&lt;p&gt;Sometimes we just use a Google sheet for workload planning.&lt;/p&gt;

&lt;h2 id="jira-is-bad-at-project-management"&gt;Jira is bad at project management&lt;/h2&gt;

&lt;p&gt;I&amp;rsquo;ve noticed that when my organization needs actual large-scale project management software, we just use Smartsheet. It&amp;rsquo;s just a much better tool that models the project management problem better. I see why skilled project managers prefer it.&lt;/p&gt;

&lt;h2 id="voting-with-your-feet"&gt;&amp;ldquo;Voting with your feet&amp;rdquo;&lt;/h2&gt;

&lt;p&gt;In short, for a lot of problems, Jira is a bad tool, &lt;em&gt;even when it theoretically provides the right feature set for the task.&lt;/em&gt; So people vote with their feet: they just skip Jira where it&amp;rsquo;s not effective and use something else.&lt;/p&gt;

&lt;p&gt;This is a good strategy. I endorse it (when practicable).&lt;/p&gt;

&lt;p&gt;But sometimes Jira can&amp;rsquo;t be ignored.&lt;/p&gt;

&lt;p&gt;What are our options then?&lt;/p&gt;

&lt;h2 id="lets-write-our-own-jira-client"&gt;Let&amp;rsquo;s write our own Jira client&lt;/h2&gt;

&lt;p&gt;So here&amp;rsquo;s the thing about Jira. It has reasonable JSON-based &lt;a href="https://docs.atlassian.com/jira-software/REST/9.10.0/"&gt;REST APIs&lt;/a&gt; and an excellent SQL-like query language, &lt;a href="https://support.atlassian.com/jira-software-cloud/docs/use-advanced-search-with-jira-query-language-jql/"&gt;JQL&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;And this means &lt;strong&gt;you don&amp;rsquo;t have to suffer through all the bad parts of Jira.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To be clear, you can&amp;rsquo;t fix the Jira &lt;em&gt;workflow&lt;/em&gt; for yourself. If that&amp;rsquo;s what you hate, I can&amp;rsquo;t help you.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;But&lt;/em&gt; it&amp;rsquo;s absolutely possible to fix the awful UX and dodge the disastrous slowness of the UI.&lt;/p&gt;

&lt;p&gt;We are &lt;em&gt;software developers&lt;/em&gt;, are we not? If someone asks us to use a horrible interface all day, and there is an easy workaround, shouldn&amp;rsquo;t we just &amp;hellip; use it?&lt;/p&gt;

&lt;p&gt;I ended up writing my own Jira client. It&amp;rsquo;s a native macOS app that connects to Jira, fetches useful data about tickets I care about, and displays precisely the fields that I want to see. It uses SwiftUI, so it has native app performance and uses native app widgets. I added some custom functionality that I want (e.g. it can track my current priorities if I have a lot on my plate, and it can store notes on tickets that are private to me).&lt;/p&gt;

&lt;p&gt;It&amp;rsquo;s read only (for now). I don&amp;rsquo;t want to rewrite the whole Jira UI. I just want to improve the UX for things I do a lot.&lt;/p&gt;

&lt;p&gt;Here are some things I often want to do quickly:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Finding tickets currently assigned to me.&lt;/li&gt;
&lt;li&gt;Finding tickets that I &lt;em&gt;created&lt;/em&gt; but am not &lt;em&gt;assigned to&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Finding all the tickets I&amp;rsquo;ve &lt;em&gt;commented on&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Finding tickets that I have &lt;em&gt;closed&lt;/em&gt; in the past.&lt;/li&gt;
&lt;/ol&gt;
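
&lt;p&gt;Lookups like these map fairly directly onto JQL. The sketch below is in Python rather than Swift and is not the code from my app; it only builds search URLs, so it runs without a Jira server. The base URL is a placeholder, the status name &lt;code&gt;Closed&lt;/code&gt; depends on your workflow, and the &lt;code&gt;/rest/api/2/search&lt;/code&gt; path assumes a Jira Server style install. As far as I know there is no built-in JQL field for &amp;ldquo;tickets I commented on,&amp;rdquo; so the sketch uses watched issues as a rough proxy, on the assumption that Jira is set to auto-watch the issues you comment on.&lt;/p&gt;

```python
from urllib.parse import urlencode

# Placeholder base URL; replace with your own Jira server.
JIRA_BASE_URL = "https://jira.example.com"

# JQL for the four lookups. The status name "Closed" is an assumption:
# your workflow may call it "Done" or "Resolved" instead.
QUERIES = {
    "assigned_to_me": "assignee = currentUser() AND resolution = Unresolved",
    "created_not_assigned": (
        "reporter = currentUser() "
        "AND (assignee is EMPTY OR assignee != currentUser())"
    ),
    # Rough proxy for "commented on": issues I am watching.
    "watching": "watcher = currentUser()",
    "closed_by_me": 'status CHANGED TO "Closed" BY currentUser()',
}


def search_url(jql, fields=("key", "summary", "status"), max_results=50):
    """Build a GET URL for the issue search endpoint."""
    params = {
        "jql": jql + " ORDER BY updated DESC",
        "fields": ",".join(fields),
        "maxResults": max_results,
    }
    return JIRA_BASE_URL + "/rest/api/2/search?" + urlencode(params)


if __name__ == "__main__":
    for name, jql in QUERIES.items():
        print(name, search_url(jql))
```

&lt;p&gt;Requesting only a handful of fields keeps the responses small, which is a big part of why a custom viewer can feel so much faster than the full web UI.&lt;/p&gt;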

&lt;p&gt;It&amp;rsquo;s a lot more common for me to &lt;em&gt;look something up in Jira&lt;/em&gt; than to &lt;em&gt;change something in Jira&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&amp;ldquo;What was that ticket I just finished?&amp;rdquo;&lt;/li&gt;
&lt;li&gt;&amp;ldquo;Where&amp;rsquo;s the devops ticket I recently opened?&amp;rdquo;&lt;/li&gt;
&lt;li&gt;&amp;ldquo;What ticket number should I reference in my next Git commit?&amp;rdquo;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I ask myself these questions all day.&lt;/p&gt;

&lt;p&gt;It&amp;rsquo;s rarer that I actually want to close a ticket or comment on something. I set up my viewer app so that you can click on an issue to open it in your browser. Then you can use the Jira UI for editing.&lt;/p&gt;

&lt;p&gt;It&amp;rsquo;s not about rewriting the whole application. It&amp;rsquo;s just about making it suck less for my own workflow.&lt;/p&gt;

&lt;p&gt;I can&amp;rsquo;t release the source code for it at this point, alas. But I&amp;rsquo;ll just say it&amp;rsquo;s absurdly simple to write a Swift app that fetches some JSON and draws some tables in a macOS window.&lt;/p&gt;
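
&lt;p&gt;To give a flavor of how little is involved, here is the &amp;ldquo;JSON in, table out&amp;rdquo; part as a Python sketch. This is not the Swift code from the app. The sample payload is made up, though it follows the general shape of a Jira issue search response (an &lt;code&gt;issues&lt;/code&gt; array whose items have a &lt;code&gt;key&lt;/code&gt; and a &lt;code&gt;fields&lt;/code&gt; object), and the function names are hypothetical.&lt;/p&gt;

```python
import json

# Made-up sample in the general shape of a Jira issue search response.
# The keys and summaries are invented for illustration.
SAMPLE_RESPONSE = """
{
  "issues": [
    {"key": "DEMO-101",
     "fields": {"summary": "Fix login redirect",
                "status": {"name": "In Progress"}}},
    {"key": "DEMO-102",
     "fields": {"summary": "Upgrade build image",
                "status": {"name": "Closed"}}}
  ]
}
"""


def issue_rows(payload):
    """Flatten a search response into (key, status, summary) tuples."""
    rows = []
    for issue in payload.get("issues", []):
        fields = issue.get("fields", {})
        status = (fields.get("status") or {}).get("name", "")
        rows.append((issue["key"], status, fields.get("summary", "")))
    return rows


def format_table(rows):
    """Render rows as fixed-width text columns."""
    return "\n".join(
        key.ljust(10) + status.ljust(14) + summary
        for key, status, summary in rows
    )


if __name__ == "__main__":
    print(format_table(issue_rows(json.loads(SAMPLE_RESPONSE))))
```

&lt;p&gt;A native client does the same thing with nicer widgets: decode the response into a small struct, then bind a table view to the resulting list.&lt;/p&gt;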

&lt;p&gt;I don&amp;rsquo;t usually work with Swift, but this is a &lt;em&gt;trivial&lt;/em&gt; project to set up. It took me a spare afternoon or two, &lt;em&gt;including reading a bunch of SwiftUI docs and tutorials since that&amp;rsquo;s not what I usually use at work&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;(To me, the details of the implementation aren&amp;rsquo;t interesting, since all you are doing is writing a few UI components based on Apple&amp;rsquo;s stock interface elements. Every so often, I ran into &amp;ldquo;how do you update a value without triggering re-rendering loops?&amp;rdquo; But I&amp;rsquo;ve done some JavaScript component work in the past, so it was quick to resolve those issues.)&lt;/p&gt;

&lt;p&gt;I can&amp;rsquo;t emphasize this thought enough: &lt;strong&gt;if you hate the Jira UI, write your own client&lt;/strong&gt;. It turns a horrible experience into a fun toy project.&lt;/p&gt;

&lt;h2 id="i-still-hate-jira"&gt;I still hate Jira&lt;/h2&gt;

&lt;p&gt;I still hate Jira. But now, every time I open up my own little Jira viewer, I feel content.&lt;/p&gt;

&lt;p&gt;Most problems in the world can&amp;rsquo;t be solved with code. Which makes it even nicer to solve the ones that can.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Two years in enterprise software</title>
    <link rel="alternate" href="/tech/2023/07/two-years-in-enterprise-software.html"/>
    <id>/tech/2023/07/two-years-in-enterprise-software.html</id>
    <published>2023-07-21T23:13:00+00:00</published>
    <updated>2023-07-21T23:13:00+00:00</updated>
    <author>
      <name>Eli Thorkelson</name>
    </author>
    <summary type="html">&lt;p&gt;A few years ago I decided I wanted to work in a big tech company. I thought it would be an interesting experience, and I wanted to work in a place with more technically advanced colleagues.&lt;/p&gt;

&lt;p&gt;So now I work in a big tech company that makes software for...&lt;/p&gt;</summary>
    <content type="html">&lt;p&gt;A few years ago I decided I wanted to work in a big tech company. I thought it would be an interesting experience, and I wanted to work in a place with more technically advanced colleagues.&lt;/p&gt;

&lt;p&gt;So now I work in a big tech company that makes software for other big organizations. What is this like?&lt;/p&gt;

&lt;p&gt;I&amp;rsquo;ll limit myself to discussing some broad organizational patterns. Obviously it&amp;rsquo;s very delicate to write about one&amp;rsquo;s work life, and I&amp;rsquo;m not going to say anything that&amp;rsquo;s specific to the org, the tech stack, or the business. I&amp;rsquo;m just going to discuss some organizational dynamics that, from what I can tell, are similar at many large tech companies.&lt;/p&gt;

&lt;h2 id="life-in-teams"&gt;Life in teams&lt;/h2&gt;

&lt;p&gt;It&amp;rsquo;s the first place I&amp;rsquo;ve worked where everything is organized into &amp;ldquo;teams.&amp;rdquo; If there are 3 or 10 software engineers in your whole organization, you don&amp;rsquo;t need to divide things up into teams; you can just be the &amp;ldquo;engineering department.&amp;rdquo; Come to think of it, I used to work in places where I did &amp;ldquo;programming&amp;rdquo; instead of &amp;ldquo;software engineering,&amp;rdquo; and the phrase &amp;ldquo;software engineering&amp;rdquo; itself used to sound weird to me. Somehow, I got used to it.&lt;/p&gt;

&lt;p&gt;So: you have a team; you belong to it; you keep it going; you hope to improve it&amp;hellip; But you might also change teams, or get reorganized into some new structure at any moment. A team is both necessary and ephemeral. The idea of a team hints at sports, although I guess it&amp;rsquo;s also short for &amp;ldquo;scrum teams.&amp;rdquo; As &lt;a href="https://www.atlassian.com/agile/scrum"&gt;Atlassian puts it&lt;/a&gt;, &amp;ldquo;A scrum team is a small and nimble team dedicated to delivering committed product increments.&amp;rdquo;&lt;/p&gt;

&lt;p&gt;Some weeks feel more nimble than others.&lt;/p&gt;

&lt;h2 id="life-outside-teams"&gt;Life outside teams&lt;/h2&gt;

&lt;p&gt;We&amp;rsquo;re all in teams, but we&amp;rsquo;re constantly working with other teams.&lt;/p&gt;

&lt;p&gt;Because it is a big organization, the costs of coordination across teams are relatively high. Information can travel slowly from one place to another. Sometimes you meet other teams and find out that they have very different assumptions about how the world works. Here it&amp;rsquo;s handy to be trained as an anthropologist — it makes it easier to &lt;em&gt;expect difference&lt;/em&gt; instead of expecting cognitive similarity across contexts.&lt;/p&gt;

&lt;p&gt;You can sometimes start to feel slightly isolated in a big organization divided into so many silos. Some people get news faster than others, and you aren&amp;rsquo;t necessarily first to hear if you are an individual contributor. I often think of Zane Bitter&amp;rsquo;s excellent essay, &lt;a href="https://www.zerobanana.com/essays/living-in-the-future/"&gt;Senior Engineers are Living in the Future&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In this environment, I&amp;rsquo;ve tried to get good at listening to the faint incoming signals from other planets in our universe, since they often bring essential information.&lt;/p&gt;

&lt;h2 id="social-relationships"&gt;Social relationships&lt;/h2&gt;

&lt;p&gt;In a big and complex organization, human relationships are surprisingly important.&lt;/p&gt;

&lt;p&gt;When I first got there, everything was confusing, because there was so much local culture and history to assimilate. What made things easier was gradually &lt;em&gt;learning who to ask&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Now if I have a question, I usually know someone who can help me. I know people in customer support, billing, IT, sales, implementation consulting, security, infrastructure, product management, user research, user design, technical writing, architecture, and so on.&lt;/p&gt;

&lt;p&gt;It gets dramatically easier to get things done if you know who to talk to. In that sense, relationships have a certain &lt;em&gt;usefulness&lt;/em&gt;. You could even call them an &amp;ldquo;asset,&amp;rdquo; although I find that term a bit dehumanizing.&lt;/p&gt;

&lt;p&gt;In any case, it&amp;rsquo;s a much friendlier place to work when you have more people to talk to.&lt;/p&gt;

&lt;h2 id="technical-specialization"&gt;Technical specialization&lt;/h2&gt;

&lt;p&gt;I&amp;rsquo;m much more specialized than I used to be. I don&amp;rsquo;t write front end code anymore. I don&amp;rsquo;t have server admin accounts or write deployment scripts. I don&amp;rsquo;t usually get the alerts if the systems are broken. I never get email from our customers. (Well, maybe once, ever.) I work strictly on back end software development, focusing on one particular technical area of a particular enterprise product.&lt;/p&gt;

&lt;p&gt;The pervasiveness of specialization enables a certain kind of focus, which is the purpose of it, of course. Do one thing and do it well. And yet specialization has some funny side effects. There&amp;rsquo;s a constant risk of tunnel vision. People get invested in the minutiae of code style, in a way that I never saw in small shops. We have debates about automatic code formatting rules, and then we have meta debates about how to refactor the automatic code formatting rules. People sweat over trying to standardize code interfaces and software design patterns. I see a lot of enterprise style code, by which I mean, code that is carved up into many tiny pieces and wrapped in many layers of abstraction. I always see that as an &lt;em&gt;aesthetic&lt;/em&gt; preference as much as anything else.&lt;/p&gt;

&lt;p&gt;I used to laugh a lot at &lt;a href="https://github.com/EnterpriseQualityCoding/FizzBuzzEnterpriseEdition"&gt;FizzBuzz Enterprise Edition&lt;/a&gt;. If you never saw it, it takes a &lt;a href="https://wiki.c2.com/?FizzBuzzTest"&gt;very simple assignment&lt;/a&gt; and overcomplicates it with sententious ceremonies, too many design patterns, and too many abstractions.&lt;/p&gt;

&lt;p&gt;Now it&amp;rsquo;s not as funny&amp;hellip; because it hits closer to home.&lt;/p&gt;

&lt;p&gt;While our teams are specialized by topic, our work can also involve a lot of role ambiguity. Some software engineers end up doing project management. Some of us end up knowing a lot about the infrastructure, even though we don&amp;rsquo;t technically work on infrastructure, because we end up needing to solve problems that don&amp;rsquo;t follow the org chart. Paradoxically, we&amp;rsquo;re very specialized, and yet we&amp;rsquo;re often working outside our specialties.&lt;/p&gt;

&lt;h2 id="architecture"&gt;Architecture&lt;/h2&gt;

&lt;p&gt;The organization needs &amp;ldquo;architecture&amp;rdquo; (and the people who specialize in it, &amp;ldquo;architects&amp;rdquo;) as the seemingly natural corollary of having so much specialization. We have such large systems that it&amp;rsquo;s hard to keep track of all of them and how we fit together. Enter the &amp;ldquo;architect,&amp;rdquo; a paradoxical role for a specialist in generalization. Or rather: a specialist in thinking about systems holistically.&lt;/p&gt;

&lt;p&gt;The career path for software engineers largely points towards becoming architects. It&amp;rsquo;s a career direction for successful software people who don&amp;rsquo;t want to become managers.&lt;/p&gt;

&lt;p&gt;In practice, our local architects are fun to talk to and very thoughtful. But sometimes it surprises me that systems thinking isn&amp;rsquo;t considered &lt;em&gt;a core competency for all software engineers&lt;/em&gt; instead of a prestigious &lt;em&gt;specialization&lt;/em&gt;.&lt;/p&gt;

&lt;h2 id="scale-and-surprises"&gt;Scale and surprises&lt;/h2&gt;

&lt;p&gt;We have more users and larger-scale architecture than I&amp;rsquo;m used to. I certainly don&amp;rsquo;t work on anything that&amp;rsquo;s &amp;ldquo;internet scale,&amp;rdquo; but our systems do have &lt;em&gt;lots&lt;/em&gt; of active users. These users produce a long stream of interesting feedback, new feature requests, and above all, a stream of new edge cases. You have to solve more edge cases when you have a larger and more demanding user base.&lt;/p&gt;

&lt;p&gt;Our attention becomes a scarce resource compared to the scale of the system.&lt;/p&gt;

&lt;p&gt;To put things in perspective: At my first full time gig, we had a small number of users, and I used to get an email every single time our code raised an exception in production. Days went by without getting those emails.&lt;/p&gt;

&lt;p&gt;Now we have a system that manages production exception reports in large volumes. It has options like &amp;ldquo;Don&amp;rsquo;t bother me again about this until it happens another 100 times.&amp;rdquo;&lt;/p&gt;

&lt;p&gt;As always, it&amp;rsquo;s very hard to write code without making lots of &lt;em&gt;assumptions&lt;/em&gt; about the world in which it will be executed. It&amp;rsquo;s impossible for software engineers to anticipate &lt;em&gt;everything&lt;/em&gt;, no matter how we try. Thus, while we have fairly rigorous testing processes, we still get surprised by what users do with our systems.&lt;/p&gt;

&lt;p&gt;Sometimes it&amp;rsquo;s unclear if it&amp;rsquo;s a bug or a new use case.&lt;/p&gt;

&lt;h2 id="jira"&gt;Jira&lt;/h2&gt;

&lt;p&gt;We use Jira a lot. &lt;em&gt;A lot.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I do not love it.&lt;/p&gt;

&lt;p&gt;I&amp;rsquo;ll save my thoughts about it for another time.&lt;/p&gt;

&lt;h2 id="the-tech-community"&gt;The tech community&lt;/h2&gt;

&lt;p&gt;I used to feel slightly more in touch with the rest of the web technical community. If you work in small places, you are more often &lt;em&gt;allowed&lt;/em&gt; to try new things without huge barriers. You can test new libraries, new architectures or new styles without having to convince a large organization to approve it. You can probably contribute to open source projects, if that&amp;rsquo;s relevant.&lt;/p&gt;

&lt;p&gt;In a larger organization, there is a much larger &lt;em&gt;internal&lt;/em&gt; technical community, which substitutes to some extent for interaction with the larger ecosystem. It&amp;rsquo;s like living in a microclimate: it has its own weather patterns; it&amp;rsquo;s less tightly coupled to the surrounding ecosystem.&lt;/p&gt;

&lt;p&gt;I&amp;rsquo;m not saying we are totally decoupled, of course. We keep a close eye on our dependency chain. We integrate lots of new things. Lots of my colleagues read Hacker News to keep an eye on the zeitgeist.&lt;/p&gt;

&lt;p&gt;But there&amp;rsquo;s a certain turn inward just because the internal environment is, comparatively, so &lt;em&gt;large&lt;/em&gt;, and so decisive for people&amp;rsquo;s careers within the organization.&lt;/p&gt;

&lt;p&gt;Here&amp;rsquo;s a good barometer of that.&lt;/p&gt;

&lt;p&gt;I used to go to public tech conferences sometimes, not to present, just to listen and learn some new things. Now I have an internal tech conference to attend instead.&lt;/p&gt;

&lt;h2 id="working-environment"&gt;Working environment&lt;/h2&gt;

&lt;p&gt;It&amp;rsquo;s a pretty good working environment. We rarely have &amp;ldquo;drop everything for this emergency&amp;rdquo; problems. That&amp;rsquo;s much more common in agency work.&lt;/p&gt;

&lt;p&gt;My current management is fairly hands off, which I love. You get handed projects, plan timelines to deliver them, provide (lots of) project documentation, and implement. You explain your sprint plan to your teammates, and announce if your plans get blocked by something, so there is some visibility into your work plans, but it isn&amp;rsquo;t otherwise micromanaged.&lt;/p&gt;

&lt;p&gt;That being said, I do notice two things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The longer you are there, the more meetings you seem to end up in.&lt;/li&gt;
&lt;li&gt;The longer you&amp;rsquo;re there, the more you get pinged with unexpected questions and requests. (Not that you &lt;em&gt;must&lt;/em&gt; address them all, but you &lt;em&gt;could&lt;/em&gt;.)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;There&amp;rsquo;s a lot more you could say about working in a big public company, but I&amp;rsquo;m trying to keep it broad strokes.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>How to downsize a tiny web server and the services on it</title>
    <link rel="alternate" href="/tech/2023/05/downsizing-servers.html"/>
    <id>/tech/2023/05/downsizing-servers.html</id>
    <published>2023-05-09T10:24:00+00:00</published>
    <updated>2023-05-09T10:24:00+00:00</updated>
    <author>
      <name>Eli Thorkelson</name>
    </author>
    <summary type="html">&lt;p&gt;Until yesterday, I hosted this website on DigitalOcean. Now it’s on EC2 instead. These are a few notes about how and why.&lt;/p&gt;

&lt;h2 id="the-old-setup"&gt;The old setup&lt;/h2&gt;

&lt;p&gt;For nine years, I’ve hosted this site on the same virtual machine on DigitalOcean. It was originally Ubuntu 14...&lt;/p&gt;</summary>
    <content type="html">&lt;p&gt;Until yesterday, I hosted this website on DigitalOcean. Now it&amp;rsquo;s on EC2 instead. These are a few notes about how and why.&lt;/p&gt;

&lt;h2 id="the-old-setup"&gt;The old setup&lt;/h2&gt;

&lt;p&gt;For nine years, I&amp;rsquo;ve hosted this site on the same virtual machine on DigitalOcean. It was originally Ubuntu 14, and later upgraded to Ubuntu 18, which is now EOL.&lt;/p&gt;

&lt;p&gt;It&amp;rsquo;s always been a tiny system running NGINX. Behind NGINX, I&amp;rsquo;ve hosted a bunch of different web services. There were several WordPress sites, a short-lived Drupal site, and even some raw PHP scripts. There were a couple of toy Ruby on Rails projects. They all had hosting configuration, SSL certs, their own Linux users, their own databases (MySQL, SQLite), their own server processes (Unicorn, PHP-FPM). Most of them have been decommissioned, and along the way, the system got full of the debris of old projects.&lt;/p&gt;

&lt;p&gt;(It&amp;rsquo;s great to have a sandbox. Highly recommended.)&lt;/p&gt;

&lt;p&gt;The server itself was originally the classic $5 DigitalOcean droplet. It has 1 GB of RAM and enough disk space for my projects. I also paid an extra 20% for automatic backups, just in case.&lt;/p&gt;

&lt;p&gt;But recently, DigitalOcean raised all the prices by 20%, so I started to wonder: &lt;strong&gt;Is $7.20/month really the best I can do for basic Linux web hosting?&lt;/strong&gt;&lt;/p&gt;

&lt;h2 id="tiny-web-servers-on-aws-ec2"&gt;Tiny web servers on AWS EC2&lt;/h2&gt;

&lt;p&gt;I like having a Linux web server to play with. I don&amp;rsquo;t want to go serverless. I don&amp;rsquo;t want to move all my static sites to S3 and my remaining back end services to AWS Lambda. And while DigitalOcean isn&amp;rsquo;t the absolute cheapest option for a cheap Linux VPS, most of the competition is not a lot cheaper. (And DigitalOcean has been flawless for my use case, with a nice user interface and excellent service, so that&amp;rsquo;s worth something too.)&lt;/p&gt;

&lt;p&gt;&lt;em&gt;But&lt;/em&gt; then I noticed that you can get a t4g.nano EC2 instance for $22/year, which is 69% cheaper than DigitalOcean. The big tradeoff is that it&amp;rsquo;s 500 MB of RAM (and an expectation of very low average load, which is probably fine for me 🤞🏼).&lt;/p&gt;

&lt;p&gt;So I thought it over and decided that the cost savings was worthwhile. And maybe I could clean up the cruft from my old server while I was at it.&lt;/p&gt;

&lt;h2 id="notes-on-amazon-linux-2023"&gt;Notes on Amazon Linux 2023&lt;/h2&gt;

&lt;p&gt;I spun up a new EC2 instance with Amazon Linux and moved all my static sites there. Here are a few notes on that experience.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;It &lt;a href="https://github.com/amazonlinux/amazon-linux-2023/issues/300"&gt;does not have Cron&lt;/a&gt;. I was kind of amazed by this, since Cron seems like a bedrock part of Unix-like systems. The Amazon Linux devs feel that you can just use &lt;a href="https://www.freedesktop.org/software/systemd/man/systemd.timer.html"&gt;systemd timers&lt;/a&gt; instead. I only need this functionality for renewing Let&amp;rsquo;s Encrypt certs, so I followed &lt;a href="https://stevenwestmoreland.com/2017/11/renewing-certbot-certificates-using-a-systemd-timer.html"&gt;Steven Westmoreland&amp;rsquo;s handy instructions&lt;/a&gt; to set that up.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The package manager is annoying compared to Ubuntu. There&amp;rsquo;s churn from one Amazon Linux version to the next, so some of the docs are useless, and some basic packages are unavailable. In particular, &lt;a href="https://certbot.eff.org/"&gt;Certbot&lt;/a&gt; seems like it probably should be present in every Linux package manager by this point. I had to install it with &lt;code&gt;pip&lt;/code&gt;, which is a &lt;a href="https://certbot.eff.org/instructions?ws=nginx&amp;amp;os=pip"&gt;barely-supported official installation approach&lt;/a&gt;. I guess Amazon just doesn&amp;rsquo;t want to encourage using Let&amp;rsquo;s Encrypt.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Other than those minor details, it was pretty straightforward to get a new system set up and running. It isn&amp;rsquo;t exactly the same as Ubuntu, but the differences aren&amp;rsquo;t consequential.&lt;/p&gt;

&lt;h2 id="from-ruby-to-go"&gt;From Ruby to Go&lt;/h2&gt;

&lt;p&gt;It was easy to migrate the static HTML sites from one server to another. You just create the right web root directories, copy the NGINX vhost configuration from the old server (with a few updates), and point the deploy scripts to a new place.&lt;/p&gt;

&lt;p&gt;My static sites are all built by &lt;a href="https://middlemanapp.com/"&gt;Middleman&lt;/a&gt;, so the deploy scripts are very simple, like this:&lt;/p&gt;

&lt;pre&gt;&lt;code class="bash"&gt;#!/bin/bash

bundle exec middleman build
rsync -rzvu build/ webserver:/path/to/webroot
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;But I do have one project left over that isn&amp;rsquo;t a static site. It&amp;rsquo;s a tiny Rails app that shows a little &lt;a href="https://rip-my-academic-career.decasia.org"&gt;art project about leaving academia&lt;/a&gt;. It did some server side rendering for a few templates, accepted user input, and updated records in a SQLite database. Using Rails made it very quick to put together (it was the classic &amp;ldquo;optimize for dev time&amp;rdquo; approach).&lt;/p&gt;

&lt;p&gt;The problem is, even a tiny Rails app takes a couple of hundred megabytes of RAM to run. It&amp;rsquo;s not a good choice for a very tiny, low resource web server.&lt;/p&gt;

&lt;p&gt;So I decided to rewrite my Rails app like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;All the front end code (HTML/CSS/JavaScript) was moved into a new React app. I&amp;rsquo;ve never used React before, but it was easy to get going and spin up some basic components. I didn&amp;rsquo;t try to write a single page app; I just used React to render components on top of static HTML files served by NGINX. The HTML/CSS mostly stayed the same; the JavaScript had to get rewritten, but it was fun to do that.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;All the back end code was rewritten in Go, which is a language I have never touched before. I was looking for a language that could offer good performance and low resource use, but something with a better developer experience than writing a web service in C. I looked a little bit at Rust and Go; Go was the obvious winner for my use case. It turned out to take less than 500 lines of Go to spin up a basic web service that could accept form submissions, update a SQLite database, and write a JSON file that provided data to the React user interface.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I felt pleased with myself for figuring out how to write a systemd service that would deploy my app properly, complete with logging and configuration in env vars. I even wrote a handy deployment script that builds for ARM64 (because my EC2 instance runs on ARM), copies the compiled application to the server, stops the service, copies the compiled application to the right place on the filesystem, and restarts the service.&lt;/p&gt;
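
&lt;p&gt;A rough sketch of those deploy steps might look like the following. It is written as a small Python function that only builds the list of commands rather than running them, and it is not the actual script. The host name, service name, staging path and install directory are placeholders, and the use of &lt;code&gt;scp&lt;/code&gt; and &lt;code&gt;install&lt;/code&gt; for the copying is an assumption.&lt;/p&gt;

&lt;pre&gt;&lt;code class="python"&gt;
```python
def plan_deploy(host, service, binary, install_dir):
    """Return the deploy steps as (extra_env, argv) pairs, in order.

    All four arguments are hypothetical placeholders for illustration.
    """
    staging = '/tmp/' + binary
    target = install_dir + '/' + binary
    return [
        # 1. Cross-compile for a 64-bit ARM Linux host.
        ({'GOOS': 'linux', 'GOARCH': 'arm64'},
         ['go', 'build', '-o', binary, '.']),
        # 2. Copy the compiled binary to a staging path on the server.
        ({}, ['scp', binary, host + ':' + staging]),
        # 3. Stop the running service.
        ({}, ['ssh', host, 'sudo', 'systemctl', 'stop', service]),
        # 4. Put the new binary in its final place.
        ({}, ['ssh', host, 'sudo', 'install', '-m', '0755', staging, target]),
        # 5. Start the service again.
        ({}, ['ssh', host, 'sudo', 'systemctl', 'start', service]),
    ]


if __name__ == '__main__':
    # Print the plan instead of executing it.
    for extra_env, argv in plan_deploy('webserver', 'myapp', 'myapp', '/opt/myapp'):
        prefix = ' '.join(k + '=' + v for k, v in extra_env.items())
        print((prefix + ' ' + ' '.join(argv)).strip())
```
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Setting &lt;code&gt;GOOS=linux&lt;/code&gt; and &lt;code&gt;GOARCH=arm64&lt;/code&gt; is what tells the Go toolchain to cross-compile for a 64-bit ARM Linux machine.&lt;/p&gt;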

&lt;p&gt;The new Go back end service runs in 13 MB of memory instead of 97 MB, and is easier to manage (because it has fewer moving parts and a simpler deployment process).&lt;/p&gt;

&lt;h2 id="migrating-dns"&gt;Migrating DNS&lt;/h2&gt;

&lt;p&gt;One of the worst parts about leaving DigitalOcean was leaving behind their excellent interface for updating DNS records.&lt;/p&gt;

&lt;p&gt;I changed nameservers for my domain, and I moved all my DNS records over to the DNS system provided by my domain registrar (currently NameCheap). But I was pretty frustrated with how much less nice their interface was compared to DO. It required some downtime for the site as well, basically because of the bad UI choices (no way to provide structured data, or to provide configuration &lt;em&gt;before&lt;/em&gt; changing the nameserver setting).&lt;/p&gt;

&lt;p&gt;This is just my private website, so it was OK. But in a medium-sized org, I&amp;rsquo;d think you would probably &lt;em&gt;not&lt;/em&gt; want to ever switch nameservers if you could possibly help it, or at least you would use DNS services from someone who provided a much better admin experience.&lt;/p&gt;

&lt;h2 id="was-this-worth-it"&gt;Was this worth it?&lt;/h2&gt;

&lt;p&gt;If I&amp;rsquo;m being honest, probably the cost savings ($50/year) was not worth the several evenings I spent moving everything around and writing a new Go application. If I really counted up the hours to move everything over and write two new projects (golang/React), even though it was a relatively quick and straightforward project, I would probably bill someone at least a few thousand dollars at professional software rates. Probably a lot more if it were a full fledged consulting project.&lt;/p&gt;

&lt;p&gt;But money aside, it feels excellent to learn new things. It&amp;rsquo;s great to explore new hosting options, new infrastructure, new programming languages. So &amp;hellip; yes it was worth it; but the value was more intellectual than economic.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Just where do env vars come from?</title>
    <link rel="alternate" href="/tech/2023/03/where-do-env-vars-come-from.html"/>
    <id>/tech/2023/03/where-do-env-vars-come-from.html</id>
    <published>2023-03-06T20:16:00+00:00</published>
    <updated>2023-03-06T20:16:00+00:00</updated>
    <author>
      <name>Eli Thorkelson</name>
    </author>
    <summary type="html">&lt;p&gt;Famously, Linux processes accept an array of arguments at start time. In C, this looks like &lt;code&gt;int main(int argc, char *argv[])&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;But as we all learn sometime after writing hello world for the first time, these arguments aren’t the &lt;em&gt;only&lt;/em&gt; arguments passed...&lt;/p&gt;</summary>
    <content type="html">&lt;p&gt;Famously, Linux processes accept an array of arguments at start time. In C, this looks like &lt;code&gt;int main(int argc, char *argv[])&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;But as we all learn sometime after writing hello world for the first time, these arguments aren&amp;rsquo;t the &lt;em&gt;only&lt;/em&gt; arguments passed to your program at startup. There&amp;rsquo;s also a second set of arguments, termed the &lt;em&gt;environment&lt;/em&gt;. These are the things we know colloquially as &amp;ldquo;env vars.&amp;rdquo;&lt;/p&gt;

&lt;pre&gt;&lt;code class="bash"&gt;# Passing into argv:
$ my_program --name emma

# Passing into an env var:
$ NAME=&amp;quot;emma&amp;quot; my_program
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;(You can access them in C with &lt;code&gt;getenv&lt;/code&gt; but you can also declare them &lt;a href="https://www.gnu.org/software/libc/manual/html_node/Program-Arguments.html"&gt;as an argument to main&lt;/a&gt;, like &lt;code&gt;int main(int argc, char *argv[], char *envp[])&lt;/code&gt;. This makes them available as a local variable.)&lt;/p&gt;

&lt;p&gt;Environment variables are a complex system of their own. In an organization like mine, managing environment variables is a huge endeavor.&lt;/p&gt;

&lt;p&gt;I started to get curious: What is an environment, technically speaking? And where does the environment come from?&lt;/p&gt;

&lt;h2 id="data-structure"&gt;Data structure&lt;/h2&gt;

&lt;p&gt;The arguments are an array (an ordered list), whereas the &lt;em&gt;environment&lt;/em&gt; sometimes acts like a dictionary.&lt;/p&gt;

&lt;p&gt;Until I wrote this post, I imagined that in Ruby, the environment was literally a hash (&lt;a href="https://ruby-doc.org/core-2.5.0/ENV.html"&gt;ENV&lt;/a&gt;). It turns out that no, Ruby just wraps the OS&amp;rsquo;s environment implementation in a &lt;a href="https://github.com/ruby/ruby/blob/master/hash.c#L6265"&gt;hash-like interface&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Generally speaking, in Unix-like systems, the environment variables are not implemented as a hash table. They are just an unordered array of null-terminated strings, where each string is a key-value pair joined by the character &lt;code&gt;=&lt;/code&gt;. (Therefore, you can&amp;rsquo;t use the &lt;code&gt;=&lt;/code&gt; character in an env var name, though it is perfectly valid as part of the value.) The final item in the list of env vars is a null pointer.&lt;/p&gt;
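
&lt;p&gt;To make that layout concrete, here is a small Python model of it. This is only an illustration of the data structure described above, not how any libc stores things: the function names are made up for this sketch, and Python&amp;rsquo;s &lt;code&gt;None&lt;/code&gt; stands in for the trailing null pointer.&lt;/p&gt;

&lt;pre&gt;&lt;code class="python"&gt;
```python
def split_entry(entry):
    """Split one 'NAME=value' string at the first '=' only."""
    name, _, value = entry.partition('=')
    return name, value


def build_envp(variables):
    """Flatten a dict into a list of 'NAME=value' strings plus a terminator."""
    envp = [name + '=' + value for name, value in variables.items()]
    envp.append(None)  # stands in for the trailing null pointer
    return envp


def parse_envp(envp):
    """Walk the list until the terminator, collecting entries into a dict."""
    variables = {}
    for entry in envp:
        if entry is None:
            break
        name, value = split_entry(entry)
        variables[name] = value
    return variables


if __name__ == '__main__':
    print(build_envp({'NAME': 'emma', 'URL': 'https://example.org/?a=b'}))
```
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Because the split happens at the first &lt;code&gt;=&lt;/code&gt;, a value such as &lt;code&gt;https://example.org/?a=b&lt;/code&gt; survives intact, while a name containing &lt;code&gt;=&lt;/code&gt; could never be read back correctly.&lt;/p&gt;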

&lt;p&gt;You can then use some common accessor functions &lt;a href="https://www.gnu.org/software/libc/manual/html_node/Environment-Access.html"&gt;provided by glibc&lt;/a&gt;. The most important ones are &lt;a href="https://github.com/lattera/glibc/blob/master/stdlib/setenv.c"&gt;setenv&lt;/a&gt;, &lt;a href="https://github.com/lattera/glibc/blob/master/stdlib/putenv.c"&gt;putenv&lt;/a&gt;, &lt;a href="https://github.com/lattera/glibc/blob/master/stdlib/getenv.c"&gt;getenv&lt;/a&gt;. Setenv can only add or update env vars, while putenv can also remove them, and getenv is obvious.&lt;/p&gt;

&lt;p&gt;Every time you get an env variable from glibc, it does a linear search through the current list of env vars to find a match. (For a slight performance boost, the current implementation filters by the first two characters before doing a full string comparison.)&lt;/p&gt;
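
&lt;p&gt;The lookup described above can be modeled in a few lines of Python. This is a sketch of the idea (a linear scan with a cheap two-character check before the full comparison), not a port of the glibc source; &lt;code&gt;linear_getenv&lt;/code&gt; is a made-up name, and the list again ends with &lt;code&gt;None&lt;/code&gt; in place of a null pointer.&lt;/p&gt;

&lt;pre&gt;&lt;code class="python"&gt;
```python
def linear_getenv(envp, name):
    """Scan a list of 'NAME=value' strings for name; return its value or None."""
    wanted = name + '='
    for entry in envp:
        if entry is None:
            # Reached the terminator without finding the name.
            return None
        if entry[:2] != wanted[:2]:
            # Cheap pre-filter on the first two characters.
            continue
        if entry.startswith(wanted):
            # Full comparison matched; everything after '=' is the value.
            return entry[len(wanted):]
    return None


if __name__ == '__main__':
    demo = ['PATH=/usr/bin:/bin', 'PAGER=less', 'P=1', None]
    print(linear_getenv(demo, 'PAGER'))
```
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The cost of each lookup grows with the number of variables, which is rarely a problem in practice because environments tend to be small.&lt;/p&gt;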

&lt;h2 id="are-environment-variables-part-of-the-operating-system"&gt;Are environment variables part of the operating system?&lt;/h2&gt;

&lt;p&gt;I started to wonder: Are environment variables a fundamental feature of the Linux kernel? Are they part of the definition of a process? Are env variables handled by the system task manager?&lt;/p&gt;

&lt;p&gt;Answer: Not really. If you look at what&amp;rsquo;s stored in the Linux kernel for each process, it doesn&amp;rsquo;t contain anything like a list of environment variables. (At least that&amp;rsquo;s how it looks to me from taking a glance at &lt;code&gt;task_struct&lt;/code&gt;, the &lt;a href="https://github.com/torvalds/linux/blob/master/include/linux/sched.h#L737"&gt;kernel data structure that represents a process&lt;/a&gt;.)&lt;/p&gt;

&lt;p&gt;&lt;em&gt;However&lt;/em&gt;, it turns out that the environment is part of the &lt;em&gt;calling conventions&lt;/em&gt; of program execution in Unix systems. For example, it&amp;rsquo;s common to use a Linux system interface called &lt;code&gt;execve&lt;/code&gt; to execute new programs. (&lt;code&gt;execve&lt;/code&gt; is &lt;a href="https://github.com/bminor/bash/blob/master/execute_cmd.c#L5948"&gt;what the Bash shell uses&lt;/a&gt; to execute a command.) And when you call &lt;code&gt;execve&lt;/code&gt;, you must pass the environment variables as an argument: &lt;code&gt;int execve(const char *pathname, char *const argv[], char *const envp[])&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Thus, Linux absolutely does expect that every new process will be invoked with environment variables (even if the environment variables are an empty array). The environment variables aren&amp;rsquo;t used for process management by the kernel; they are just provided to your program as part of the program data (&lt;a href="https://github.com/torvalds/linux/blob/master/fs/exec.c#L491"&gt;stored on the stack&lt;/a&gt;). You can then use that data for anything you want.&lt;/p&gt;
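
&lt;p&gt;You can watch this calling convention from a higher-level language. The sketch below uses Python&amp;rsquo;s &lt;code&gt;subprocess&lt;/code&gt; module, whose &lt;code&gt;env&lt;/code&gt; argument becomes the environment handed to the new program (on Linux this ends up in an &lt;code&gt;execve&lt;/code&gt;-style call, as far as I can tell). It assumes a POSIX system with &lt;code&gt;/bin/sh&lt;/code&gt;; &lt;code&gt;read_in_child&lt;/code&gt; is a made-up helper name.&lt;/p&gt;

&lt;pre&gt;&lt;code class="python"&gt;
```python
import subprocess


def read_in_child(name, child_env):
    """Run a child shell with exactly child_env; return what it sees for name."""
    # ${VAR-unset} expands to the value if VAR is set, else to the word 'unset'.
    script = 'printf %s "${' + name + '-unset}"'
    result = subprocess.run(
        ['/bin/sh', '-c', script],
        env=child_env,  # the complete environment for the new program
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == '__main__':
    print(read_in_child('NAME', {'NAME': 'emma'}))
    print(read_in_child('NAME', {}))
```
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The child sees only what the caller chose to pass: an empty dictionary is a perfectly valid environment, matching the empty-array case mentioned above.&lt;/p&gt;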

&lt;aside class="blog-aside"&gt;

&lt;p class="aside-title"&gt;A bit of history&lt;/p&gt;

&lt;p&gt;It turns out that env vars were &lt;a href="https://en.wikipedia.org/wiki/Environment_variable"&gt;originally implemented in 1979&lt;/a&gt; in &lt;a href="https://en.wikipedia.org/wiki/Version_7_Unix"&gt;Unix Version 7&lt;/a&gt;, using the same technical design that Linux still uses today. John Mashey, who worked on the initial implementation, provides a handy history of this feature in a &lt;a href="https://groups.google.com/g/comp.std.c/c/V2n8OkpfAkM/m/piKGMCr2QEUJ?pli=1"&gt;comp.std.c post from 2004&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Mashey confirms my intuition that fundamentally, env vars are equivalent to an additional class of program arguments:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;"7th Edition UNIX environment variables were really thought of as convenient, usually-hidden extra arguments, with no more read-onlyness than regular arguments."&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;See also the original implementation of getenv.c (&lt;a href="https://github.com/v7unix/v7unix/blob/master/v7/usr/src/libc/gen/getenv.c"&gt;GitHub&lt;/a&gt;, &lt;a href="https://www.tuhs.org/cgi-bin/utree.pl?file=V7/usr/src/libc/gen/getenv.c"&gt;tuhs.org&lt;/a&gt;).&lt;/p&gt;

&lt;/aside&gt;

&lt;h2 id="where-does-the-environment-come-from"&gt;Where does the environment come from?&lt;/h2&gt;

&lt;p&gt;One of the things you learn as a working software developer is that &lt;em&gt;usually&lt;/em&gt; the env vars are inherited from the parent process by default. Of course, the environment can be modified when you invoke the child process, but it&amp;rsquo;s often the case that, for instance, the &lt;code&gt;PATH&lt;/code&gt; and other crucial env vars are propagated down through the process tree, unchanged unless you explicitly change them. There is kind of an implicit tree of env vars, starting at a parent process and propagating across all the child processes.&lt;/p&gt;

&lt;p&gt;This being said, there are plenty of special cases where the child environment is reset to blank. Most often that would be for security reasons of one kind or another. As a result, environment variables aren&amp;rsquo;t &lt;em&gt;really&lt;/em&gt; a tree structure; they are a sort of broken tree, logically speaking.&lt;/p&gt;
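
&lt;p&gt;Here is a small experiment that shows both behaviors: a variable flowing down through a chain of processes, and the chain being cut when one process in the middle starts its child with a blank environment. It is a sketch that assumes a POSIX system with &lt;code&gt;/bin/sh&lt;/code&gt; and &lt;code&gt;/usr/bin/env&lt;/code&gt;; &lt;code&gt;DEMO_VAR&lt;/code&gt; and &lt;code&gt;grandchild_value&lt;/code&gt; are made-up names.&lt;/p&gt;

&lt;pre&gt;&lt;code class="python"&gt;
```python
import os
import subprocess

# Prints the value of DEMO_VAR, or the word 'unset' if it is not in the environment.
PRINT_DEMO_VAR = 'printf %s "${DEMO_VAR-unset}"'


def grandchild_value(reset_in_middle):
    """Start a chain of shells and report what DEMO_VAR looks like at the bottom."""
    inner = ['/bin/sh', '-c', PRINT_DEMO_VAR]
    if reset_in_middle:
        # env -i starts the next program with an empty environment.
        inner = ['/usr/bin/env', '-i'] + inner
    # The top of the chain gets our own environment plus one extra variable.
    top_env = dict(os.environ, DEMO_VAR='from-root')
    result = subprocess.run(
        # The outer shell simply runs its arguments as a command.
        ['/bin/sh', '-c', '"$@"', 'sh'] + inner,
        env=top_env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == '__main__':
    print(grandchild_value(reset_in_middle=False))
    print(grandchild_value(reset_in_middle=True))
```
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Nothing in the outer shell mentions &lt;code&gt;DEMO_VAR&lt;/code&gt;, yet the innermost shell still sees it unless something in between deliberately wipes the environment.&lt;/p&gt;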

&lt;p&gt;Even so, we can still try to follow the tree up as far as we can. The question becomes: where does our environment get its initial state?&lt;/p&gt;

&lt;aside class="blog-aside"&gt;

&lt;p class="aside-title"&gt;On reading Linux codebases&lt;/p&gt;

&lt;p&gt;I feel I should issue a caveat here that I have no deep expertise in low-level Linux software development. The comments that follow are gleaned from my own browsing around through some of these codebases (&lt;code&gt;glibc&lt;/code&gt;, &lt;code&gt;openssh&lt;/code&gt;, &lt;code&gt;util-linux&lt;/code&gt;, &lt;code&gt;systemd&lt;/code&gt;), but I would welcome corrections from more expert developers.&lt;/p&gt;

&lt;/aside&gt;

&lt;h3 id="1-the-shell"&gt;1. The shell&lt;/h3&gt;

&lt;p&gt;We often run Linux programs through a &lt;em&gt;shell&lt;/em&gt;. Thus when you invoke a process, it&amp;rsquo;s common to get the initial set of env vars from the shell. You might customize these env vars in your shell configuration, typically with &lt;code&gt;export FOO=&amp;quot;bar&amp;quot;&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;A shell like Bash has its own variable handling system (&lt;a href="https://github.com/bminor/bash/blob/master/variables.c"&gt;bash:variables.c&lt;/a&gt;) that&amp;rsquo;s separate from the glibc environment handling system. But this variable handling system is itself initialized from the parent environment in &lt;a href="https://github.com/bminor/bash/blob/74091dd4e8086db518b30df7f222691524469998/variables.c#L359"&gt;#initialize_shell_variables&lt;/a&gt;.&lt;/p&gt;
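
&lt;p&gt;One visible consequence of the shell keeping its own variable table is that a plain shell variable is not part of the environment until it is exported. The following sketch checks that from Python by asking a shell to set one variable of each kind and then run &lt;code&gt;env&lt;/code&gt;. It assumes a POSIX system with &lt;code&gt;/bin/sh&lt;/code&gt; and &lt;code&gt;/usr/bin/env&lt;/code&gt;; the variable names are made up.&lt;/p&gt;

&lt;pre&gt;&lt;code class="python"&gt;
```python
import subprocess


def names_visible_to_child():
    """Return the variable names that a program started by the shell can see."""
    # LOCAL_ONLY lives only in the shell's own variable table;
    # SHARED is exported, so it is copied into the child's environment.
    script = 'LOCAL_ONLY=1; export SHARED=2; /usr/bin/env'
    result = subprocess.run(
        ['/bin/sh', '-c', script],
        env={'PATH': '/usr/bin:/bin'},  # small, predictable starting environment
        capture_output=True,
        text=True,
        check=True,
    )
    # env prints one NAME=value pair per line; keep only the names.
    return {line.partition('=')[0] for line in result.stdout.splitlines()}


if __name__ == '__main__':
    print(sorted(names_visible_to_child()))
```
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The shell may add a few variables of its own to the output (which ones depends on the shell), but &lt;code&gt;LOCAL_ONLY&lt;/code&gt; should never be among them.&lt;/p&gt;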

&lt;p&gt;So where does your shell session get its initial env vars from?&lt;/p&gt;

&lt;h3 id="2-sshd-login"&gt;2. sshd/login&lt;/h3&gt;

&lt;p&gt;The shell gets its env vars from &lt;em&gt;its&lt;/em&gt; parent process. If you log in from a console, your shell will be spawned by a process called &lt;code&gt;login&lt;/code&gt; (the process that checks your credentials and then invokes your designated shell process). If you log in with SSH, your shell will be spawned by the &lt;code&gt;sshd&lt;/code&gt; process.&lt;/p&gt;

&lt;p&gt;OpenSSH provides a function called &lt;code&gt;do_setup_env&lt;/code&gt; that initializes the basic environment variables before loading your shell. These would include HOME, USER, SHELL, TERM, and PATH (see &lt;a href="https://github.com/openssh/openssh-portable/blob/23b8cb41767af99a1aac24589d1882d9c8c2c205/session.c#L1023"&gt;openssh-portable:session.c&lt;/a&gt;). The analogous function in &lt;code&gt;login&lt;/code&gt; would be &lt;code&gt;init_environ&lt;/code&gt;, which does similar operations (see &lt;a href="https://github.com/util-linux/util-linux/blob/master/login-utils/login.c#L1180"&gt;util-linux:login-utils/login.c&lt;/a&gt;).&lt;/p&gt;
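
&lt;p&gt;To make the idea concrete, here is a hedged Ruby sketch of what &amp;ldquo;building a login environment from scratch&amp;rdquo; might look like. It is not a port of &lt;code&gt;do_setup_env&lt;/code&gt; or &lt;code&gt;init_environ&lt;/code&gt;: the &lt;code&gt;Account&lt;/code&gt; struct, the &lt;code&gt;login_env&lt;/code&gt; function, and the default PATH and TERM values are all assumptions made for illustration.&lt;/p&gt;

&lt;pre&gt;&lt;code class="ruby"&gt;# Hypothetical stand-in for a passwd entry. Real code could get one from
# Etc.getpwnam(username), which exposes the same name/dir/shell fields.
Account = Struct.new(:name, :dir, :shell)

# Build a minimal login-style environment from the account record instead of
# inheriting one. The PATH and TERM defaults here are assumptions.
def login_env(account, term: &amp;quot;xterm&amp;quot;)
  {
    &amp;quot;USER&amp;quot;    =&amp;gt; account.name,
    &amp;quot;LOGNAME&amp;quot; =&amp;gt; account.name,
    &amp;quot;HOME&amp;quot;    =&amp;gt; account.dir,
    &amp;quot;SHELL&amp;quot;   =&amp;gt; account.shell,
    &amp;quot;PATH&amp;quot;    =&amp;gt; &amp;quot;/usr/bin:/bin&amp;quot;,
    &amp;quot;TERM&amp;quot;    =&amp;gt; term,
  }
end

env = login_env(Account.new(&amp;quot;eli&amp;quot;, &amp;quot;/home/eli&amp;quot;, &amp;quot;/bin/bash&amp;quot;))
env.each { |name, value| puts &amp;quot;#{name}=#{value}&amp;quot; }
&lt;/code&gt;&lt;/pre&gt;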

&lt;p&gt;But if you read the code, you&amp;rsquo;ll see that the sshd process also propagates its own env vars into the child shell processes. Where do &lt;em&gt;those&lt;/em&gt; env vars come from?&lt;/p&gt;

&lt;h3 id="3-init"&gt;3. Init&lt;/h3&gt;

&lt;p&gt;All userspace processes in Linux descend from an &lt;code&gt;init&lt;/code&gt; process, which has PID 1 and is the ancestor of all the others. On systems I use, the init process is generally &lt;a href="https://systemd.io/"&gt;systemd&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It looks to me like systemd builds the initial env for a child process from several sources in &lt;a href="https://github.com/systemd/systemd/blob/main/src/core/execute.c#L4806"&gt;systemd:src/core/execute.c&lt;/a&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code class="c"&gt;accum_env = strv_env_merge(params-&amp;gt;environment,
   our_env,
   joined_exec_search_path,
   pass_env,
   context-&amp;gt;environment,
   files_env);
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The &lt;a href="https://www.freedesktop.org/software/systemd/man/systemd.exec.html#Environment%20Variables%20in%20Spawned%20Processes"&gt;systemd man page has more details&lt;/a&gt; on what those different sources are. When running services like sshd, systemd usually prefers to spawn new processes with a blank environment (except for env vars configured for that specific service). But when running interactive user programs, systemd will generally pass through its own environment vars by default. (See &lt;a href="https://github.com/systemd/systemd/blob/1016c8ad94976d7b211a2f002aafac19493823c7/src/core/manager.c#L682"&gt;systemd:src/core/manager.c#manager_default_environment&lt;/a&gt;.)&lt;/p&gt;
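
&lt;p&gt;For a sense of how a service&amp;rsquo;s env vars get configured in practice, here is a small unit file fragment. The service name and paths are hypothetical; the three directives are documented in the man page linked above. My guess (and it is only a guess from reading the code) is that they feed &lt;code&gt;context-&amp;gt;environment&lt;/code&gt;, &lt;code&gt;files_env&lt;/code&gt;, and &lt;code&gt;pass_env&lt;/code&gt; respectively.&lt;/p&gt;

&lt;pre&gt;&lt;code class="config"&gt;# /etc/systemd/system/example.service (hypothetical unit)
[Service]
ExecStart=/usr/local/bin/example

# Set directly in the unit file
Environment=FOO=bar

# Loaded from a file of NAME=value lines; the leading &amp;quot;-&amp;quot; ignores a missing file
EnvironmentFile=-/etc/default/example

# Copied from the service manager&amp;#39;s own environment, if it is set there
PassEnvironment=LANG
&lt;/code&gt;&lt;/pre&gt;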

&lt;p&gt;If even systemd has environment variables, just like every other process, then where do &lt;em&gt;those&lt;/em&gt; come from?&lt;/p&gt;

&lt;h3 id="4-the-linux-kernel"&gt;4. The Linux kernel&lt;/h3&gt;

&lt;p&gt;In the end, they have to come from the kernel. There&amp;rsquo;s nowhere else at this point, right?&lt;/p&gt;

&lt;p&gt;The init process is invoked via the very simple function &lt;a href="https://github.com/torvalds/linux/blob/master/init/main.c#L1426"&gt;run_init_process&lt;/a&gt;. It executes the init process with execve, using a provided set of argv and envp values:&lt;/p&gt;

&lt;pre&gt;&lt;code class="c"&gt;kernel_execve(init_filename, argv_init, envp_init);
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;What is the value of &lt;code&gt;envp_init&lt;/code&gt; here?&lt;/p&gt;

&lt;p&gt;In &lt;code&gt;linux:init/main.c&lt;/code&gt;, we finally find the most basic default values for &lt;a href="https://github.com/torvalds/linux/blob/master/init/main.c#L197"&gt;envp_init&lt;/a&gt;. They are the following:&lt;/p&gt;

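&lt;pre&gt;&lt;code class="c"&gt;/* init/main.c: each entry is a single NAME=value string */
const char *envp_init[MAX_INIT_ENVS+2] = { &amp;quot;HOME=/&amp;quot;, &amp;quot;TERM=linux&amp;quot;, NULL, };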
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;There you are: the default env vars set for a Linux system. They&amp;rsquo;re pretty useless, honestly.&lt;/p&gt;

&lt;p&gt;(These values have long since been overwritten by the time you log in with SSH. In practice, sshd is the top level source of env vars for your interactive sessions with remote systems.)&lt;/p&gt;

&lt;h3 id="5-arguments-to-the-kernel"&gt;5. Arguments to the kernel&lt;/h3&gt;

&lt;p&gt;But there&amp;rsquo;s one last funny detail. It turns out that if you pass env var-like parameters (of the form &lt;code&gt;NAME=value&lt;/code&gt;) &lt;em&gt;on the kernel command line at boot time&lt;/em&gt; (&lt;a href="https://www.freedesktop.org/software/systemd/man/kernel-command-line.html#"&gt;docs&lt;/a&gt;), and the kernel doesn&amp;rsquo;t recognize them as options of its own, they will magically become env vars appended to the default &lt;code&gt;envp_init&lt;/code&gt; values, and then they will be passed down into the &lt;code&gt;init&lt;/code&gt; process (see &lt;a href="https://github.com/torvalds/linux/blob/master/init/main.c#L534"&gt;unknown_bootoption&lt;/a&gt;).&lt;/p&gt;
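
&lt;p&gt;Here is a toy Ruby model of that behavior, as I understand it from reading &lt;code&gt;unknown_bootoption&lt;/code&gt;. It is a simplified sketch, not a translation of the kernel code: the function name &lt;code&gt;split_boot_params&lt;/code&gt; is invented, and it ignores details such as module parameters and the fixed size limits on the arrays.&lt;/p&gt;

&lt;pre&gt;&lt;code class="ruby"&gt;# Defaults modeled on the kernel&amp;#39;s argv_init and envp_init.
DEFAULT_ARGV = [&amp;quot;init&amp;quot;].freeze
DEFAULT_ENVP = [&amp;quot;HOME=/&amp;quot;, &amp;quot;TERM=linux&amp;quot;].freeze

# Hypothetical model: sort boot parameters the kernel did not recognize
# into arguments and env vars for the init process.
def split_boot_params(params)
  argv = DEFAULT_ARGV.dup
  envp = DEFAULT_ENVP.dup
  params.each do |param|
    if param.include?(&amp;quot;=&amp;quot;)
      # NAME=value: replace an existing entry with the same name, else append.
      name = param.split(&amp;quot;=&amp;quot;, 2).first
      index = envp.index { |entry| entry.start_with?(&amp;quot;#{name}=&amp;quot;) }
      if index
        envp[index] = param
      else
        envp.push(param)
      end
    else
      # A bare word becomes an argument to init.
      argv.push(param)
    end
  end
  [argv, envp]
end

argv, envp = split_boot_params([&amp;quot;FOO=bar&amp;quot;, &amp;quot;single&amp;quot;, &amp;quot;TERM=vt100&amp;quot;])
p argv
p envp
&lt;/code&gt;&lt;/pre&gt;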

&lt;p&gt;So in the end, the very distinction between env vars and arguments breaks down. &lt;code&gt;argv&lt;/code&gt; can magically become &lt;code&gt;envp&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;It&amp;rsquo;s unintuitive, but if you think about it, there&amp;rsquo;s no hard categorical distinction between args and env vars in the first place. You can pass values into your program either way, with only minor adjustments to your code. The distinction between the two is largely a matter of convention and semantics.&lt;/p&gt;

&lt;p&gt;An &amp;ldquo;environment&amp;rdquo; is a fundamentally complex thing. It makes sense to me that there&amp;rsquo;s something arbitrary about how we represent it.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>My first day using Docker</title>
    <link rel="alternate" href="/tech/2022/11/docker-first-day.html"/>
    <id>/tech/2022/11/docker-first-day.html</id>
    <published>2022-11-29T21:53:00+00:00</published>
    <updated>2022-11-29T21:53:00+00:00</updated>
    <author>
      <name>Eli Thorkelson</name>
    </author>
    <summary type="html">&lt;p&gt;My current org uses Docker containers heavily in our development environment. For the most part, back end engineers rarely configure the containerized environment. We have other groups that do that for us. There’s a development infrastructure group...&lt;/p&gt;</summary>
    <content type="html">&lt;p&gt;My current org uses Docker containers heavily in our development environment. For the most part, back end engineers rarely configure the containerized environment. We have other groups that do that for us. There&amp;rsquo;s a development infrastructure group, which overlaps somewhat with the larger infrastructure group.&lt;/p&gt;

&lt;p&gt;I get what containers are good for — they get us standardized, repeatable, isolable environments. They make it much easier to keep our development environment in sync with our production infrastructure. And they are a step up, in many ways, from the way I used to do this. The old way was just &amp;ldquo;Install the development environment and all its dependencies on my workstation,&amp;rdquo; which gets old fast, and scales poorly.&lt;/p&gt;

&lt;p&gt;Anyway, this week I wanted to set up a brand new demo environment, so I decided to learn Docker from scratch.&lt;/p&gt;

&lt;p&gt;It took about 6 hours start to finish, including learning how to write a FastCGI process in Ruby. Basically I built a demo project with one NGINX web server container and two back-end application server containers (one running Puma, one running a FastCGI process). Then I used it for some performance testing I wanted to do.&lt;/p&gt;

&lt;p&gt;So these are just some notes on getting started with Docker and Docker Compose.&lt;/p&gt;

&lt;h2 id="how-do-you-learn-your-way-around-docker"&gt;How do you learn your way around Docker?&lt;/h2&gt;

&lt;p&gt;For what it&amp;rsquo;s worth, this was pretty much my approach:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Google &amp;ldquo;how to create a Docker container.&amp;rdquo;&lt;/li&gt;
&lt;li&gt;Figure out which of the existing docs were actually worth reading (reliable, comprehensive, readable, current).&lt;/li&gt;
&lt;li&gt;Set out to create the most basic possible Docker environment: an NGINX container that displayed the default homepage.&lt;/li&gt;
&lt;li&gt;Create a project folder on a Linux dev box that already had Docker tooling installed.&lt;/li&gt;
&lt;li&gt;Make a basic &lt;code&gt;docker-compose.yml&lt;/code&gt; file with one service defined.&lt;/li&gt;
&lt;li&gt;Browse around in our existing work repos to find a suitable base image for the container.&lt;/li&gt;
&lt;li&gt;Try a command like &lt;code&gt;docker-compose up&lt;/code&gt; in my project folder.&lt;/li&gt;
&lt;li&gt;Watch it build.&lt;/li&gt;
&lt;li&gt;Log into the container using &lt;code&gt;docker-compose run&lt;/code&gt; or &lt;code&gt;docker exec -it [container] sh&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Install &lt;code&gt;bash&lt;/code&gt; inside the container, because &lt;code&gt;sh&lt;/code&gt; was mediocre.&lt;/li&gt;
&lt;li&gt;Install &lt;code&gt;vim&lt;/code&gt; to be able to edit the NGINX configuration interactively.&lt;/li&gt;
&lt;li&gt;Figure out how to generate a custom Docker image, by writing a &lt;code&gt;Dockerfile&lt;/code&gt;, which added custom packages and configuration to a given base image. (Learned that docker-compose is for orchestrating containers at runtime, while a Dockerfile governs image building.)&lt;/li&gt;
&lt;li&gt;Fiddle around with NGINX configuration inside the container to ensure that it listened nicely on http/port 80.&lt;/li&gt;
&lt;li&gt;Learn that you can use &lt;code&gt;nginx -s reload&lt;/code&gt; to live-reload the running NGINX settings without restarting the container.&lt;/li&gt;
&lt;li&gt;Read the Docker docs to figure out how to expose a container (on a certain port) to the host. Use port mapping.&lt;/li&gt;
&lt;li&gt;Restart the containerized environment and check that you see the NGINX default homepage at &lt;code&gt;http://localhost:8088&lt;/code&gt; (let&amp;rsquo;s say &lt;code&gt;8088&lt;/code&gt; was the port on the host that pointed to port 80 in the container).&lt;/li&gt;
&lt;li&gt;Put my custom NGINX configuration in a file on the parent host. Use the &lt;code&gt;Dockerfile&lt;/code&gt; to copy it into the container.&lt;/li&gt;
&lt;li&gt;Rebuild the container a few times to make sure it works.&lt;/li&gt;
&lt;li&gt;Make a second &lt;code&gt;app&lt;/code&gt; service in &lt;code&gt;docker-compose.yml&lt;/code&gt;, using a Ruby 2.7.6 image we had lying around.&lt;/li&gt;
&lt;li&gt;Stumble over the question of how to do containerized development in a more exploratory way. (Containers need to have a process running at start time, but when you&amp;rsquo;re doing new development, you might not know how to start your process just yet. I put &lt;code&gt;tail -f /dev/null&lt;/code&gt; as the initial container process, after a handy Stack Overflow tip.)&lt;/li&gt;
&lt;li&gt;Set up a Ruby project inside the app container with a &lt;code&gt;Gemfile&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Realize that most Ruby web server libraries will need C development tools to build. Install them from inside the container (&lt;code&gt;gcc&lt;/code&gt;, &lt;code&gt;build-essential&lt;/code&gt;, etc).&lt;/li&gt;
&lt;li&gt;Pick through some verbose &lt;code&gt;make&lt;/code&gt; output to detect other missing dependencies. Install them too.&lt;/li&gt;
&lt;li&gt;After &lt;code&gt;bundle install&lt;/code&gt; worked manually inside the container, I moved all the dependency setup and the actual &lt;code&gt;bundle install&lt;/code&gt; command into the &lt;code&gt;Dockerfile&lt;/code&gt; for my app service.&lt;/li&gt;
&lt;li&gt;Set up the &lt;code&gt;Dockerfile&lt;/code&gt; to copy my Ruby project onto the container during the build process (&lt;code&gt;COPY ...&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Google how to set up file synchronization between a container and the host file system. I used bind mounts, which is discouraged, since you&amp;rsquo;re supposed to use Docker-managed volumes now, but a bind mount worked just fine for my case. It&amp;rsquo;s configured in &lt;code&gt;docker-compose.yml&lt;/code&gt;, as it&amp;rsquo;s a container &amp;ldquo;runtime&amp;rdquo; feature, rather than a container &amp;ldquo;build&amp;rdquo; feature.&lt;/li&gt;
&lt;li&gt;Spend some time poking around at how Docker does virtualized networking, to try to figure out how to communicate from one container to the next (since NGINX needs to be able to reach the upstream service).&lt;/li&gt;
&lt;li&gt;Try using container IP addresses to communicate (172.16.x.x), but they changed sometimes when I restarted the docker-compose environment. I couldn&amp;rsquo;t readily provision them at container build time, and it seemed hacky to pass them down to NGINX at container runtime, if that is even possible.&lt;/li&gt;
&lt;li&gt;Look in /etc/hosts on the container. Didn&amp;rsquo;t help me.&lt;/li&gt;
&lt;li&gt;Google some questions about Docker networking.&lt;/li&gt;
&lt;li&gt;Realize that I&amp;rsquo;m doing it the suboptimal (basic) way, with bridge networking mode instead of something fancier. No worries there, doesn&amp;rsquo;t matter in this case.&lt;/li&gt;
&lt;li&gt;Read something on Stack Overflow and learn that you can just use the other container&amp;rsquo;s name as a hostname. It Just Works™ because of some custom DNS setup in Docker.&lt;/li&gt;
&lt;li&gt;Update the NGINX config, rebuild the environment.&lt;/li&gt;
&lt;li&gt;OK then why does NGINX still not connect to the upstream?&lt;/li&gt;
&lt;li&gt;Oh right, the upstream web process needs to be listening on a network interface that other containers can reach (e.g. &lt;code&gt;0.0.0.0&lt;/code&gt;) instead of only on localhost.&lt;/li&gt;
&lt;li&gt;Add the correct incantation to the Dockerfile for my &lt;code&gt;app&lt;/code&gt; container, rebuild the environment.&lt;/li&gt;
&lt;li&gt;It works! Now it&amp;rsquo;s time to add a second app container for FastCGI (the first one used Puma) and point NGINX at both of them&amp;hellip;&lt;/li&gt;
&lt;/ol&gt;
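
&lt;p&gt;For reference, the first half of that list (one NGINX service, one app service, port mapping, a bind mount, and a placeholder start command) could boil down to a &lt;code&gt;docker-compose.yml&lt;/code&gt; roughly like the sketch below. This is a reconstruction for illustration, not my actual file: the service names, directory layout, and container paths are all assumptions.&lt;/p&gt;

&lt;pre&gt;&lt;code class="config"&gt;# docker-compose.yml (hypothetical sketch)
services:
  web:
    build: ./nginx            # Dockerfile that copies in the custom NGINX config
    ports:
      - &amp;quot;8088:80&amp;quot;             # host port 8088 maps to port 80 in the container
    depends_on:
      - app
  app:
    build: ./app              # Dockerfile that installs build tools and runs bundle install
    volumes:
      - ./app:/usr/src/app    # bind mount, so edits on the host show up in the container
    command: tail -f /dev/null  # placeholder process while still exploring
&lt;/code&gt;&lt;/pre&gt;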

&lt;p&gt;In the end I had a containerized environment with an NGINX container plus two upstream containers (Puma and FastCGI).&lt;/p&gt;
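
&lt;p&gt;The NGINX side of that setup might look something like the following sketch. The hostnames &lt;code&gt;app&lt;/code&gt; and &lt;code&gt;fcgi&lt;/code&gt; are assumed to be the service names from &lt;code&gt;docker-compose.yml&lt;/code&gt;, and the ports and the &lt;code&gt;/fcgi/&lt;/code&gt; path prefix are made up for the example.&lt;/p&gt;

&lt;pre&gt;&lt;code class="config"&gt;# Hypothetical NGINX config for one Puma upstream and one FastCGI upstream
upstream puma_app {
  server app:9292;          # service name used as a hostname, resolved by Docker
}

server {
  listen 80;

  location / {
    proxy_pass http://puma_app;
  }

  location /fcgi/ {
    include fastcgi_params;
    fastcgi_pass fcgi:9000;
  }
}
&lt;/code&gt;&lt;/pre&gt;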

&lt;p&gt;Then I was able to finish my little demo project, doing some basic performance testing for different Ruby web server processes. (In particular, I was curious about comparative memory usage for Puma, WEBrick, Unicorn and FastCGI-based back end servers. TLDR: FastCGI uses much less runtime memory than any of the alternatives.)&lt;/p&gt;

&lt;h2 id="how-i-didnt-learn-my-way-around-docker"&gt;How I didn&amp;rsquo;t learn my way around Docker&lt;/h2&gt;

&lt;p&gt;Note that I didn&amp;rsquo;t do any of these other possible strategies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run &lt;code&gt;man docker&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Read a technical book about Docker. (I&amp;rsquo;m sure there are good ones.)&lt;/li&gt;
&lt;li&gt;Watch videos about Docker. (I&amp;rsquo;m kind of a text-based person.)&lt;/li&gt;
&lt;li&gt;Ask a colleague for assistance. (I have lots of highly experienced colleagues in this area, but they&amp;rsquo;re all busy and it&amp;rsquo;s fun to teach myself.)&lt;/li&gt;
&lt;li&gt;Use an existing containerized environment as a point of departure, and then customize it. (I built from scratch instead).&lt;/li&gt;
&lt;li&gt;Have a completely clear plan about how the environment needed to work (e.g. networking, volume mounting). (I was OK with not knowing exactly what I was going to do, as long as it worked in general.)&lt;/li&gt;
&lt;li&gt;Use best practices for production-ready containers. (Some of the best practices are too heavy-duty for a basic use case.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To be clear, any of these approaches would have been valid! I just didn&amp;rsquo;t use them.&lt;/p&gt;

&lt;p&gt;I was happy with my very hands-on, iterative, solo approach.&lt;/p&gt;

&lt;h2 id="reflections-on-docker"&gt;Reflections on Docker&lt;/h2&gt;

&lt;p&gt;I dislike the way that Docker can become a black box in my organization, maintained by specialists even though we all use it all day. I courteously dislike that approach, because what Docker does is really just the basics of Linux-based systems administration, organized in a particular way around a particular core abstraction. I think developers should know their way around those things, even if we don&amp;rsquo;t know every detail of a complex dev environment.&lt;/p&gt;

&lt;p&gt;Anyway, once I dug into it, it wasn&amp;rsquo;t that hard to understand Docker because I already knew some basic Linux systems administration things, e.g. about networking, file systems, package management, and OS virtualization. So I just applied what I already knew to the Docker environment, trying to figure out &amp;ldquo;How do I do that here?&amp;rdquo; Once I thought of it that way, it was all relatively easy.&lt;/p&gt;

&lt;p&gt;(It helps that documentation was so easy to find, since Docker is common, well-documented technology.)&lt;/p&gt;

&lt;p&gt;I didn&amp;rsquo;t love some of the inconsistencies between Docker and Docker Compose. I guess they are technically two separate tools, but I wanted them to feel more like an integrated system, instead of having one DSL for one of them and another for the other.&lt;/p&gt;

&lt;p&gt;But I did appreciate how Docker pushes you into an ephemeral, fully declarative environment setup.* With a long running virtual Linux system, even if you use something like Ansible for initial setup, it can be tempting to make custom tweaks to a running environment, ignoring your own configuration management. It&amp;rsquo;s very hard to do this with a containerized environment; you find yourself rebuilding the containers pretty frequently. This causes you to put all the setup in the relevant &lt;code&gt;Dockerfile&lt;/code&gt;, with no cheating.&lt;/p&gt;

&lt;p&gt;(Fortunately, it&amp;rsquo;s still &lt;em&gt;possible&lt;/em&gt; to log into a container and interactively configure it. If you look at my notes above, I frequently started out with &amp;ldquo;How do I do XYZ from a shell inside the container,&amp;rdquo; and only subsequently moved the incantation into the &lt;code&gt;Dockerfile&lt;/code&gt;. This speeds up the dev feedback loop.)&lt;/p&gt;

&lt;p&gt;It was a fun afternoon of digging into this stuff, honestly. It&amp;rsquo;s not every day I learn new things.&lt;/p&gt;

&lt;hr&gt;

&lt;p&gt;&lt;em&gt;* To be precise, a Dockerfile is an imperative build script, but docker-compose wraps it in a declarative configuration system.&lt;/em&gt;&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Were you root?</title>
    <link rel="alternate" href="/tech/2022/10/were-you-root.html"/>
    <id>/tech/2022/10/were-you-root.html</id>
    <published>2022-10-18T16:23:00+00:00</published>
    <updated>2022-10-18T16:23:00+00:00</updated>
    <author>
      <name>Eli Thorkelson</name>
    </author>
    <summary type="html">&lt;p&gt;Back when I was a teenager, and all I had was an old Macintosh to hack on, I used to think it sounded amazing to be a Unix system administrator.&lt;/p&gt;

&lt;p&gt;In particular, I was super excited about being &lt;code&gt;root&lt;/code&gt;. There was such a mystique around that user. I knew...&lt;/p&gt;</summary>
    <content type="html">&lt;p&gt;Back when I was a teenager, and all I had was an old Macintosh to hack on, I used to think it sounded amazing to be a Unix system administrator.&lt;/p&gt;

&lt;p&gt;In particular, I was super excited about being &lt;code&gt;root&lt;/code&gt;. There was such a mystique around that user. I knew it was somehow powerful. Dangerous. I found it mysterious.&lt;/p&gt;

&lt;p&gt;I was so excited about it that I tried typing &lt;code&gt;su&lt;/code&gt; once on my ISP&amp;rsquo;s Unix system. (This was back in the days when ordinary dialup accounts came with Unix shell accounts.) I obviously had no idea what the password was (and I didn&amp;rsquo;t even try to do anything remotely like &amp;ldquo;hacking&amp;rdquo;), but my failed attempt ended up in their logs.&lt;/p&gt;

&lt;p&gt;The ISP was not pleased. They disabled our account.&lt;/p&gt;

&lt;p&gt;My dad had to call up the company owner (I think they had been high school classmates; it was all very small-town) and explain that I was not going to do that again.&lt;/p&gt;

&lt;p&gt;They re-enabled our account.&lt;/p&gt;

&lt;p&gt;(And I never did that again.)&lt;/p&gt;

&lt;hr&gt;

&lt;p&gt;A few years went by. I got a job doing some web software development (mostly Python). I had more legit reasons to play with servers. I learned enough to be dangerous.&lt;/p&gt;

&lt;p&gt;One day, I was at some kind of web training and I wanted to be helpful to the event organizers. So they gave me admin access to their little Debian server and told me to make myself useful.&lt;/p&gt;

&lt;p&gt;I was trying to learn my way around their environment, but I wasn&amp;rsquo;t really too familiar with how it was laid out. I was exploring and I typed &lt;code&gt;cat /dev/something&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;(I didn&amp;rsquo;t know yet that &lt;code&gt;/dev/&lt;/code&gt; is the directory tree that Linux uses to expose system devices as if they were files. Nothing in there is actually a normal file.)&lt;/p&gt;

&lt;p&gt;The whole system froze.&lt;/p&gt;

&lt;p&gt;Whatever I had done was impossible to exit or cancel.&lt;/p&gt;

&lt;p&gt;The shell stopped responding. So did &lt;code&gt;httpd&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Sheepishly, I had to approach the event organizers and explain I had just crashed their webserver.&lt;/p&gt;

&lt;p&gt;&amp;ldquo;What did you do?&amp;rdquo; they asked.&lt;/p&gt;

&lt;p&gt;&amp;ldquo;I ran &lt;code&gt;cat&lt;/code&gt; on some file in &lt;code&gt;/dev/&lt;/code&gt;.&amp;rdquo;&lt;/p&gt;

&lt;p&gt;&amp;ldquo;Were you root?&amp;rdquo;&lt;/p&gt;

&lt;p&gt;&amp;ldquo;Yeah.&amp;rdquo;&lt;/p&gt;

&lt;p&gt;They chuckled and hit the restart button on the web server. Problem solved.&lt;/p&gt;

&lt;hr&gt;

&lt;p&gt;On that day, &lt;code&gt;root&lt;/code&gt; lost its mystique for me.&lt;/p&gt;

&lt;p&gt;It&amp;rsquo;s not exciting to break something that people depend on. It&amp;rsquo;s embarrassing.&lt;/p&gt;

&lt;p&gt;For professional tech people, admin rights on servers are often as much a burden as a privilege.&lt;/p&gt;

&lt;p&gt;If you have root and the system breaks, it&amp;rsquo;s possibly your fault and you probably have to fix it.&lt;/p&gt;

&lt;p&gt;I&amp;rsquo;m not saying that all hierarchies of access are fair or great. We should ask questions about who has power on computer systems.&lt;/p&gt;

&lt;p&gt;But sometimes&amp;hellip; the access controls are guardrails. They can keep the users from breaking things.&lt;/p&gt;

&lt;p&gt;I suppose we all probably have to learn that the hard way.&lt;/p&gt;

&lt;hr&gt;

&lt;p&gt;Around then, I also remember being in a programming class in high school. The teacher had installed some security software on the lab computers. You couldn&amp;rsquo;t modify most of the files on the system.&lt;/p&gt;

&lt;p&gt;I showed my teacher that the security software was inadequate. It only protected against deleting files through the ordinary Macintosh Finder. I demonstrated that you could delete any arbitrary file, including the security system configuration, just by writing a few lines of code.&lt;/p&gt;

&lt;p&gt;The teacher was way ahead of me. He laughed. He wasn&amp;rsquo;t alarmed.&lt;/p&gt;

&lt;p&gt;He said, &amp;ldquo;Look, Eli, the security software isn&amp;rsquo;t really there for &lt;em&gt;you&lt;/em&gt;. I just installed it to stop random kids from renaming the system files to have dirty names.&amp;rdquo;&lt;/p&gt;

&lt;p&gt;I smiled.&lt;/p&gt;

&lt;p&gt;And I see his point even better now than I did then.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Thoughts on URL path routing</title>
    <link rel="alternate" href="/tech/2022/10/thoughts-on-url-path-routing.html"/>
    <id>/tech/2022/10/thoughts-on-url-path-routing.html</id>
    <published>2022-10-07T00:30:00+00:00</published>
    <updated>2022-10-07T00:30:00+00:00</updated>
    <author>
      <name>Eli Thorkelson</name>
    </author>
    <summary type="html">&lt;p&gt;URL path routing is one of those things that gets more interesting the longer you think about it.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(This post is geared toward mid-level web developers. It won’t teach anything to the NGINX core developers.)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;By “URL path routing,” I mean the part...&lt;/p&gt;</summary>
    <content type="html">&lt;p&gt;URL path routing is one of those things that gets more interesting the longer you think about it.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(This post is geared toward mid-level web developers. It won&amp;rsquo;t teach anything to the NGINX core developers.)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;By &amp;ldquo;URL path routing,&amp;rdquo; I mean the part of a web server process that parses incoming HTTP requests, looks at the request path (and the HTTP verb), and ensures that your request is handled by the right handler function for that path.&lt;/p&gt;

&lt;p&gt;The first line of an HTTP request looks like this: &lt;code&gt;GET /path/to/something HTTP/1.1&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;So the question is, how does &lt;code&gt;/path/to/something&lt;/code&gt; get handled?&lt;/p&gt;

&lt;p&gt;(The HTTP/1.1 spec calls this part of the request the &lt;a href="https://datatracker.ietf.org/doc/html/rfc7230#section-3.1.1"&gt;request-target&lt;/a&gt;.
I&amp;rsquo;ll call it the &amp;ldquo;request path&amp;rdquo; here, which is a term widely used in practice.)&lt;/p&gt;

&lt;h2 id="the-file-system-is-an-implicit-router"&gt;The file system is an implicit router&lt;/h2&gt;

&lt;p&gt;If you look at the most basic, old-school, static web server, path routing is largely invisible.&lt;/p&gt;

&lt;p&gt;Every URL path is mapped to a file on a server. You put files into your document root folder, and the web server serves them right back to the users. If your document root is &lt;code&gt;/var/www&lt;/code&gt;, then you can put &lt;code&gt;story.html&lt;/code&gt; into that folder, and it automatically shows up at &lt;code&gt;website.com/story.html&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In this case, request path routing is almost a one to one map onto a directory tree. &lt;em&gt;Almost&lt;/em&gt; one to one, but not completely. There are questions to answer. Options to configure.&lt;/p&gt;

&lt;p&gt;The first special case here is this: What do we return if someone requests a directory instead of a file? Here we find our old friend &lt;code&gt;index.html&lt;/code&gt;, a convention that you see in a basic NGINX configuration like this:&lt;/p&gt;

&lt;pre&gt;&lt;code class="config"&gt;location / {
  index index.html;
  try_files $uri $uri/ =404;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The &lt;code&gt;index&lt;/code&gt; directive tells the NGINX &lt;a href="http://nginx.org/en/docs/http/ngx_http_index_module.html"&gt;index module&lt;/a&gt; that if you request a path ending with a slash, the &lt;code&gt;index.html&lt;/code&gt; file beneath that path will be returned. The &lt;code&gt;try_files&lt;/code&gt; directive says that if you request &lt;code&gt;/folder&lt;/code&gt; with no trailing slash and no such file exists, then the server will check if &lt;code&gt;/folder/&lt;/code&gt; (the corresponding directory) is available instead, and will return a 404 if neither exists.&lt;/p&gt;

&lt;p&gt;I must say that this approach to URL routing is surprisingly effective and scalable, up to a point. You can have folders and subfolders many levels deep. Maybe you can even use symlinks. I&amp;rsquo;ve seen huge archives (tens of thousands of files?) served up from this sort of configuration.&lt;/p&gt;

&lt;p&gt;Back in &lt;a href="https://eev.ee/blog/2012/04/09/php-a-fractal-of-bad-design/"&gt;PHP: a fractal of bad design&lt;/a&gt;, Eevee described filesystem routing as amounting to &amp;ldquo;No Routing,&amp;rdquo; period. But I think I would say, instead, that the file system itself is &lt;em&gt;also&lt;/em&gt; a sort of routing system, which maps file system paths to different physical or virtual storage handlers. In a sense, a static web server just delegates path handling to &lt;em&gt;another&lt;/em&gt; routing system.&lt;/p&gt;

&lt;p&gt;But file-based routing breaks down as soon as you want to respond to a request with something other than the contents of a file.&lt;/p&gt;

&lt;aside class="blog-aside"&gt;

&lt;p class="aside-title"&gt;Fun Fact&lt;/p&gt;

&lt;p&gt;Technically speaking, if you really wanted to, you could expose an arbitrary function as an ordinary Linux file. Theoretically, you could write a whole application, make it show up at &lt;code&gt;/var/www/hello_world.txt&lt;/code&gt;, and let the webserver serve it just like static text. The downside, unfortunately, is that you would have to wrap your application code in a custom kernel module, or maybe a custom device driver, or a &lt;a href="https://en.wikipedia.org/wiki/Filesystem_in_Userspace"&gt;FUSE&lt;/a&gt; integration 🙃.&lt;/p&gt;

&lt;p&gt;I have no idea why you would want to do this. It&amp;rsquo;s just fun that it is theoretically possible to put &lt;i&gt;anything&lt;/i&gt; behind a Linux file path. See &lt;a href="http://web.archive.org/web/20170808011922/https://people.cs.pitt.edu/~jmisurda/teaching/cs449/valerie-henson-device-drivers-hello.pdf"&gt;/proc/hello_world [archive.org]&lt;/a&gt; or a &lt;a href="https://www.cs.nmsu.edu/~pfeiffer/fuse-tutorial/html/index.html"&gt;FUSE tutorial&lt;/a&gt; for some options!&lt;/p&gt;

&lt;/aside&gt;

&lt;h2 id="why-do-we-route-anyway"&gt;Why do we route, anyway?&lt;/h2&gt;

&lt;p&gt;What if you want to send back to a user the result of an arbitrary function?&lt;/p&gt;

&lt;p&gt;(And, let&amp;rsquo;s say, you don&amp;rsquo;t want to do anything as silly as wrap it in a Linux kernel module.)&lt;/p&gt;

&lt;p&gt;Technically, a path routing layer is not even required for a web server. You wouldn&amp;rsquo;t need it for an application that handles every request the exact same way. The most basic &lt;a href="https://rack.github.io/"&gt;Rack application&lt;/a&gt; just looks like this:&lt;/p&gt;

&lt;pre&gt;&lt;code class="ruby"&gt;# config.ru - Version 1
run -&amp;gt; (env) { [200, {}, [&amp;quot;The meaning of life is 42\n&amp;quot;]] }
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;That&amp;rsquo;s just a single function* that always returns the same response, no matter what the input parameters. You can request any &lt;code&gt;path&lt;/code&gt; and this application will return the same output. (*&lt;em&gt;OK OK, technically speaking this is a Ruby stabby lambda, since Ruby doesn&amp;rsquo;t quite have first-class functions.&lt;/em&gt;)&lt;/p&gt;

&lt;p&gt;So now what if you want to handle two paths differently?&lt;/p&gt;

&lt;pre&gt;&lt;code class="ruby"&gt;# config.ru - Version 2
run -&amp;gt; (env) {
  case env[&amp;quot;REQUEST_PATH&amp;quot;]
  when &amp;quot;/secret&amp;quot;
    [401, {}, [&amp;quot;The real meaning of life is a secret\n&amp;quot;]]
  else
    [200, {}, [&amp;quot;Officially, the meaning of life is 42\n&amp;quot;]]
  end
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now we have a handler function that inspects the input and responds differently depending on what path you send it.&lt;/p&gt;

&lt;p&gt;OK, but what if you decide that this is an ugly, un-extensible bit of code? What if you refactor this?&lt;/p&gt;

&lt;p&gt;I&amp;rsquo;ve written test code that looks like this:&lt;/p&gt;

&lt;pre&gt;&lt;code class="ruby"&gt;# config.ru - Version 3
run -&amp;gt; (env) {
  case env[&amp;quot;REQUEST_PATH&amp;quot;]
  when &amp;quot;/config&amp;quot;
    config_response()
  when &amp;quot;/info&amp;quot;
    info_response()
  when &amp;quot;/normal&amp;quot;
    normal_response()
  else
    error_response()
  end
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now you have a function that just maps between request paths and handler functions.&lt;/p&gt;

&lt;p&gt;Congrats, we just reinvented the use case for a routing layer! It&amp;rsquo;s just a standard way of mapping possible inputs onto handler functions. Instead of handling every request with the same function, a path router gives us a layer of indirection so we can send different requests to different places.&lt;/p&gt;

&lt;p&gt;Routing, fundamentally, is a software design pattern, a permutation of what Kwindla Hultman Kramer calls the &lt;a href="https://medium.com/@kwindla/dispatcher-pattern-examples-c950c64440ee"&gt;dispatcher pattern&lt;/a&gt;. It addresses the general problem, &lt;em&gt;How do you map a complex input set to an arbitrarily large set of possible handler functions, when a &lt;code&gt;case&lt;/code&gt; statement is inadequate?&lt;/em&gt; The specific implementations we&amp;rsquo;ll see here are all just possible solutions to this general question.&lt;/p&gt;

&lt;p&gt;You can deduce the need for a routing layer by considering two problems with Version 3.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It&amp;rsquo;s full of hardcoded string constants and hardcoded handler names. What if we wanted to make this function configurable, so you don&amp;rsquo;t have to edit the source code every time you change the handler configuration?&lt;/li&gt;
&lt;li&gt;How do you make this function as performant as possible?&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id="routing-algorithms"&gt;Routing algorithms&lt;/h2&gt;

&lt;p&gt;Considering Version 3, the first question is: what is the performance of a big &lt;code&gt;case&lt;/code&gt; statement?&lt;/p&gt;

&lt;p&gt;Looks like it runs in O(&lt;i&gt;n&lt;/i&gt;) where &lt;i&gt;n&lt;/i&gt; is the number of possible path handlers. Not the best, especially as &lt;i&gt;n&lt;/i&gt; gets larger. And we really want URL routing performance to be optimal, since this is a component of our web server that gets called on every single request, and it&amp;rsquo;s just overhead, taking time before we can even start generating the right response.&lt;/p&gt;

&lt;p&gt;There are various approaches that people use at this point. Here are two common ones:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A routing search tree&lt;/li&gt;
&lt;li&gt;A routing hash table&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let&amp;rsquo;s take a look at the search tree approach first.&lt;/p&gt;

&lt;h2 id="dynamic-routing-1-an-nginx-location-tree"&gt;Dynamic routing 1: An NGINX location tree&lt;/h2&gt;

&lt;p&gt;I&amp;rsquo;ve used NGINX as a general purpose webserver for a while. It&amp;rsquo;s highly configurable; the application is organized around the concept of different &lt;a href="http://nginx.org/en/docs/dev/development_guide.html#Modules"&gt;modules&lt;/a&gt; and multiple &lt;a href="http://nginx.org/en/docs/dev/development_guide.html#http_phases"&gt;phases&lt;/a&gt; of request handling.&lt;/p&gt;

&lt;p&gt;Let&amp;rsquo;s take a quick look at the way that static (non-regex) location routing is handled by the NGINX core &lt;code&gt;http&lt;/code&gt; module.&lt;/p&gt;

&lt;p&gt;The route handlers you see in something like NGINX are usually only accessible indirectly, in normal use cases. You don&amp;rsquo;t write a route mapping function like the one we wrote above; its functionality has been abstracted into an operation on a configuration tree. You don&amp;rsquo;t, directly, write a handler function for specific routes either; you invoke an NGINX module that handles some specific path that might, then, pass along a request to your actual application code. All you need to do as an ordinary developer is to generate the needed configuration data, which tells NGINX which modules should handle which paths, and with which configuration options.&lt;/p&gt;

&lt;aside class="blog-aside"&gt;

&lt;p class="aside-title"&gt;Fun Fact&lt;/p&gt;

&lt;p&gt;Technically, it is also possible to write ad hoc handler functions in NGINX configuration, if you really want to do this. You can abuse &lt;code&gt;if&lt;/code&gt; and &lt;code&gt;return&lt;/code&gt; from the &lt;a href="http://nginx.org/en/docs/http/ngx_http_rewrite_module.html#return"&gt;rewrite module&lt;/a&gt; to build arbitrary logic:&lt;/p&gt;

&lt;pre&gt;&lt;code class="nginx"&gt;if ($the_world_is_round) {
  return 200 "Welcome to a heliocentric universe, ${user_name}!";
}
return 500 "Hello ${user_name}, did you still believe in geocentrism?";
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;There are also modules that implement full-fledged scripting languages inside NGINX, such as &lt;a href="http://nginx.org/en/docs/njs/index.html"&gt;njs&lt;/a&gt;. You could develop arbitrarily complex code this way, though it&amp;rsquo;s probably a terrible idea, and I&amp;rsquo;ve never seen that strategy used in practice. We usually do configuration in NGINX, and then put all our business logic in some other framework upstream of it.&lt;/p&gt;

&lt;/aside&gt;

&lt;p&gt;For path routing purposes, each separate configuration node in NGINX is called a &lt;code&gt;location&lt;/code&gt;. A location can be defined as a prefix string, an exact string match, or a regular expression. The general strategy is to find the best possible prefix/exact match first and then test the regular expressions, stopping at the first one that matches. (The regular expressions are all tested in the order they appear in the configuration file; I don&amp;rsquo;t know how to optimize a search through a list of arbitrary regular expressions.)&lt;/p&gt;
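
&lt;p&gt;As a rough illustration of that matching order, here is a minimal Ruby sketch. It is not NGINX&amp;rsquo;s implementation: the three tables, the handler names, and the &lt;code&gt;match_location&lt;/code&gt; helper are made up for the example, and it ignores details such as the &lt;code&gt;^~&lt;/code&gt; modifier.&lt;/p&gt;

```ruby
# Hypothetical location tables; the handler names are placeholders.
EXACT_LOCATIONS = { "/" => "home" }.freeze
PREFIX_LOCATIONS = {
  "/" => "default",
  "/images/" => "static",
  "/images/icons/" => "icons"
}.freeze
# Regexes live in an ordered list because the first match wins.
REGEX_LOCATIONS = [
  [/\.php\z/, "php"],
  [/\.(gif|jpg)\z/, "image"]
].freeze

def match_location(path)
  # 1. An exact match ends the search immediately.
  return EXACT_LOCATIONS[path] if EXACT_LOCATIONS.key?(path)

  # 2. Remember the longest matching prefix as a fallback.
  best_prefix = PREFIX_LOCATIONS.keys
                                .select { |prefix| path.start_with?(prefix) }
                                .max_by { |prefix| prefix.length }

  # 3. Try the regexes in configuration order; the first hit wins.
  regex_hit = REGEX_LOCATIONS.find { |pattern, _handler| pattern.match?(path) }
  return regex_hit.last if regex_hit

  # 4. Otherwise fall back to the remembered prefix (nil if none matched).
  PREFIX_LOCATIONS[best_prefix]
end
```

&lt;p&gt;The regex step is the part that stays linear: every pattern may have to be tried before falling back to the prefix match.&lt;/p&gt;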

&lt;p&gt;NGINX parses its static (non-regex) locations into a &lt;a href="https://github.com/nginx/nginx/blob/3aef1d6/src/http/ngx_http_core_module.h#L458"&gt;ternary search tree&lt;/a&gt; with this kind of structure:&lt;/p&gt;

&lt;pre&gt;&lt;code class="txt"&gt;         [root element]    &amp;lt;== represented by `/`
                |
                |
               egg
   box -----/   |    \------ house
  /   \         |           /     \
 bat   cat      |        glass    plate
                |
              plant
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This would be a root with seven locations nested beneath it (&lt;code&gt;/bat&lt;/code&gt;, &lt;code&gt;/box&lt;/code&gt;, &lt;code&gt;/cat&lt;/code&gt;, &lt;code&gt;/egg&lt;/code&gt;, &lt;code&gt;/house&lt;/code&gt;, &lt;code&gt;/glass&lt;/code&gt;, and &lt;code&gt;/plate&lt;/code&gt;), and one child location, &lt;code&gt;/egg/plant&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;It&amp;rsquo;s sorted lexicographically with lower values to the left, higher ones to the right, and child trees (termed &lt;code&gt;inclusive&lt;/code&gt;) in the middle. The parent of each subtree is set to the middle element of the relevant subset of location blocks. It all descends from the root location.&lt;/p&gt;

&lt;p&gt;Looks like the search time here is O(&lt;i&gt;log n&lt;/i&gt;) for &lt;i&gt;n&lt;/i&gt; sibling location blocks, so we can see why they use this structure. The parent-child relations are just a linear search as far as I know; it will collapse into O(&lt;i&gt;n&lt;/i&gt;) in the pathological case where you have only one gigantic parent-child structure (say a single folder tree 50 levels deep, with no sibling folders).&lt;/p&gt;
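
&lt;p&gt;To make the lookup concrete, here is a simplified Ruby sketch of the example tree. It is only an approximation under stated assumptions: each node holds one whole path segment, whereas NGINX (as far as I understand it) compares string prefixes, and the &lt;code&gt;LocationNode&lt;/code&gt;, &lt;code&gt;find_location&lt;/code&gt; and &lt;code&gt;longest_location&lt;/code&gt; names are invented for the example.&lt;/p&gt;

```ruby
# Simplified, hypothetical node: one path segment per node.
# left/right are siblings that sort lower/higher; children is the
# subtree of nested locations (the "inclusive" branch in the diagram).
LocationNode = Struct.new(:name, :left, :right, :children)

def leaf(name)
  LocationNode.new(name, nil, nil, nil)
end

# The example tree from the diagram, built by hand.
TREE = LocationNode.new(
  "egg",
  LocationNode.new("box", leaf("bat"), leaf("cat"), nil),
  LocationNode.new("house", leaf("glass"), leaf("plate"), nil),
  leaf("plant")
)

# Returns the segments of the longest configured location matching the path.
def find_location(node, segments, matched = [])
  return matched if node.nil? || segments.empty?

  segment = segments.first
  if segment == node.name
    # Matched this level: consume the segment and descend into the children.
    find_location(node.children, segments.drop(1), matched + [segment])
  elsif segment > node.name
    find_location(node.right, segments, matched)
  else
    find_location(node.left, segments, matched)
  end
end

def longest_location(path)
  segments = path.split("/").reject { |part| part.empty? }
  "/" + find_location(TREE, segments).join("/")
end
```

&lt;p&gt;Each sibling comparison discards half of the remaining siblings, while each matched segment moves one level down the middle branch, which is where the linear parent-child cost comes from.&lt;/p&gt;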

&lt;p&gt;Once NGINX finds the correct location block, the location block (and other configuration) will invoke the relevant NGINX modules to build and return a response, handle headers, check authentication, and so on.&lt;/p&gt;

&lt;p&gt;(You can learn a lot more about how this works by compiling NGINX with &lt;code&gt;--with-debug&lt;/code&gt; and setting the &lt;code&gt;debug&lt;/code&gt; log level; it then will report precisely how it searches through the static location tree.)&lt;/p&gt;

&lt;h2 id="dynamic-routing-2-a-drupal-url-alias-table"&gt;Dynamic routing 2: A Drupal URL Alias table&lt;/h2&gt;

&lt;p&gt;Meanwhile, consider an alternative approach that I&amp;rsquo;ve also seen in the wild: the routing &lt;em&gt;table&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The routing table is most viable &lt;em&gt;if you are using uniform request handling functions with no dynamic path inspection or parent-child directory hierarchy&lt;/em&gt;. Suppose your routing table looks like this:&lt;/p&gt;

&lt;pre&gt;&lt;code class="config"&gt;| path       | handler_function | id |
|------------|------------------|----|
| /about     | page             | 1  |
| /company   | page             | 2  |
| /faq       | page             | 3  |
| /contact   | form             | 1  |
| /buy       | form             | 2  |
| /buy/now   | form             | 3  |
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;You only have two handler functions, &lt;code&gt;page&lt;/code&gt; and &lt;code&gt;form&lt;/code&gt;, so all your router has to do is build this hash table, use &lt;code&gt;path&lt;/code&gt; for the keys, and then call the named type handler function with the given &lt;code&gt;id&lt;/code&gt; parameter. Lookup is theoretically O(&lt;i&gt;1&lt;/i&gt;), it&amp;rsquo;s perfect! You could use such an architecture to map URL paths to files, to database rows, to anything with functions and arguments!&lt;/p&gt;
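
&lt;p&gt;Here is a minimal Ruby sketch of that idea, using the table above. The handler bodies are placeholders and the &lt;code&gt;ROUTES&lt;/code&gt;, &lt;code&gt;HANDLERS&lt;/code&gt; and &lt;code&gt;route&lt;/code&gt; names are just assumptions for illustration.&lt;/p&gt;

```ruby
# Hypothetical handlers; each takes the id stored in the routing table
# and returns a Rack-style [status, headers, body] triple.
HANDLERS = {
  "page" => ->(id) { [200, {}, ["page #{id}\n"]] },
  "form" => ->(id) { [200, {}, ["form #{id}\n"]] }
}.freeze

# The routing table: path => [handler name, id].
ROUTES = {
  "/about" => ["page", 1],
  "/company" => ["page", 2],
  "/faq" => ["page", 3],
  "/contact" => ["form", 1],
  "/buy" => ["form", 2],
  "/buy/now" => ["form", 3]
}.freeze

def route(path)
  # One hash lookup, no matter how many routes exist.
  handler_name, id = ROUTES[path]
  return [404, {}, ["Not found\n"]] if handler_name.nil?

  HANDLERS.fetch(handler_name).call(id)
end
```

&lt;p&gt;Note that &lt;code&gt;/buy/now&lt;/code&gt; is just another key here; the table has no idea that it sits &amp;ldquo;under&amp;rdquo; &lt;code&gt;/buy&lt;/code&gt;.&lt;/p&gt;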

&lt;p&gt;What&amp;rsquo;s the big use case for a routing table?&lt;/p&gt;

&lt;p&gt;In short: &lt;em&gt;User contributed path structures.&lt;/em&gt; You don&amp;rsquo;t want end users to have to write a Ruby method. You don&amp;rsquo;t want to let them anywhere near an NGINX configuration file. You just want to show them a text field: &amp;ldquo;What path should this page have?&amp;rdquo;&lt;/p&gt;

&lt;p&gt;In that context, a routing table is cheap; it&amp;rsquo;s safe; and it&amp;rsquo;s flexible without needing code or infrastructure changes.&lt;/p&gt;

&lt;p&gt;Drupal 7 does something like this with a &lt;code&gt;url_alias&lt;/code&gt; table (see &lt;a href="https://api.drupal.org/api/drupal/modules%21system%21system.install/function/system_schema/7.x"&gt;system schema&lt;/a&gt;). This is just a relational database table with an alias column and then some other columns telling the system what to route to. Users get to specify the &lt;code&gt;alias&lt;/code&gt; value. If they want it to look like a directory tree, they can just put slashes into that value. It&amp;rsquo;s very brittle because end users usually don&amp;rsquo;t do a good job of maintaining a tree-like structure. (Predictably, there is an additional module you can use to auto-populate these values, helpfully named &lt;code&gt;pathauto&lt;/code&gt;.)&lt;/p&gt;

&lt;p&gt;This routing table can&amp;rsquo;t be a hash table in memory, because unlike NGINX, Drupal is a PHP application that doesn&amp;rsquo;t persist configuration data between requests. So it&amp;rsquo;s still a cheap indexed database lookup (closer to O(&lt;i&gt;log n&lt;/i&gt;) than O(&lt;i&gt;1&lt;/i&gt;) if the index is a B-tree), but it&amp;rsquo;s a database query per request to figure out the right path. I&amp;rsquo;m pretty sure it&amp;rsquo;s cached, at least, so it is probably quick when the cache is warm.&lt;/p&gt;

&lt;p&gt;The actual schema looks like this:&lt;/p&gt;

&lt;pre&gt;&lt;code class="config"&gt;| pid | alias     | source       | lang |
|-----|-----------|--------------|------|
| 100 | about     | node/1       | en   |
| 101 | home      | node/2       | en   |
| 102 | faq       | node/3       | en   |
| 103 | contact   | form/1       | en   |
| 104 | buy       | form/2       | und  |
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;With indexes:&lt;/p&gt;

&lt;pre&gt;&lt;code class="config"&gt;index alias_language_pid on (&amp;#39;alias&amp;#39;, &amp;#39;language&amp;#39;, &amp;#39;pid&amp;#39;)
index source_language_pid on (&amp;#39;source&amp;#39;, &amp;#39;language&amp;#39;, &amp;#39;pid&amp;#39;)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Note that it includes a locale parameter &lt;code&gt;lang&lt;/code&gt; (so you can route the same paths differently in different locales). It has bidirectional indexes (you can look up both by &lt;code&gt;alias&lt;/code&gt; and by &lt;code&gt;source&lt;/code&gt;). And the &lt;code&gt;source&lt;/code&gt; field points, curiously, not to a function but to an &lt;em&gt;internal path&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;It&amp;rsquo;s all a bit funky. It turns out that Drupal&amp;rsquo;s url alias system is bolted on top of a whole separate routing system, the strangely termed &amp;ldquo;menu&amp;rdquo; system which handles Drupal&amp;rsquo;s internal paths. So the url alias table just tells Drupal what system path should handle a given request, and then the menu system figures out how to actually call the right handler (&lt;code&gt;node&lt;/code&gt; or &lt;code&gt;form&lt;/code&gt; in this case).&lt;/p&gt;

&lt;p&gt;You have to register internal URL handlers by providing &amp;ldquo;configuration&amp;rdquo; in PHP code like this one for the &lt;a href="https://github.com/drupal/drupal/blob/7.x/modules/node/node.module#L2081"&gt;node module&lt;/a&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code class="php"&gt;  $items[&amp;#39;node/%node&amp;#39;] = array(
    &amp;#39;title callback&amp;#39; =&amp;gt; &amp;#39;node_page_title&amp;#39;,
    &amp;#39;title arguments&amp;#39; =&amp;gt; array(1),
    &amp;#39;page callback&amp;#39; =&amp;gt; &amp;#39;node_page_view&amp;#39;,
    &amp;#39;page arguments&amp;#39; =&amp;gt; array(1),
    &amp;#39;access callback&amp;#39; =&amp;gt; &amp;#39;node_access&amp;#39;,
    &amp;#39;access arguments&amp;#39; =&amp;gt; array(&amp;#39;view&amp;#39;, 1),
  );
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Believe it or not, this PHP configuration then gets persisted in yet another database table, &lt;code&gt;menu_router&lt;/code&gt;. The &lt;a href="https://api.drupal.org/api/drupal/includes%21menu.inc/function/_menu_router_save/7.x"&gt;full implementation&lt;/a&gt; uses 25 columns of settings and arguments in addition to the path spec itself, and it&amp;rsquo;s not very clearly designed, since &amp;ldquo;menus&amp;rdquo; in Drupal are an awkward mashup of a path routing system with UI navigation menu configuration.&lt;/p&gt;
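
&lt;p&gt;Putting the two layers together, the lookup is roughly a two-step resolution. The Ruby sketch below only models the idea: the in-memory hashes stand in for the &lt;code&gt;url_alias&lt;/code&gt; and &lt;code&gt;menu_router&lt;/code&gt; tables, the returned strings stand in for real callbacks (&lt;code&gt;form_page_view&lt;/code&gt; is invented), and &lt;code&gt;resolve&lt;/code&gt; is a hypothetical helper, not a Drupal function.&lt;/p&gt;

```ruby
# Hypothetical stand-in for url_alias: [alias, language] => internal path.
URL_ALIASES = {
  ["about", "en"] => "node/1",
  ["contact", "en"] => "form/1"
}.freeze

# Hypothetical stand-in for menu_router: path pattern => handler.
MENU_ROUTER = {
  "node/%" => ->(id) { "node_page_view(#{id})" },
  "form/%" => ->(id) { "form_page_view(#{id})" }
}.freeze

def resolve(alias_path, language)
  # Step 1: alias -> internal path. A request that has no alias is
  # treated as an internal path already.
  source = URL_ALIASES.fetch([alias_path, language], alias_path)

  # Step 2: internal path -> registered pattern -> handler.
  type, id = source.split("/", 2)
  handler = MENU_ROUTER["#{type}/%"]
  handler ? handler.call(id) : "404"
end
```

&lt;p&gt;So &lt;code&gt;resolve("about", "en")&lt;/code&gt; and &lt;code&gt;resolve("node/1", "en")&lt;/code&gt; end up at the same handler, which is the whole point of an alias.&lt;/p&gt;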

&lt;p&gt;I&amp;rsquo;m not going to go any farther into the weeds here. Drupal 7 is generally agreed to have really ugly technical architecture. I will say that this whole awkward system actually does work in production, and has powered innumerable large websites. I used to help maintain the website for a billion-dollar-a-year home building company built mainly on Drupal 7. It was a huge mess of spaghetti code too, but that wasn&amp;rsquo;t even the fault of the framework itself. And it lets non-technical users easily generate their own URL path structures. That&amp;rsquo;s arguably a big core feature for a CMS.&lt;/p&gt;

&lt;h2 id="the-daily-wtf-of-routers"&gt;The Daily WTF of routers&lt;/h2&gt;

&lt;p&gt;Here&amp;rsquo;s another fun thing I found at a previous place I worked:&lt;/p&gt;

&lt;p&gt;A homegrown content management system built on Ruby on Rails where URL routing was &lt;em&gt;really, unusably slow&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;It turned out to be implemented with a recursive path lookup function that generated &lt;em&gt;n&lt;/em&gt; x &lt;em&gt;m&lt;/em&gt; database queries to produce the routing table, where &lt;em&gt;n&lt;/em&gt; was the number of pages and &lt;em&gt;m&lt;/em&gt; was the depth of the routing tree.&lt;/p&gt;

&lt;p&gt;To make matters worse, this was for a multisited system that didn&amp;rsquo;t assume that the site root started with &amp;ldquo;/&amp;rdquo;.&lt;/p&gt;

&lt;p&gt;The algorithm was something like this:&lt;/p&gt;

&lt;pre&gt;&lt;code class="ruby"&gt;
class Router
  def route(request_path)
    # Note: This cache worked OK in prod, but was disabled in development:
    routes = Rails.cache.fetch(&amp;quot;page_routes&amp;quot;) do
      Page.all.map {|page| [page.id, page.path] }
    end

    routes.find { |route| request_path == route }
  end
end

class Page &amp;lt; ActiveRecord::Base
  belongs_to :parent, class_name: &amp;quot;Page&amp;quot;, optional: true

  def path(site_id)
    if parent
      # This does a recursive lookup of parent paths,
      # and loading each parent is a separate db call:
      [self.filename, parent.path(site_id)].join(&amp;quot;/&amp;quot;)
    else
      [self.root_path(site_id), self.filename].join(&amp;quot;/&amp;quot;)
    end
  end

  # this is also a database call:
  def self.root_path(site_id)
    self.find_by(site_id: site_id, is_site_root: true)
  end
end
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In development, the whole routing table was reloaded on every page load. It ended up causing huge delays during development on large sites.&lt;/p&gt;

&lt;p&gt;I improved the performance of this by roughly 50% with some basic improvements, like not looking up the &lt;code&gt;root_path&lt;/code&gt; again for each route. I then suggested shifting the design to use a Drupal-like routing table in the database. However, I believe they may have since abandoned the whole product, which would have rendered any further architectural improvements irrelevant.&lt;/p&gt;
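
&lt;p&gt;For what it&amp;rsquo;s worth, the per-page parent queries can be avoided entirely by loading every row once and resolving the paths in memory. This is only a sketch of that idea, not the code I actually shipped: &lt;code&gt;PageRow&lt;/code&gt; is a hypothetical struct standing in for the real model, and &lt;code&gt;build_paths&lt;/code&gt; and &lt;code&gt;root_prefix&lt;/code&gt; are names made up for the example.&lt;/p&gt;

```ruby
# Hypothetical row shape standing in for the pages table.
PageRow = Struct.new(:id, :parent_id, :filename)

# Builds a { path => page id } routing table from rows loaded in one query.
def build_paths(rows, root_prefix)
  by_id = rows.to_h { |row| [row.id, row] }
  cache = {}

  # Memoized recursion: each page's path is computed at most once,
  # and parents are found in the in-memory index, not the database.
  resolve = lambda do |row|
    cache[row.id] ||= begin
      parent = by_id[row.parent_id]
      base = parent ? resolve.call(parent) : root_prefix
      "#{base}/#{row.filename}"
    end
  end

  rows.to_h { |row| [resolve.call(row), row.id] }
end
```

&lt;p&gt;The result is a plain hash keyed by path, which is essentially the routing table from the Drupal section.&lt;/p&gt;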

&lt;p&gt;Sometimes the products cease to exist long before they can be improved.&lt;/p&gt;

&lt;p&gt;Well, we all know the problems with premature optimization.&lt;/p&gt;

&lt;h2 id="conclusions"&gt;Conclusions&lt;/h2&gt;

&lt;p&gt;Back in the day, I used to understand web servers as being fundamentally document-based. Static Apache website style. Request a path; it matches a document in a directory tree on a disk; and the document gets sent back to you.&lt;/p&gt;

&lt;p&gt;But you can learn a lot from decoupling your understanding of a web server that speaks HTTP from the concept of a document. In a more abstract way, you can think of an HTTP request as just being a function call with some input parameters. And one of the parameters just happens to be something we call &amp;ldquo;a path.&amp;rdquo; It&amp;rsquo;s just hard to think of it this way when you start out by staring at the configuration layer of something like NGINX. The &amp;ldquo;functional&amp;rdquo; part of it is deeply buried by that point.&lt;/p&gt;

&lt;p&gt;What&amp;rsquo;s more interesting is the huge set of design tradeoffs you can make here. What&amp;rsquo;s the ratio between convention, configuration, and pure flexibility? How much technical expertise will be needed to create a new route? Will your user want to write a request handler from scratch? Will they want to use a DSL to configure routing without needing to execute absolutely arbitrary code? Or will you make routing that&amp;rsquo;s so simple that even a nontechnical user can use it, as in a CMS?&lt;/p&gt;

&lt;p&gt;You can see here how each framework draws the lines differently, and then it&amp;rsquo;s just up to the users to work with the constraints — or struggle against them.&lt;/p&gt;

&lt;h2 id="further-reading"&gt;Further reading&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;NGINX&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/nginx/nginx/blob/1fecec0cbf1554c0473d5cca0fb55f8dc006e4ba/src/http/ngx_http.c#L799"&gt;NGINX http location config&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.aosabook.org/en/nginx.html"&gt;Another overview of NGINX architecture&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://nginx.org/en/docs/dev/development_guide.html#http_request"&gt;NGINX request dev guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://nginx.org/en/docs/http/request_processing.html"&gt;NGINX request processing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://notes.eatonphil.com/learning-a-new-codebase-hacking-nginx.html"&gt;Notes on the NGINX codebase&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Drupal 7&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/drupal/drupal/blob/7.x/includes/menu.inc"&gt;menu.inc&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/drupal/drupal/blob/7.x/modules/system/system.install#L1738"&gt;url_alias schema&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/drupal/drupal/blob/7.x/modules/system/system.install#L1112"&gt;menu_router&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://api.drupal.org/api/drupal/includes%21path.inc/function/drupal_lookup_path/7.x"&gt;drupal_lookup_path&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/drupal/drupal/blob/7.x/includes/path.inc"&gt;path module&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Ruby on Rails&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The url path router in Ruby on Rails is pretty interesting to read about. It has developed quite a bit over time:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/rails/rails/tree/main/actionpack/lib/action_dispatch/journey"&gt;action_dispatch/journey&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/rails/rails/blame/2-0-stable/actionpack/lib/action_controller/routing.rb"&gt;git blame action_controller/routing.rb&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog1.westagilelabs.com/journey-into-rails-routing-7fbeb8ceb833"&gt;Journey Into Rails Routing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/stevegraham/rails/pull/1"&gt;A failed proposal to upgrade the Rails router in 2014&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/fusion-of-thoughts/revamped-routes-in-rails-3-b6d00654e5b0"&gt;Thoughts on the Rails 3 routing upgrade&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://yehudakatz.com/2009/12/26/the-rails-3-router-rack-it-up/"&gt;Yehuda Katz on the Rails 3 router&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/rails/journey/commit/c0d305f273bf89fdd333d02e69f54aa453428b43"&gt;Journey by tenderlove&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/rails/rails/compare/9aa7c25c2832...5fa65f94b8c1"&gt;Some performance improvements by tenderlove&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/rails/rails/commit/b20c575ac02373723438468932ceddd97056c9ec"&gt;A new routing implementation for Rails 2.0&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/rails/rails/commit/b1999be5a7efd67e2602c37ed898aa8433661863#diff-c7a8cf03eb748d513a4634561e3c843076959aa2fd3f78d8f77b01c710ceca1a"&gt;Rails 1.2 routing commit&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Last tidbit&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A C-based routing library: &lt;a href="https://github.com/c9s/r3"&gt;github.com/c9s/r3&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content>
  </entry>
  <entry>
    <title>The hacker spirit</title>
    <link rel="alternate" href="/tech/2022/10/the-hacker-spirit.html"/>
    <id>/tech/2022/10/the-hacker-spirit.html</id>
    <published>2022-10-03T15:47:00+00:00</published>
    <updated>2022-10-03T15:47:00+00:00</updated>
    <author>
      <name>Eli Thorkelson</name>
    </author>
    <summary type="html">&lt;p&gt;I have a PhD in cultural anthropology and I’m a software engineer. How did
that happen?&lt;/p&gt;

&lt;p&gt;I guess I’ve had a bit of the hacker spirit for a long time. It was part of
the culture I was raised with. Building things. Stumbling into new places.
Asking questions...&lt;/p&gt;</summary>
    <content type="html">&lt;p&gt;I have a PhD in cultural anthropology and I&amp;rsquo;m a software engineer. How did
that happen?&lt;/p&gt;

&lt;p&gt;I guess I&amp;rsquo;ve had a bit of the hacker spirit for a long time. It was part of
the culture I was raised with. Building things. Stumbling into new places.
Asking questions. Being skeptical, but not just in a negative way: skepticism
can be a very hopeful gesture.&lt;/p&gt;

&lt;p&gt;Exploring. That&amp;rsquo;s what it&amp;rsquo;s about for me: exploring.&lt;/p&gt;

&lt;p&gt;It seems like I have had an unusual path through tech, compared to everyone
who was a CS major in college and worked in tech their whole career. Here are
some notes on how that happened.&lt;/p&gt;

&lt;hr&gt;

&lt;p&gt;I was a kid in the late 20th century. When I was little we had a Macintosh
SE (when my dad did graphic design) and an old PC with DOS. I wrote some
little BASIC programs and (believe it or not) some &lt;a href="https://en.wikipedia.org/wiki/HyperCard"&gt;HyperCard stacks&lt;/a&gt;.
I think the first machine I owned was a Macintosh Performa. I spent a while
trying to learn native GUI programming at that point, using a thick reference
manual for Apple interface building. I think the only thing I ever finished
was a screen saver demo animation.&lt;/p&gt;

&lt;p&gt;Computers were only one of the technical systems I used to like. For a few
years, I was in love with video production. At that point the professional
gear was still largely analog. I spent some time in TV studios — they were
little, but run by professionals. I crewed some educational broadcasts that
went out on satellite; I was an intern at a local cable company for a year; I
went on some multicamera shoots, with a van on location. I never tried to do
it for a living, not even close. I just loved being around the technology,
the visual design part too, framing shots just the right way, cutting clips
at just the right moment. I played with editing gear and I made a trippy
video of my own about the alienating landscape of my high school.&lt;/p&gt;

&lt;p&gt;Then I got deep into lighting for theatres. I worked in summer theatre as a
stage electrician; I ran a followspot one season; I climbed a lot of ladders
and catwalks, lugged around a lot of gear, and worked late nights for free or
really bad pay. I had a lighting design teacher from the local university,
which had an MFA drama school. He brought me as an assistant to one of his
professional gigs, doing lighting design for an opera. I learned how to
design lights for a show, how to run the lighting console, how to plan the
logistics.&lt;/p&gt;

&lt;p&gt;There&amp;rsquo;s a lot of hacker spirit in theatres. You&amp;rsquo;re building things that you
just dreamed up. You&amp;rsquo;re running at the very limits of your capacities. It&amp;rsquo;s
a wild place.&lt;/p&gt;

&lt;p&gt;Meanwhile, I was taking some computer science classes — mainly C style
languages with object oriented features thrown in, as used to be the rage in
the 1990s. I must have done a year of C, a semester of C++, and a semester of
Java. We did quicksort and I wrote a really basic web crawler
implementation. Meh.&lt;/p&gt;

&lt;p&gt;I just didn&amp;rsquo;t love CS classes. Whatever the hacker spirit is, they didn&amp;rsquo;t have
enough of it. They were taught in a pretty rote, &amp;ldquo;memorize this&amp;rdquo; way. They
were dull. The intro ones were all a little too easy. And they didn&amp;rsquo;t help
me (when I was 18 or 19) figure out the answers to the big existential
questions that I desperately wanted to figure out.&lt;/p&gt;

&lt;p&gt;So I ended up studying a humanities field, cultural anthropology, that was a
lot better at big philosophical questions than anything I found in STEM.&lt;/p&gt;

&lt;p&gt;Around the same time, I found out I could get paid to build software without
finishing a CS degree.&lt;/p&gt;

&lt;hr&gt;

&lt;p&gt;That was the beginning of a long, meandering period where I learned tech on
the job.&lt;/p&gt;

&lt;p&gt;I started to work for a language laboratory in college, Cornell&amp;rsquo;s Language
Resource Center. I learned some Python, which powered our then-fancy Zope
platform. I wrote online quiz software for language learners (we weren&amp;rsquo;t
using commercial learning management systems in those days). Before long
we needed non-document-based data storage, so I set up a MySQL
instance, made it talk to Python, and learned something about normalized
database schemas.&lt;/p&gt;

&lt;p&gt;Then I went straight to grad school in cultural anthropology and didn&amp;rsquo;t write
much code for a few years.&lt;/p&gt;

&lt;p&gt;After I finished the field research part of grad school, I got back into web
programming at the University of Chicago, in the IT group for the humanities
graduate school. At first I worked on their public-facing websites
(mainly Drupal); I remember building a custom event planning module for a big
annual event. Next door, there was a web applications programmer who was
building internal administrative software in Ruby on Rails, which sounded
more exciting. He gave me a crash course in Ruby, in the MVC pattern, and in
test-driven development. Soon he left for a startup, and I got hired into his
position.&lt;/p&gt;

&lt;p&gt;I found myself going every day to the office and sitting at a big software
development setup with a bunch of monitors.&lt;/p&gt;

&lt;hr&gt;

&lt;p&gt;What I loved about my first full-time software development job was that I had
so much to explore.&lt;/p&gt;

&lt;p&gt;The culture of technology was wild around then (~2012). In academia, things
move really slowly, but in tech, there was constant flux, shifting trends,
unstable new projects. The &amp;ldquo;new Javascript framework every month&amp;rdquo; thing was
taking hold. A lot of history was happening, somehow.&lt;/p&gt;

&lt;p&gt;My boss told me once that I had a particular skill: it wasn&amp;rsquo;t just that I
could write code, it was also that I could take a preliminary set of
requirements — usually not very clear ones — and then build working systems
from that starting place, pretty much all by myself, without needing
micromanaging. I have to say, I enjoyed the autonomy I had.&lt;/p&gt;

&lt;p&gt;I was building administrative applications in Ruby on Rails and Javascript. So
we had clients, but they were only internal clients. I got a salary and we
were free from commercial pressures.&lt;/p&gt;

&lt;p&gt;I built our testing infrastructure up from almost nothing. I built a realtime
dashboard app to monitor activity on our products. I got a lot of practice
triaging production exceptions (which ones are urgent? which ones can wait a
little bit? do we need extra logging or debugging?). I worked on performance,
on authentication systems, on database design. I built lots of things.&lt;/p&gt;

&lt;p&gt;As long as there were tests and there was a good project plan, I was trusted
to be an autonomous professional and to do solid work.&lt;/p&gt;

&lt;p&gt;I didn&amp;rsquo;t realize until later that in bigger commercial environments, you
usually don&amp;rsquo;t run your own projects as a software engineer. But in that case,
I was the only web application developer in our group, so I spent quite a bit
of time talking with our clients, mainly admin staff who needed software to
simplify their work. I knew almost every one of our users by name. They could
email me and I would help them if they needed support.&lt;/p&gt;

&lt;p&gt;That human connection was possible because we were writing software for only a
few hundred users at most.&lt;/p&gt;

&lt;p&gt;These days, I have to say I miss that sense of personal connection with the
users.&lt;/p&gt;

&lt;hr&gt;

&lt;p&gt;Looking back, it seems so improbable that I could go from playing with
HyperCard in the 1990s to being a professional software developer a few
decades later. I suppose life is full of those surprises.&lt;/p&gt;

&lt;p&gt;I just still look out for the moments of joy and exploration in what I do.&lt;/p&gt;

&lt;p&gt;And I automate the boring stuff.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>How this site is built</title>
    <link rel="alternate" href="/tech/2022/10/how-this-site-is-built.html"/>
    <id>/tech/2022/10/how-this-site-is-built.html</id>
    <published>2022-10-02T20:33:00+00:00</published>
    <updated>2022-10-02T20:33:00+00:00</updated>
    <author>
      <name>Eli Thorkelson</name>
    </author>
    <summary type="html">&lt;p&gt;&lt;em&gt;Update: For a more recent account of the site setup, see &lt;a href="/tech/2023/05/downsizing-servers.html"&gt;How to downsize a tiny web server and the services on it&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;OK, here we are, on my website. How is it generated? How is it hosted?&lt;/p&gt;

&lt;p&gt;First, let’s talk about the context. The technical constraints...&lt;/p&gt;</summary>
    <content type="html">&lt;p&gt;&lt;em&gt;Update: For a more recent account of the site setup, see &lt;a href="/tech/2023/05/downsizing-servers.html"&gt;How to downsize a tiny web server and the services on it&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;OK, here we are, on my website. How is it generated? How is it hosted?&lt;/p&gt;

&lt;p&gt;First, let&amp;rsquo;s talk about the context. The technical constraints always come from the context.&lt;/p&gt;

&lt;p&gt;This is a very low traffic site, with only static content. It contains a basic website, plus some downloadable PDFs of things I wrote.&lt;/p&gt;

&lt;p&gt;That&amp;rsquo;s already a very different problem from the things I work on at work!&lt;/p&gt;

&lt;h2 id="priorities"&gt;Priorities&lt;/h2&gt;

&lt;p&gt;First of all, since this is my project, I get to choose the priorities.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I care a lot about the content. &lt;em&gt;There isn&amp;rsquo;t a lot of new content here, but what&amp;rsquo;s here should be solid.&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;I care about site performance. &lt;em&gt;Fortunately, static HTML on modern hosting is already fast, so I don&amp;rsquo;t have to do much to improve it.&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;I care about continuity. &lt;em&gt;If I publish a piece of writing here, I want the link to keep working indefinitely.&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;I care about basic security. &lt;em&gt;Note 1: Don&amp;rsquo;t run web applications unless you can commit to security patches&amp;hellip; Note 2: I do have a valid TLS cert now. Though I certainly didn&amp;rsquo;t have a TLS cert in the early days - nobody did, not for a site like this. I remember when I first wanted one, it was still a huge pain to get a certificate — you had to buy it through some awful DNS registrar package deal, and manually download new cert files every time it expired. Now I just use Let&amp;rsquo;s Encrypt.&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;I care about minimizing useless maintenance overhead. I care about avoiding unnecessary runtime dependencies (which can in turn create security issues).&lt;/li&gt;
&lt;li&gt;I use this site to stay in touch with basic Linux administration and old school web technology. I like tinkering with my own HTML. I like poking around at nginx configuration. &lt;em&gt;I like remembering how much you can get done with something super, super basic. No containerization here, not so far.&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;I care about minimizing hosting costs (up to a point).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In sum, I&amp;rsquo;m here to learn a few new things, play with servers a little, and keep a stable web presence.&lt;/p&gt;

&lt;h2 id="non-priorities"&gt;Non-priorities&lt;/h2&gt;

&lt;p&gt;There are also some things I &lt;em&gt;don&amp;rsquo;t&lt;/em&gt; really care about.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I don&amp;rsquo;t really care about analytics or server monitoring (as long as the server is not on fire).&lt;/li&gt;
&lt;li&gt;I don&amp;rsquo;t care about tracking inbound links.&lt;/li&gt;
&lt;li&gt;I don&amp;rsquo;t care about supporting discussion or dialogue on the site itself. &lt;em&gt;I love talking to people, I just don&amp;rsquo;t have to run a forum right here.&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;I&amp;rsquo;m not optimizing SEO. At all. &lt;em&gt;But my name is already unusual so that&amp;rsquo;s doing some SEO all by itself.&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id="hosting"&gt;Hosting&lt;/h2&gt;

&lt;p&gt;I&amp;rsquo;ve had a lot of web hosting arrangements over the years. I think the history was something like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1999: My very first, very silly website was static HTML hosted by my ISP. They had a Unix system you could log into, back when small-town local ISPs were more common, the kind that could serve a website for every user at &lt;code&gt;server.com/~user&lt;/code&gt;. I think you had to upload the files with FTP. It wasn&amp;rsquo;t encrypted, but using a dialup connection to their systems, maybe it wasn&amp;rsquo;t all that insecure?&lt;/li&gt;
&lt;li&gt;2000–2002: Every dorm room at my college had wired Ethernet, a public IP address, and a stable hostname. So I self-hosted my website from an ancient version of Mac OS X (which had a built-in web server back in the day).&lt;/li&gt;
&lt;li&gt;2003–4: I had a personal website in some sandbox folder of a shared campus web server.&lt;/li&gt;
&lt;li&gt;2005–6: I was starting grad school and didn&amp;rsquo;t have a web presence for a year or two.&lt;/li&gt;
&lt;li&gt;2007–2014: I registered this website and set it up on &amp;ldquo;shared hosting.&amp;rdquo; It was cheap but irritating; I disliked cPanel.&lt;/li&gt;
&lt;li&gt;2014–present: Switched over to hosting on a cheap virtualized Linux box. Originally it ran Apache; I later switched to nginx. It&amp;rsquo;s boring. It works great. CPU load is usually near 0%.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id="site-generation"&gt;Site generation&lt;/h2&gt;

&lt;p&gt;Historically, this site has &lt;em&gt;always&lt;/em&gt; been basically static HTML, with hand-rolled CSS. I used to write a new stylesheet every so often, just because I could.&lt;/p&gt;

&lt;p&gt;I&amp;rsquo;ve always supplemented the static files with some extra programmatic tools, when I needed them. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In 2002, the site was static, but it hosted some images on a Python-based web server, which provided analytics.&lt;/li&gt;
&lt;li&gt;From 2007 to 2022, I had a WordPress blog hosted in a subdirectory of the main site. I liked the (old-school) WordPress post editor, and I liked blog comments back in the days when people actually used to use them.&lt;/li&gt;
&lt;li&gt;In about 2013, I wrote some dynamic code to programmatically display my progress in writing my dissertation. It used data from a git repository history and from Asana (a task tracking application) to display the progress. The data on the server had to be updated periodically, with a script I invoked manually - I never needed to automate it.&lt;/li&gt;
&lt;li&gt;In 2016 or so, I wrote a Ruby script to generate the navigation menu for the static files.&lt;/li&gt;
&lt;li&gt;In 2022, I ported the static site over to &lt;a href="https://middlemanapp.com/"&gt;Middleman&lt;/a&gt; to make it easier to maintain. I also like Markdown.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Right now, a static site generator is the sweet spot for me between &amp;ldquo;100% hand edited HTML files in a directory&amp;rdquo; and &amp;ldquo;100% dynamically generated content.&amp;rdquo;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I do wish I had a lightweight solution for contact forms. I used to use PHP for that once in a while, but only because I used to need it for WordPress. Now… 🤷‍♀️.&lt;/em&gt;&lt;/p&gt;

&lt;h2 id="conclusions"&gt;Conclusions&lt;/h2&gt;

&lt;p&gt;This site looks pretty basic, but it has actually taken a lot of work over the years to keep it going. The requirements of the web are always changing. I don&amp;rsquo;t want it to look &lt;em&gt;too&lt;/em&gt; dated. I want it to work on mobile. I want it to keep running for decades at a time.&lt;/p&gt;

&lt;p&gt;Minimalism is not actually all that cheap, when you think about it.&lt;/p&gt;
</content>
  </entry>
</feed>
