<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[From Scratch]]></title><description><![CDATA[I build software from scratch and share my learnings! 

I especially enjoy topics around compilers, distributed systems, GPU programming, and Linux.]]></description><link>https://michalpitr.substack.com</link><image><url>https://substackcdn.com/image/fetch/$s_!-HWp!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe5ab36b-c782-420e-ac35-c7599a1f77ad_976x976.png</url><title>From Scratch</title><link>https://michalpitr.substack.com</link></image><generator>Substack</generator><lastBuildDate>Sun, 24 May 2026 17:06:19 GMT</lastBuildDate><atom:link href="https://michalpitr.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Michal Pitr]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[michalpitr@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[michalpitr@substack.com]]></itunes:email><itunes:name><![CDATA[Michal Pitr]]></itunes:name></itunes:owner><itunes:author><![CDATA[Michal Pitr]]></itunes:author><googleplay:owner><![CDATA[michalpitr@substack.com]]></googleplay:owner><googleplay:email><![CDATA[michalpitr@substack.com]]></googleplay:email><googleplay:author><![CDATA[Michal Pitr]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[The Missing 2x: Why Int8 is Twice as Fast as BF16]]></title><description><![CDATA[From bf16 to int8 subword parallelism]]></description><link>https://michalpitr.substack.com/p/subword-parallelism</link><guid isPermaLink="false">https://michalpitr.substack.com/p/subword-parallelism</guid><dc:creator><![CDATA[Michal Pitr]]></dc:creator><pubDate>Sat, 23 May 2026 00:17:52 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/9e247d6b-d10d-4bcf-8207-8ed1f2dd4001_3840x2160.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Have you ever looked at an ML accelerator spec sheet and noticed that peak OPs/s at int8 are <em>exactly </em>2x higher than at bf16? But why? 
Is it just the type size difference? If it were just about storage, wouldn&#8217;t it be more useful to support fp8 instead?</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;markdown&quot;,&quot;nodeId&quot;:&quot;078a88ae-1d4f-45f4-9e60-2ad79d39b15b&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-markdown">| Accelerator | Peak bf16 (TFLOPs/s) | Peak int8 (TOPs/s) |
|-------------|----------------------|--------------------|
| TPU v6e     |                  918 |               1836 |
| TPU v5e     |                  197 |                393 |
| B200        |                 5000 |              10000 |</code></pre></div><p>And why do older chips, like TPU v4, offer identical throughput at int8 and bf16?</p><p>By the end of this article, we&#8217;ll cover:</p><ol><li><p>What systolic arrays are</p></li><li><p>How the bf16 format works</p></li><li><p>How a bf16 processing element is implemented</p></li><li><p>How to extend it to support int8</p></li><li><p>How to achieve 2x throughput</p></li></ol><p></p><blockquote><p>This article reuses animations from my <a href="https://youtu.be/fNLuM9uu4kY?si=VEyPtkB60QWgGhU4">YouTube video</a> covering the same topic. The article goes into much more depth, but do let me know what you think about the video format! It&#8217;s something I want to play around with more in the future.</p></blockquote><p></p><h1>Systolic Arrays</h1><p>It would be an understatement to say that matrix multiplication is hard to optimize on traditional processors. I took a shot at <a href="https://open.substack.com/pub/michalpitr/p/optimizing-matrix-multiplication">CPU matmul optimizations in a previous article</a> and it&#8217;s a good chunk of work to get even within a stone&#8217;s throw of reference implementations.</p><p>But fret not, throwing hardware at the problem is always an option and matmul is no exception. Systolic arrays are a particular type of circuit designed to achieve theoretically optimal arithmetic intensity (roughly, doing the maximum amount of work per memory access) [1].<br><br>Take a look at the animation below to see a 4x4 weight-stationary systolic array in action. TPUs can have multiple systolic arrays per chip, each up to 256x256 in size.</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;2fa13c0e-863a-4e8a-815e-98d15fb53a49&quot;,&quot;duration&quot;:null}"></div><p>Weight-stationary refers to how weights are loaded into the systolic array. 
Unlike the streamed activations, weights are loaded via double buffering. This significantly reduces the amount of data needing to be transferred between PEs every cycle.</p><h1>bfloat16 format</h1><p>Each cell in the systolic array, a so-called processing element, does one simple operation: a <strong>multiply-accumulate</strong>&#8212;the innermost body of the matrix multiplication code below:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;c&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-c">for (int i = 0; i &lt; N; ++i) {
  for (int j = 0; j &lt; M; ++j) {
    for (int k = 0; k &lt; K; ++k) {
      c[i][j] = a[i][k] * b[k][j] + c[i][j];
    }
  }
}</code></pre></div><p>Below is an illustrated execution of the processing element.</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;0e8745b3-75dd-4162-8c69-95783a3b3952&quot;,&quot;duration&quot;:null}"></div><p></p><h2>bfloat16 format</h2><p>To bridge the software-hardware gap, we need to understand how floats, particularly bf16 in our case, are encoded in bit form. I&#8217;ll assume you are familiar with how unsigned integers are encoded <em>(if not, just nod along and trust the math)</em>.<br><br>Figure 1 illustrates the bf16 format. It has 3 parts: a sign bit, 8 exponent bits, and 7 mantissa bits. Let&#8217;s tackle these 1 by 1.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PyLv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff80d327a-5f7d-4e3a-a918-5153543e4b06_1921x548.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PyLv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff80d327a-5f7d-4e3a-a918-5153543e4b06_1921x548.png 424w, https://substackcdn.com/image/fetch/$s_!PyLv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff80d327a-5f7d-4e3a-a918-5153543e4b06_1921x548.png 848w, https://substackcdn.com/image/fetch/$s_!PyLv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff80d327a-5f7d-4e3a-a918-5153543e4b06_1921x548.png 1272w, 
https://substackcdn.com/image/fetch/$s_!PyLv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff80d327a-5f7d-4e3a-a918-5153543e4b06_1921x548.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PyLv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff80d327a-5f7d-4e3a-a918-5153543e4b06_1921x548.png" width="728" height="207.5" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f80d327a-5f7d-4e3a-a918-5153543e4b06_1921x548.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:415,&quot;width&quot;:1456,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:103652,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://michalpitr.substack.com/i/193965704?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff80d327a-5f7d-4e3a-a918-5153543e4b06_1921x548.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!PyLv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff80d327a-5f7d-4e3a-a918-5153543e4b06_1921x548.png 424w, https://substackcdn.com/image/fetch/$s_!PyLv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff80d327a-5f7d-4e3a-a918-5153543e4b06_1921x548.png 848w, 
https://substackcdn.com/image/fetch/$s_!PyLv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff80d327a-5f7d-4e3a-a918-5153543e4b06_1921x548.png 1272w, https://substackcdn.com/image/fetch/$s_!PyLv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff80d327a-5f7d-4e3a-a918-5153543e4b06_1921x548.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Figure 1: BF16 format</figcaption></figure></div><h2>Sign</h2><p>The sign bit is simple: 0 denotes a positive number, 1 a negative number.</p><h2>Exponent</h2><p>In the formula in Figure 1, notice how the exponent is interpreted as an unsigned integer from which a bias of 127 is subtracted, and the result is used as the power of 2. This may look somewhat arbitrary at first glance, so let&#8217;s work through it.</p><p>The range of possible exponents runs from -126 to +127, using up 254 of the 256 possible values. The two remaining stored values (all zeros and all ones) are reserved for special cases such as zero, NaN, and INF. We&#8217;ll ignore the existence of these states (<em>a luxury hardware engineers don't have</em>), but it&#8217;s worth highlighting that they exist and the hardware must correctly handle them.</p><p>It&#8217;s also worth highlighting that single precision floats (fp32) use exactly the same exponent encoding and number of exponent bits. </p><h2>Mantissa</h2><p>The mantissa represents a fractional number in the interval [1, 2). This is achieved by having 7 variable bits and an implicit 8th most-significant bit always set to 1. If we interpret this as an integer, it gives us a range from 128 to 255. 
To get to the fractional number, we can divide by 128.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0CdH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cbeec0a-fed0-444b-a062-7258abda202c_1600x597.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0CdH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cbeec0a-fed0-444b-a062-7258abda202c_1600x597.png 424w, https://substackcdn.com/image/fetch/$s_!0CdH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cbeec0a-fed0-444b-a062-7258abda202c_1600x597.png 848w, https://substackcdn.com/image/fetch/$s_!0CdH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cbeec0a-fed0-444b-a062-7258abda202c_1600x597.png 1272w, https://substackcdn.com/image/fetch/$s_!0CdH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cbeec0a-fed0-444b-a062-7258abda202c_1600x597.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0CdH!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cbeec0a-fed0-444b-a062-7258abda202c_1600x597.png" width="722" height="269.2623626373626" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2cbeec0a-fed0-444b-a062-7258abda202c_1600x597.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:543,&quot;width&quot;:1456,&quot;resizeWidth&quot;:722,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0CdH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cbeec0a-fed0-444b-a062-7258abda202c_1600x597.png 424w, https://substackcdn.com/image/fetch/$s_!0CdH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cbeec0a-fed0-444b-a062-7258abda202c_1600x597.png 848w, https://substackcdn.com/image/fetch/$s_!0CdH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cbeec0a-fed0-444b-a062-7258abda202c_1600x597.png 1272w, https://substackcdn.com/image/fetch/$s_!0CdH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cbeec0a-fed0-444b-a062-7258abda202c_1600x597.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Figure 2: Mantissa format</figcaption></figure></div><blockquote><p><strong>The Mantissa Invariant:</strong> Because of the implicit leading 1, the mantissa <em>always</em> represents a value in the interval <strong>[1, 2)</strong>. Keep this rule in mind&#8212;it is going to cause us a minor headache once we tackle multiplication.</p></blockquote><p></p><h2>Encoding float into bf16 format</h2><p>The above may feel a bit reversed: it tells you how to read a bf16 number, not how to encode one. Take a look at the animation below, which shows how to marshal a float into the binary format.</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;373fb9bd-e272-4422-9b91-54857882c1cd&quot;,&quot;duration&quot;:null}"></div><p></p><h2>bfloat16 multiplication</h2><p>Now that we understand how bf16 is represented, let&#8217;s see how to multiply two bf16 numbers. 
Figure 3 shows something interesting: algebraically we can determine the sign, exponent, and mantissa independently of each other. Let&#8217;s see how to implement this in hardware.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YaUg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc8eae86-cd45-4884-b419-3e9ea294a0c7_618x95.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YaUg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc8eae86-cd45-4884-b419-3e9ea294a0c7_618x95.png 424w, https://substackcdn.com/image/fetch/$s_!YaUg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc8eae86-cd45-4884-b419-3e9ea294a0c7_618x95.png 848w, https://substackcdn.com/image/fetch/$s_!YaUg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc8eae86-cd45-4884-b419-3e9ea294a0c7_618x95.png 1272w, https://substackcdn.com/image/fetch/$s_!YaUg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc8eae86-cd45-4884-b419-3e9ea294a0c7_618x95.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YaUg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc8eae86-cd45-4884-b419-3e9ea294a0c7_618x95.png" width="484" height="74.40129449838187" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dc8eae86-cd45-4884-b419-3e9ea294a0c7_618x95.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:95,&quot;width&quot;:618,&quot;resizeWidth&quot;:484,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!YaUg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc8eae86-cd45-4884-b419-3e9ea294a0c7_618x95.png 424w, https://substackcdn.com/image/fetch/$s_!YaUg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc8eae86-cd45-4884-b419-3e9ea294a0c7_618x95.png 848w, https://substackcdn.com/image/fetch/$s_!YaUg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc8eae86-cd45-4884-b419-3e9ea294a0c7_618x95.png 1272w, https://substackcdn.com/image/fetch/$s_!YaUg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc8eae86-cd45-4884-b419-3e9ea294a0c7_618x95.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Figure 3:  Associativity rules</figcaption></figure></div><h3>Sign</h3><p>The resulting sign can be determined with an XOR.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;markdown&quot;,&quot;nodeId&quot;:&quot;b4019c3e-6425-4479-82ab-72d9eac00a95&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code 
class="language-markdown">| A | B | XOR |
|---|---|-----|
| 0 | 0 |   0 |
| 0 | 1 |   1 |
| 1 | 0 |   1 |
| 1 | 1 |   0 |</code></pre></div><p></p><h3>Exponent</h3><p>As we saw in Figure 3, multiplying two numbers of the form 2^x and 2^y amounts to adding their exponents, giving 2^{x+y}. But remember, the 8-bit values stored in the hardware (let's call them <code>E_A</code> and <code>E_B</code>) are <em><strong>biased</strong></em>. The <em>actual</em> mathematical values we are multiplying are:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;2^{E_A - 127} \\times 2^{E_B - 127} = 2^{(E_A - 127) + (E_B - 127)} = 2^{E_A + E_B - 254}&quot;,&quot;id&quot;:&quot;WRAVYCVWRY&quot;}" data-component-name="LatexBlockToDOM"></div><p>We need to store this new product back into the bf16 format. This means the hardware must compute a new stored exponent, <code>E_result</code>, which will <em>also</em> have a single bias of 127 implicitly subtracted from it during decoding:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;E_{result} - 127 = E_A + E_B - 254&quot;,&quot;id&quot;:&quot;EXEQDHVWCF&quot;}" data-component-name="LatexBlockToDOM"></div><p>If we solve for the value the hardware actually needs to store (<code>E_result</code>), we get:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;E_{result} = E_A + E_B - 127&quot;,&quot;id&quot;:&quot;MXKTMQYEQG&quot;}" data-component-name="LatexBlockToDOM"></div><p>This is great news! To compute the new exponent, the circuit simply adds the two 8-bit unsigned integers together and subtracts 127. This can be implemented extremely efficiently using standard integer adders.</p><h3>Mantissa</h3><p>Earlier, we saw that the mantissa is of the form <code>1.M</code>, where <code>M</code> is 7 bits. 
Logically, this is equivalent to an 8-bit integer implicitly divided by 128.</p><p>We also established the <strong>mantissa invariant</strong>: the value must always stay in the interval <strong>[1, 2)</strong>.</p><p>If we multiply two mantissas together, we&#8217;ll get a 16-bit product that can fall anywhere in the range <strong>[1, 4)</strong>. That can break the invariant! Let&#8217;s see this in action using good ol&#8217; long multiplication.</p><p>  </p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;cf687e9d-58f0-490c-a392-79b71fea713e&quot;,&quot;duration&quot;:null}"></div><p>Whenever the product is 2 or greater, the invariant no longer holds. To re-establish it, the hardware has to check the result. If the product overflowed into the <strong>[2, 4)</strong> range, the circuit divides the mantissa by 2 (a simple bitwise right-shift) and compensates by adding 1 to the exponent. A proper implementation would have to check if this +1 causes an exponent overflow, <em>but we will happily sweep that under the rug to protect our sanity.</em></p><p>After normalization, the next thing we need to discuss is the number of bits. Since we&#8217;ve re-established the invariant, we have 15 bits whose MSB is guaranteed to be 1, so we can treat it as the implicit leading 1. That leaves us with 14 variable bits which, <em>if I count right,</em> is 7 more than fit into a bf16 mantissa. This leaves us with two options:</p><ol><li><p>Round down to bf16</p></li><li><p>Stay in higher precision</p></li></ol><p>It might be tempting to go with option one, but remember that these multiplications happen in huge systolic arrays. Rounding errors can quickly accumulate and become catastrophic at that scale. </p><blockquote><p>If you&#8217;re particularly curious, try to write a script to simulate the rounding losses. 
Does it match your intuition?</p></blockquote><p>Instead, it&#8217;s common to keep the bf16-bf16 product in fp32. That provides enough breathing room and, as the figure below shows, the only difference between bf16 and fp32 is the number of mantissa bits, making promotion trivial.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xjkN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df2958b-763f-4398-9326-3c656aa287d4_3122x736.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xjkN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df2958b-763f-4398-9326-3c656aa287d4_3122x736.png 424w, https://substackcdn.com/image/fetch/$s_!xjkN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df2958b-763f-4398-9326-3c656aa287d4_3122x736.png 848w, https://substackcdn.com/image/fetch/$s_!xjkN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df2958b-763f-4398-9326-3c656aa287d4_3122x736.png 1272w, https://substackcdn.com/image/fetch/$s_!xjkN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df2958b-763f-4398-9326-3c656aa287d4_3122x736.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xjkN!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df2958b-763f-4398-9326-3c656aa287d4_3122x736.png" width="858" height="202.125" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8df2958b-763f-4398-9326-3c656aa287d4_3122x736.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:343,&quot;width&quot;:1456,&quot;resizeWidth&quot;:858,&quot;bytes&quot;:246340,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://michalpitr.substack.com/i/193965704?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df2958b-763f-4398-9326-3c656aa287d4_3122x736.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xjkN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df2958b-763f-4398-9326-3c656aa287d4_3122x736.png 424w, https://substackcdn.com/image/fetch/$s_!xjkN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df2958b-763f-4398-9326-3c656aa287d4_3122x736.png 848w, https://substackcdn.com/image/fetch/$s_!xjkN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df2958b-763f-4398-9326-3c656aa287d4_3122x736.png 1272w, https://substackcdn.com/image/fetch/$s_!xjkN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df2958b-763f-4398-9326-3c656aa287d4_3122x736.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Figure 3: bf16 vs fp32</figcaption></figure></div><p>Let&#8217;s see bf16 multiplication end-to-end in the circuit form.</p><div 
class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;8ba4b076-0851-4410-b4f7-92efc4b495ce&quot;,&quot;duration&quot;:null}"></div><h2>Addition</h2><p>So we&#8217;ve multiplied two bf16 numbers and produced an fp32 product. Now we need to add it to the fp32 partial result coming from the PE above.</p><p>We don't need to dive too deep into addition, but the key takeaway is that float addition requires a lot of circuitry to align the binary points before the actual math can commence. Watch the video below and notice how the circuit might have to shift numbers twice: once during initial alignment and then later to restore the mantissa invariant if leading 1s get canceled out.</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;a1b345c6-6b68-480f-93d6-0191945be150&quot;,&quot;duration&quot;:null}"></div><p>These shifters are expensive! In my local implementations synthesized with yosys, the 8x8 multiplier from earlier (with support for signed and unsigned integers) costs ~680 NAND2 gate equivalents, whereas the two shifters cost a combined 760 NAND2 gate equivalents.</p><p>This is a notable difference from fp32 PEs and a big part of the reason that bf16 is preferred. In fp32 the cost is dominated by the multiplier unit since its footprint scales quadratically with the number of mantissa bits. Addition, on the other hand, is dominated by the alignment shifters [2].</p><h1>Supporting int8</h1><p>So far we have a PE that supports bf16. That&#8217;s very cool, but what about int8? Out of the box we don&#8217;t get support and we need to do a little extra work.</p><p>Obviously, we could just extend the PE with dedicated int8 multiplication logic, dedicated int8 addition logic, and call it a day. 
That <em>would</em> work, but chips don&#8217;t grow on trees, so let&#8217;s try to be a little more economical.</p><h2>int8 multiplication</h2><p>You might&#8217;ve already noticed that we already have an 8x8 multiplier for mantissa multiplication! Mantissas are unsigned, whereas for int8 support we&#8217;d also like signed multiplication, but this is doable. <em>One approach</em> (and I&#8217;m not sure if it&#8217;s the one actually used in production PEs) is to use a 9x9 signed multiplier.</p><p>Let&#8217;s see why this works:</p><ol><li><p>uint8 can be promoted to int9 by adding a leading 0 bit (positive)</p></li><li><p>positive int8 can be promoted identically</p></li><li><p>negative int8 is a little nuanced, but it just requires copying the leading bit (sign extension)</p></li></ol><p>Let&#8217;s walk through the last case using -5 as an example. In two's complement, the most significant bit acts as a negative weight:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;markdown&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-markdown"># Two's complement -5
1111_1011 = -128 + 64 + 32 + 16 + 8 + 2 + 1 = -5
# Promoting to int9
1_1111_1011 = (-256 + 128) + 64 + 32 + 16 + 8 + 2 + 1 = -5</code></pre></div><p>Notice how after the promotion, the math automagically balances out! The first two bits (-256 + 128) collapse right back into the -128 we started with.</p><p>My understanding is that since we aren&#8217;t actually using the full range of possible int9 products, the synthesized multiplier can prune some logic. We also don&#8217;t need the rest of the bf16 multiplication logic, so it can be completely disabled in int8 mode, as shown in the video below.</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;27ecc14c-9108-4712-8e7a-2b5bf493368b&quot;,&quot;duration&quot;:null}"></div><p>Notice how we end up with a 16-bit integer product, which we can fit into the bottom half of the fp32 register.</p><h2>Addition</h2><p>In int8 mode, the partial sums are stored in int32 and the product is an int16. We simply bypass the fp32 addition logic completely (no longer needing the complex shifters) and use a dedicated integer adder instead.</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;3d469a8b-bde8-4136-a647-8fb3085d0406&quot;,&quot;duration&quot;:null}"></div><p>Integer adders are much cheaper than the fp32 addition logic, and since the int8 path is mutually exclusive with the bf16 path, there&#8217;s likely quite a bit of slack even in a heavily pipelined implementation. An optimized implementation would take advantage of this slack to reduce the required chip area for the adder.</p><h1>Have we done it?</h1><p>Have we achieved 2x throughput? Sadly, not yet. 
However, this is basically what older TPUs up to v4 did [3].</p><p>Let&#8217;s see why exactly this achieves the same peak throughput as bf16 and use it as an opportunity to briefly discuss three concepts:</p><ul><li><p>Clock frequency</p></li><li><p>Critical path</p></li><li><p>Pipelining</p></li></ul><h2>Clock frequency and critical path</h2><p>The animation below illustrates the concepts of clock frequency and critical path. Notice how the signal leaves the input registers on the positive clock edge and has to propagate to the output registers before the next positive clock edge.</p><p>If the clock frequency is too high for the circuit, the signal doesn&#8217;t have time to settle into the output register. It&#8217;s common to have paths that are less complex but are still required to operate at the global clock frequency, which is set by the critical path.</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;b74ba551-0283-4f61-9d2c-69dde5430206&quot;,&quot;duration&quot;:null}"></div><p>If we relate this back to the bf16 and int8 paths, in the current implementation it&#8217;s likely the shifters that lie on the critical path. The int8 logic is much simpler, but is forced to operate at the same clock speed.</p><h2>Pipelining</h2><p>As we&#8217;ve seen, the critical path is what limits the chip&#8217;s clock frequency. There are two ways forward:</p><ol><li><p>Optimize the critical path&#8217;s logic.</p></li><li><p>Insert new registers into the critical path.</p></li></ol><p>Optimizing the critical path might require divine intervention and even then there might be limitations. Let&#8217;s take a look at option 2, commonly known as pipelining.</p><p>In the animation below, the bf16 path uses a 4-stage pipeline, whereas the int8 path uses only 2 stages. 
To be honest, I&#8217;m not sure what is common in production chips. Either would work, with some extra accounting needed for cycle-by-cycle differences between bf16 and int8 modes.</p><p>While watching the video, notice that even though the latencies are different, the throughput at saturation is exactly the same. This drives home the idea that int8 mode on its own doesn&#8217;t provide extra throughput.</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;fc82e631-bcbb-4c53-9a4b-ab8911a01e00&quot;,&quot;duration&quot;:null}"></div><p></p><h1>Subword parallelism</h1><p>We are on the final stretch, so let&#8217;s keep pushing through.</p><p>So, how can we increase throughput to 2x? The obvious next thing to look at is that we are underutilizing our input wires by only using the bottom 8 bits in int8 mode. We can pack another int8 number there. The question is: which one?</p><p>Try to see if you can figure the answer out on your own before watching the animation below. </p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;be84a60f-a133-49bd-81ce-4ad2acdf96d9&quot;,&quot;duration&quot;:null}"></div><p>I suspect you see where this is going. We pack two neighboring (with respect to the dot product) int8s into a single PE. This technique, formally known as <strong>subword parallelism</strong> [4], has actually been around since the 90s for multimedia acceleration. By applying it here, we just need the PE to do the elementwise multiplications in parallel and accumulate them. 
To do this, we need to add two things: a second 8x8 multiplier and a 3-way adder (to simultaneously sum product A, product B, and the incoming partial sum from the PE above).</p><p>The following animation illustrates the required changes.</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;64c4f079-75e3-4cc2-822c-92e9f4d6c636&quot;,&quot;duration&quot;:null}"></div><p>This is pretty much exactly as one would expect, no surprises. Let&#8217;s talk briefly about the fact that we combine the two products into a single int32 accumulator. Since this operation is repeated many times, there&#8217;s a good question to ask: <em>can it overflow</em>?</p><p>Let&#8217;s take the worst case: the largest possible product from an 8x8 multiplier is (-128) x (-128) = 16,384, which easily fits into int16. With subword parallelism, two such products are added per step, so the worst-case contribution doubles to 32,768 = 2^15. Int32 has a max_val of 2^31 - 1, meaning we could repeat this operation roughly 65k times (2^31 / 2^15 = 2^16) before we have to worry about overflows.</p><p>Even if the systolic array were tiling over a massive contracting dimension K, it would require K &gt; 130k (each step consumes two elements of K) to even reach the possibility of overflow, and only in the absolute worst case. In practice, some products will be positive and some negative, so they largely cancel out. Suffice it to say, overflow isn&#8217;t a practical concern, but it&#8217;s good to sanity-check.</p><h1>Cost</h1><p>We&#8217;ve finally achieved 2x throughput. As we&#8217;ve seen before, the 8x8 multiplier costs around 700 NAND2 gates, so the total multiplier cost doubles.</p><p>In my functional, <em>but not particularly optimized</em>, SystemVerilog implementations, adding support for subword parallelism increases the cost in terms of NAND2 gates by a little over 20%. 
Considering that it gives us 2x throughput, that seems like a no-brainer and is exactly why int8 subword parallelism is so commonly supported on modern systolic arrays.</p><p></p><h2>Summary</h2><p>We&#8217;ve covered a lot! We&#8217;ve seen:</p><ul><li><p>what systolic arrays are</p></li><li><p>what processing elements do</p></li><li><p>how the bf16 format works</p></li><li><p>how mantissa multipliers can be reused for int8 multiplication</p></li><li><p>how clock frequency, the critical path, and pipelining limit throughput</p></li><li><p>how subword parallelism achieves 2x throughput</p></li></ul><p>And crucially, we&#8217;ve answered the question we started with: why TPU v4 doesn&#8217;t offer a 2x throughput increase at int8, and why newer chips do.</p><p></p><h2>References</h2><p>[1] H. T. Kung, "Why Systolic Architectures?," <em>Computer</em>, vol. 15, no. 1, pp. 37-46, Jan. 1982.</p><p>[2] J. L. Hennessy and D. A. Patterson, <em>Computer Architecture: A Quantitative Approach</em>, 6th ed. Cambridge, MA, USA: Morgan Kaufmann, 2017. <em>(Appendix J)</em></p><p>[3] N. P. Jouppi et al., "TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings," <em>Proc. ACM/IEEE 50th Annu. Int. Symp. Comput. Archit. (ISCA)</em>, 2023.</p><p>[4] R. B. Lee, "Subword Parallelism with MAX-2," <em>IEEE Micro</em>, vol. 16, no. 4, pp. 51-59, Aug. 1996.</p><p></p><div><hr></div><p></p><h3>A Note from Michal</h3><p>Hardware architecture can get pretty heavy, so my goal was to keep this as approachable as possible. I hope you learned something new about how ML chips tick and enjoyed the animations along the way! 
For me, understanding how the chips I work with on a daily basis work under the hood is super rewarding!</p><p><strong>Enjoyed this article?</strong></p><ul><li><p><strong>Subscribe to the Substack:</strong> If you haven&#8217;t already, consider subscribing!</p></li><li><p><strong>Check out Animated Compute:</strong> Content for this article is entirely based on the research I did for my latest <a href="https://www.youtube.com/@AnimatedCompute">YouTube video</a>. I&#8217;m planning to post more videos and work on improving the production quality, so I hope you&#8217;ll subscribe :)</p></li></ul>]]></content:encoded></item><item><title><![CDATA[Paper Journal: 
Disaggregated Serving]]></title><description><![CDATA[Achieving consistent online LLM inference performance]]></description><link>https://michalpitr.substack.com/p/paper-journal-disaggragated-serving</link><guid isPermaLink="false">https://michalpitr.substack.com/p/paper-journal-disaggragated-serving</guid><dc:creator><![CDATA[Michal Pitr]]></dc:creator><pubDate>Sun, 16 Mar 2025 20:51:31 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9603b7a1-30cf-4bfe-b8c3-276c3a0168da_3794x1965.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In this article, I won&#8217;t build anything <em>From Scratch</em>. Instead, I&#8217;ll cover a recent research paper <em><a href="https://arxiv.org/abs/2401.09670">DistServe</a>: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving</em> whose findings have since been implemented in vLLM.<br><br>DistServe tackles one of the most critical challenges in LLM deployment today: how to maintain consistent performance under high request loads. By disaggregating the prefill and decode stages of LLM inference, DistServe can satisfy the same Service Level Objectives (SLOs) at up to <strong>7x higher request rates</strong> compared to traditional approaches.</p><p>This innovation is particularly important for LLM API providers like Google or OpenAI, where predictable performance is essential for user experience.</p><p>In this post, I&#8217;ll provide a short overview of generative LLM inference and go over DistServe&#8217;s internals.</p><p></p><h3>Generating text with LLMs</h3><p>Let me briefly illustrate how the generative process works. 
This will be important to appreciate the core insight behind DistServe.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZJgf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea830893-3654-4981-923b-1be5f33ae426_7088x1156.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZJgf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea830893-3654-4981-923b-1be5f33ae426_7088x1156.png 424w, https://substackcdn.com/image/fetch/$s_!ZJgf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea830893-3654-4981-923b-1be5f33ae426_7088x1156.png 848w, https://substackcdn.com/image/fetch/$s_!ZJgf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea830893-3654-4981-923b-1be5f33ae426_7088x1156.png 1272w, https://substackcdn.com/image/fetch/$s_!ZJgf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea830893-3654-4981-923b-1be5f33ae426_7088x1156.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZJgf!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea830893-3654-4981-923b-1be5f33ae426_7088x1156.png" width="1458" height="237.32554945054946" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ea830893-3654-4981-923b-1be5f33ae426_7088x1156.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:237,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1458,&quot;bytes&quot;:563661,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://michalpitr.substack.com/i/158716277?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea830893-3654-4981-923b-1be5f33ae426_7088x1156.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZJgf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea830893-3654-4981-923b-1be5f33ae426_7088x1156.png 424w, https://substackcdn.com/image/fetch/$s_!ZJgf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea830893-3654-4981-923b-1be5f33ae426_7088x1156.png 848w, https://substackcdn.com/image/fetch/$s_!ZJgf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea830893-3654-4981-923b-1be5f33ae426_7088x1156.png 1272w, https://substackcdn.com/image/fetch/$s_!ZJgf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea830893-3654-4981-923b-1be5f33ae426_7088x1156.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><p>Decoder-only models are next-token predictors, in other words, every forward pass assigns a likelihood to each token in the model&#8217;s vocabulary. 
Then depending on the sampling function used, one of the most-likely tokens is selected. In the example above I&#8217;m using simple greedy sampling.</p><blockquote><p><em>I&#8217;ll be using the terms <strong>token</strong> and <strong>word</strong> interchangeably.</em></p></blockquote><p>A naive way to implement the generative loop would be to take the token generated at step 0, append it to the user input, rerun the whole modified input through the model, and repeat until termination. As you can imagine, this would be prohibitively expensive.</p><p>Decoder-only transformers commonly use causal attention &#8212; a context mechanism where the hidden representations of input tokens are functions of the hidden representations of preceding tokens and of the current token.</p><p>This allows inference engines to cache all of the intermediate tensors that will be required in subsequent generation passes. For instance, on the first iteration of the example above, all intermediate tensors that are required for the attention layer will be cached. On the second iteration, it&#8217;s sufficient to compute the forward pass just for the token &#8220;<em>Apples</em>&#8221; and substitute in the cached tensors where required.</p><blockquote><p><em>This technique is known as <strong>KV (Key-Value) caching</strong> since the intermediate tensors that need to be cached are the K, V attention tensors.</em></p></blockquote><p></p><h3>Two phases of LLM inference</h3><p>You might notice that not all steps in the generative loop above are created equal. The first step, often called the <strong>prefill </strong>or<strong> context phase</strong>, is much more computationally intensive. </p><p>Similar to how website load times are measured, LLM API providers monitor the time to first token (TTFT).</p><blockquote><p><em>It&#8217;s customary to express SLOs in percentiles. 
For instance, 50ms p99 TTFT would mean that the service has to generate the first token within 50ms for 99% of requests.</em></p></blockquote><p>In contrast, the subsequent <strong>decoding phase</strong> benefits from KV caching and is more often memory-bound. A common SLO here is the inter-token latency (ITL).</p><p>Taken together, <strong>TTFT</strong> + <strong>ITL</strong> * <strong>num_generated_tokens</strong> roughly gives the overall request latency (strictly, the first token is already covered by TTFT).</p><p></p><h3>vLLM setup</h3><p>Let&#8217;s walk through a reasonable inference setup with vLLM. vLLM and DistServe serve a similar purpose, so understanding vLLM will help us appreciate DistServe&#8217;s differences.</p><blockquote><p>vLLM is a popular LLM inference library initially developed at UC Berkeley that was the first to introduce paged attention for improving GPU memory utilization. Since then, vLLM has incorporated many common LLM inference optimizations, for instance, quantization, prefix caching, or speculative decoding.</p></blockquote><p>Let&#8217;s suppose we have a single machine with two GPUs and each GPU is large enough to fit the entire model we want to serve, say llama-3-8B.</p><p>A pretty reasonable inference setup might look something like this:</p><ol><li><p>Run a separate vLLM <strong>llama-3-8B</strong> instance on each GPU.</p></li><li><p>Set up a load balancer in front of these two model instances to distribute incoming requests.</p></li></ol><p>The key limitation here is that both prefill and decode phases run on the same GPU. </p><p>vLLM implements continuous batching, where new requests can join the active batch on each generation step, significantly improving throughput. However, this creates a non-obvious issue: requests in the computationally intensive prefill stage can significantly slow down decode operations happening in the same batch. </p><p>This makes it difficult to optimize the system for both TTFT and ITL at the same time. 
While other mitigations have been proposed, for instance, <a href="https://developer.nvidia.com/blog/streamlining-ai-inference-performance-and-deployment-with-nvidia-tensorrt-llm-chunked-prefill/">chunked prefill</a>, they have known limitations.</p><p></p><h2>DistServe</h2><p>The folks behind DistServe noticed this prefill-decode interference and came up with a clever idea to split the two phases. Disaggregating the two phases made TTFT and ITL more predictable, which let them maintain the same SLOs at several times higher request rates than vLLM.</p><p>Splitting these isn&#8217;t trivial: the prefill phase generates the KV cache blocks required by the decode phase, so they had to come up with a mechanism for moving cache blocks between instances.</p><p>DistServe is implemented on top of <a href="https://docs.ray.io/en/latest/index.html">Ray</a>, a popular orchestration platform for Python workloads. DistServe uses Ray actors and placement groups to assign prefill and decode worker instances to GPUs in a Ray cluster.</p><p>Let&#8217;s explore how exactly DistServe works.</p><p></p><h3>Life of a request</h3><p>DistServe is implemented as a single API server Python process that spawns and orchestrates Ray workers.</p><p>Requests sent to DistServe&#8217;s <strong>/generate</strong> endpoint are added to a prefill queue. The prefill-stage controller has an event loop that on each iteration tries to:</p><ol><li><p>Create a new batch from requests in the prefill queue.</p></li><li><p>Reserve the memory blocks for the batch, if one could be created.</p></li><li><p>Dispatch the batch request to all prefill workers.</p></li><li><p>Wait for the single prefill forward pass to finish.</p></li><li><p>Move finished requests to a bridge queue for the decode stage to handle them.</p></li></ol><p>Let me elaborate on the memory blocks. 
DistServe, like vLLM, implements <a href="https://blog.vllm.ai/2023/06/20/vllm.html">paged attention</a> &#8212; memory for KV cache is partitioned into blocks akin to operating system memory pages. DistServe&#8217;s block manager keeps track of all memory blocks. Information about block usage is then used by the scheduler when constructing a new batch.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-m1c!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96522ccd-80e0-4a6d-b1ad-a6b16435c778_1200x590.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-m1c!,w_424,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96522ccd-80e0-4a6d-b1ad-a6b16435c778_1200x590.gif 424w, https://substackcdn.com/image/fetch/$s_!-m1c!,w_848,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96522ccd-80e0-4a6d-b1ad-a6b16435c778_1200x590.gif 848w, https://substackcdn.com/image/fetch/$s_!-m1c!,w_1272,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96522ccd-80e0-4a6d-b1ad-a6b16435c778_1200x590.gif 1272w, https://substackcdn.com/image/fetch/$s_!-m1c!,w_1456,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96522ccd-80e0-4a6d-b1ad-a6b16435c778_1200x590.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-m1c!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96522ccd-80e0-4a6d-b1ad-a6b16435c778_1200x590.gif" width="728" height="357.93333333333334" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/96522ccd-80e0-4a6d-b1ad-a6b16435c778_1200x590.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:590,&quot;width&quot;:1200,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-m1c!,w_424,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96522ccd-80e0-4a6d-b1ad-a6b16435c778_1200x590.gif 424w, https://substackcdn.com/image/fetch/$s_!-m1c!,w_848,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96522ccd-80e0-4a6d-b1ad-a6b16435c778_1200x590.gif 848w, https://substackcdn.com/image/fetch/$s_!-m1c!,w_1272,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96522ccd-80e0-4a6d-b1ad-a6b16435c778_1200x590.gif 1272w, https://substackcdn.com/image/fetch/$s_!-m1c!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96522ccd-80e0-4a6d-b1ad-a6b16435c778_1200x590.gif 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Illustration of KV cache blocks. Taken from <a href="https://blog.vllm.ai/2023/06/20/vllm.html">vLLM&#8217;s page</a>.</figcaption></figure></div><p>The new batch is then dispatched to all prefill workers by using Ray&#8217;s <strong>remote()</strong> call. DistServe supports tensor and pipeline parallelism for each stage separately. 
In this walkthrough, I&#8217;ll be assuming no tensor or pipeline parallelism so there&#8217;s exactly one worker for each phase.</p><blockquote><p><em>Being able to configure tensor and pipeline parallelism settings for each stage independently is yet another benefit of splitting prefill and decode stages.</em></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!S3vv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9603b7a1-30cf-4bfe-b8c3-276c3a0168da_3794x1965.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!S3vv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9603b7a1-30cf-4bfe-b8c3-276c3a0168da_3794x1965.png 424w, https://substackcdn.com/image/fetch/$s_!S3vv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9603b7a1-30cf-4bfe-b8c3-276c3a0168da_3794x1965.png 848w, https://substackcdn.com/image/fetch/$s_!S3vv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9603b7a1-30cf-4bfe-b8c3-276c3a0168da_3794x1965.png 1272w, https://substackcdn.com/image/fetch/$s_!S3vv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9603b7a1-30cf-4bfe-b8c3-276c3a0168da_3794x1965.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!S3vv!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9603b7a1-30cf-4bfe-b8c3-276c3a0168da_3794x1965.png" width="1270" height="657.6785714285714" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9603b7a1-30cf-4bfe-b8c3-276c3a0168da_3794x1965.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:754,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1270,&quot;bytes&quot;:468368,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://michalpitr.substack.com/i/158716277?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9603b7a1-30cf-4bfe-b8c3-276c3a0168da_3794x1965.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!S3vv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9603b7a1-30cf-4bfe-b8c3-276c3a0168da_3794x1965.png 424w, https://substackcdn.com/image/fetch/$s_!S3vv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9603b7a1-30cf-4bfe-b8c3-276c3a0168da_3794x1965.png 848w, https://substackcdn.com/image/fetch/$s_!S3vv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9603b7a1-30cf-4bfe-b8c3-276c3a0168da_3794x1965.png 1272w, https://substackcdn.com/image/fetch/$s_!S3vv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9603b7a1-30cf-4bfe-b8c3-276c3a0168da_3794x1965.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">DistServe architecture</figcaption></figure></div><p></p><p>When the prefill forward step finishes, requests are moved to DistServe&#8217;s bridge queue. DistServe has a pretty complex queue setup on the decode side, so I&#8217;ve simplified this significantly in the diagram above.<br><br>In a nutshell, a request that has finished the prefill phase waits in a series of queues until two conditions are met:</p><ol><li><p>Its KV cache blocks can be moved from the prefill worker to the decode worker.</p></li><li><p>Once the KV cache blocks are on the decode worker, the scheduler can try to add the request to the next batch it forms through continuous batching.</p></li></ol><p>The KV cache migration from a prefill worker to a decode worker is interesting. 
Once DistServe determines that the decode worker has enough free GPU memory, it asks the decode worker to pull the relevant tensors from the prefill worker. DistServe uses a custom execution engine, SwiftTransformer, which implements <a href="https://github.com/LLMServe/SwiftTransformer/blob/main/src/csrc/util/py_block_migration.cc#L139-L252">support for block migration</a> via cudaMemcpyAsync.</p><blockquote><p><em>Interestingly, the DistServe authors found that the transfer overhead even for a relatively large 175B model was only around 0.1% of the total latency. This is largely thanks to high-bandwidth InfiniBand and NVLink GPU-GPU communication links.</em></p></blockquote><p>DistServe has to ensure that the transfer finishes before the corresponding request can be handled by the decode worker. Once the transfer finishes, the prefill worker can free those memory blocks.</p><p>The decode scheduler uses continuous batching. Since each decoding request might need a different number of iterations to terminate, on each iteration the scheduler tries to add a new request whose KV cache blocks have already been transferred to the GPU. If the memory requirements of existing requests grow more than expected, the most recently added requests can be temporarily swapped to main memory.</p><p>On each iteration, predicted tokens can be streamed back to users. Once a request is finished, all associated memory blocks can be freed or cached in cheaper storage. Caching can be useful to avoid recomputing everything for chat-like use cases.</p><p></p><h2>Summary</h2><p>DistServe is an interesting paper that has gathered well-deserved interest from hyperscalers like Google and OpenAI. By splitting the prefill and decode phases across separate workers, it resolves a fundamental issue with traditional serving systems.</p><p>For LLM API providers, the lessons from DistServe can likely reduce the need to overprovision hardware to deal with prefill &#215; decode interference. Disaggregating prefill and decode has since been <a href="https://docs.vllm.ai/en/latest/features/disagg_prefill.html">implemented in vLLM</a>, further showing the relevance of this finding.</p><div><hr></div><p>I typically build interesting software <em>From Scratch</em> in my articles. With my background in LLM inference and Kubernetes, DistServe lies in a particularly fun problem space for me, so I made an exception and covered the paper. If there&#8217;s interest, I might do these more often :)</p><p>If you enjoy deep dives like this, consider subscribing! I&#8217;m also somewhat active on <a href="https://www.linkedin.com/in/michal-pitr/">LinkedIn</a>, so consider connecting with me there. 
</p>]]></content:encoded></item><item><title><![CDATA[Optimizing matrix multiplication]]></title><description><![CDATA[Discovering optimizations one at a time]]></description><link>https://michalpitr.substack.com/p/optimizing-matrix-multiplication</link><guid isPermaLink="false">https://michalpitr.substack.com/p/optimizing-matrix-multiplication</guid><dc:creator><![CDATA[Michal Pitr]]></dc:creator><pubDate>Sat, 15 Feb 2025 17:34:18 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c7ba7fa-8377-499d-aabd-fb32c43df468_999x965.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This article is all about performance optimizations - squeezing as much performance out of my 
CPU as I can.<br><br>I&#8217;ll start with a naive matrix multiplication in C and then iteratively improve it until my implementation approaches the performance of AMD&#8217;s bli_dgemm. My goal is not just to present optimizations, but rather for you to discover them with me.</p><p>As we go through this, we&#8217;ll learn a thing or two about compilers, assembly, and the underlying hardware.</p><blockquote><p><em>For the sake of ever finishing this article, I&#8217;ll focus purely on single-threaded optimizations and will tackle parallelization in the future.</em></p></blockquote><h2>Setup</h2><p>I&#8217;m running a Ryzen 5 5600H processor with 6 cores and 12 threads. Each core has 32 KiB of L1d cache, 32 KiB of L1i cache, and 512 KiB of L2 cache. All cores share 16 MiB of L3 cache.</p><p>I&#8217;ll be using clang 18.1.3 on Ubuntu 24.04 throughout this article.</p><h2>Matrix multiplication review</h2><p>Given two matrices, <strong>A</strong> with <code>n</code> rows and <code>k</code> columns (<code>(n,k)</code> from now on) and a <code>(k,m)</code> matrix <strong>B</strong>, the product <strong>AB</strong> is an <code>(n,m)</code> matrix <strong>C</strong>. The element <strong>C</strong>[r][c] is defined as the dot product of row <code>r</code> of <strong>A</strong> with column <code>c</code> of <strong>B</strong>.</p><p>In summation notation:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;c_{r,c} = \\sum_{i=1}^{k} a_{r,i} b_{i,c} &quot;,&quot;id&quot;:&quot;DVQKDJDWEC&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>Let&#8217;s use the mathematical definition as the starting point.</p><h2>Naive implementation</h2><p>The naive implementation is a direct translation of the mathematical definition to C. For simplicity&#8217;s sake, I&#8217;ll assume that matrices are square and their size is a power of 2.</p><p>Take a look to see if you spot anything surprising. 
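</p><p>As a text-only sketch of what such a naive implementation might look like (an illustration, not necessarily the exact listing in the image): the names <code>matmul_naive</code> and <code>idx</code> are made up, and the sketch assumes square <code>n</code> by <code>n</code> matrices of doubles stored as flat arrays in column-major order, with <code>restrict</code>-qualified pointers.</p><pre><code>#include &lt;stddef.h&gt;

/* Hypothetical helper: offset of element (row, col) in an n x n matrix
 * stored as a flat array in column-major order. */
static inline size_t idx(size_t row, size_t col, size_t n) {
    return col * n + row;
}

/* Naive C = A * B for square n x n matrices of doubles (column-major).
 * restrict promises the compiler that A, B and C never alias. */
void matmul_naive(const double *restrict A, const double *restrict B,
                  double *restrict C, size_t n) {
    for (size_t r = 0; r &lt; n; r++) {
        for (size_t c = 0; c &lt; n; c++) {
            double sum = 0.0;
            /* Dot product of row r of A with column c of B. */
            for (size_t i = 0; i &lt; n; i++) {
                sum += A[idx(r, i, n)] * B[idx(i, c, n)];
            }
            C[idx(r, c, n)] = sum;
        }
    }
}</code></pre><p>With this layout, the element at row <code>r</code> and column <code>c</code> lives at offset <code>c * n + r</code>, so the inner loop walks A in steps of <code>n</code> doubles while it walks B one double at a time.</p><p>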
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9cDX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F593f7bfc-e4b8-4821-aab5-b9e84db86edc_3372x2056.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9cDX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F593f7bfc-e4b8-4821-aab5-b9e84db86edc_3372x2056.png 424w, https://substackcdn.com/image/fetch/$s_!9cDX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F593f7bfc-e4b8-4821-aab5-b9e84db86edc_3372x2056.png 848w, https://substackcdn.com/image/fetch/$s_!9cDX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F593f7bfc-e4b8-4821-aab5-b9e84db86edc_3372x2056.png 1272w, https://substackcdn.com/image/fetch/$s_!9cDX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F593f7bfc-e4b8-4821-aab5-b9e84db86edc_3372x2056.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9cDX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F593f7bfc-e4b8-4821-aab5-b9e84db86edc_3372x2056.png" width="648" height="395.2087912087912" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/593f7bfc-e4b8-4821-aab5-b9e84db86edc_3372x2056.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:888,&quot;width&quot;:1456,&quot;resizeWidth&quot;:648,&quot;bytes&quot;:1426809,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9cDX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F593f7bfc-e4b8-4821-aab5-b9e84db86edc_3372x2056.png 424w, https://substackcdn.com/image/fetch/$s_!9cDX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F593f7bfc-e4b8-4821-aab5-b9e84db86edc_3372x2056.png 848w, https://substackcdn.com/image/fetch/$s_!9cDX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F593f7bfc-e4b8-4821-aab5-b9e84db86edc_3372x2056.png 1272w, https://substackcdn.com/image/fetch/$s_!9cDX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F593f7bfc-e4b8-4821-aab5-b9e84db86edc_3372x2056.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Naive matrix multiplication</figcaption></figure></div><p>The restrict keyword tells the compiler that no other pointer will access the object, which in turn can allow the compiler to generate more optimized code.</p><p>Let me explain what column-major order is.</p><h3>Storing matrices in memory</h3><p>Column-major order is a flat representation of a matrix in memory such that columns are laid out sequentially. 
This approach is used by Fortran and many BLAS libraries, including AMD&#8217;s BLIS.</p><blockquote><p>The default ordering of 2D arrays in C/C++ is row-major.</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!iN_f!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c7ba7fa-8377-499d-aabd-fb32c43df468_999x965.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!iN_f!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c7ba7fa-8377-499d-aabd-fb32c43df468_999x965.png 424w, https://substackcdn.com/image/fetch/$s_!iN_f!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c7ba7fa-8377-499d-aabd-fb32c43df468_999x965.png 848w, https://substackcdn.com/image/fetch/$s_!iN_f!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c7ba7fa-8377-499d-aabd-fb32c43df468_999x965.png 1272w, https://substackcdn.com/image/fetch/$s_!iN_f!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c7ba7fa-8377-499d-aabd-fb32c43df468_999x965.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!iN_f!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c7ba7fa-8377-499d-aabd-fb32c43df468_999x965.png" width="436" height="421.16116116116115" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7c7ba7fa-8377-499d-aabd-fb32c43df468_999x965.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:965,&quot;width&quot;:999,&quot;resizeWidth&quot;:436,&quot;bytes&quot;:330149,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!iN_f!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c7ba7fa-8377-499d-aabd-fb32c43df468_999x965.png 424w, https://substackcdn.com/image/fetch/$s_!iN_f!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c7ba7fa-8377-499d-aabd-fb32c43df468_999x965.png 848w, https://substackcdn.com/image/fetch/$s_!iN_f!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c7ba7fa-8377-499d-aabd-fb32c43df468_999x965.png 1272w, https://substackcdn.com/image/fetch/$s_!iN_f!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c7ba7fa-8377-499d-aabd-fb32c43df468_999x965.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Column-major order matrix representation</figcaption></figure></div><p></p><h3>Measuring performance</h3><p><br>I&#8217;ll be measuring the time required to multiply two 4096x4096 matrices of doubles. </p><blockquote><p><em>I&#8217;ve chosen large matrices since some optimizations are more pronounced when the matrices don&#8217;t fully fit in cache. With N=4096, the 3 matrices have a total size of (3*4096*4096*8)/(1024*1024) = 384 MiB, well over my 16 MiB L3 cache.</em></p></blockquote><p><br>I also want to briefly discuss how I&#8217;ll compile the code. The two main options are:</p><ul><li><p>Introduce compiler flags as I introduce code optimizations</p></li><li><p>Enable all relevant optimization flags from the outset</p><p></p></li></ul><p>I&#8217;ll go with the second option as I think it&#8217;s slightly more transparent. I&#8217;ll make sure to highlight when a code change causes &#8220;unintentional&#8221; optimizations. 
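</p><p>For the timing itself, here&#8217;s a minimal sketch of how a single run could be measured. It assumes a POSIX system and uses <code>clock_gettime</code> with <code>CLOCK_MONOTONIC</code>; the helper names <code>elapsed_seconds</code> and <code>time_call</code> are hypothetical, and this is not necessarily the harness behind the numbers in this article.</p><pre><code>/* Request the POSIX declarations before any include, so that strict
 * -std=c11 builds still declare clock_gettime. */
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L
#endif
#include &lt;stdio.h&gt;
#include &lt;time.h&gt;

/* Seconds elapsed between two timespec readings. */
static double elapsed_seconds(struct timespec start, struct timespec end) {
    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}

/* Hypothetical wrapper: run fn(n) once and report the wall-clock time.
 * A monotonic clock is used so system clock adjustments don't skew it. */
double time_call(void (*fn)(int n), int n) {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &amp;start);
    fn(n);
    clock_gettime(CLOCK_MONOTONIC, &amp;end);
    double secs = elapsed_seconds(start, end);
    printf("Elapsed execution time: %f sec; N: %d\n", secs, n);
    return secs;
}</code></pre><p>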
Unless stated otherwise, I&#8217;ll be using the following flags.</p><pre><code>clang main.c matmul.c -std=gnu11 -O3 -DNDEBUG -march=native -mfma -ffast-math -mavx2 -lrt -lblis -o matmul</code></pre><p></p><h3>Baselines</h3><p>With the code compiled, we can finally get the first results.</p><p>The naive implementation multiplies two 4096x4096 matrices in a leisurely <strong>480 seconds</strong>,</p><pre><code>michal@michal-lg:~/code/cpu_matmul$ ./matmul 4096
Elapsed execution time: 480.609548 sec; N: 4096, __TYPE__: double</code></pre><p>while the much snappier bli_dgemm does it in <strong>2.6 seconds</strong>, roughly <strong>190x faster</strong>.</p><pre><code>michal@michal-lg:~/code/cpu_matmul$ ./matmul 4096
Elapsed execution time: 2.564394 sec; N: 4096, __TYPE__: double</code></pre><p></p><h3>Searching for bottlenecks</h3><p>I&#8217;ve run the naive implementation through <code>perf</code> to collect performance counters. I&#8217;m using N=1024 matrices to speed things up when profiling.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!H_7-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa121a733-e6bc-430c-8d52-1e2763894263_3680x2072.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!H_7-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa121a733-e6bc-430c-8d52-1e2763894263_3680x2072.png 424w, https://substackcdn.com/image/fetch/$s_!H_7-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa121a733-e6bc-430c-8d52-1e2763894263_3680x2072.png 848w, 
https://substackcdn.com/image/fetch/$s_!H_7-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa121a733-e6bc-430c-8d52-1e2763894263_3680x2072.png 1272w, https://substackcdn.com/image/fetch/$s_!H_7-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa121a733-e6bc-430c-8d52-1e2763894263_3680x2072.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!H_7-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa121a733-e6bc-430c-8d52-1e2763894263_3680x2072.png" width="1456" height="820" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a121a733-e6bc-430c-8d52-1e2763894263_3680x2072.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1458340,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!H_7-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa121a733-e6bc-430c-8d52-1e2763894263_3680x2072.png 424w, https://substackcdn.com/image/fetch/$s_!H_7-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa121a733-e6bc-430c-8d52-1e2763894263_3680x2072.png 848w, 
https://substackcdn.com/image/fetch/$s_!H_7-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa121a733-e6bc-430c-8d52-1e2763894263_3680x2072.png 1272w, https://substackcdn.com/image/fetch/$s_!H_7-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa121a733-e6bc-430c-8d52-1e2763894263_3680x2072.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Results of perf stat</figcaption></figure></div><p>The cache miss rate stands out to me as surprisingly large. 
Let&#8217;s use a <a href="http://michalpitr.com">little tool</a> I wrote to better visualize memory accesses. <br><br>The video below shows both the logical and memory representations of matrices. As I step through the naive multiplication, notice where the accessed elements are located in memory.</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;30217f11-f2ff-4f50-9c21-e6f9c192d44a&quot;,&quot;duration&quot;:null}"></div><p>Focus on which element of A gets accessed on each iteration. In the logical representation we access sequential elements within a row, but because the matrices are stored in column-major order, consecutive accesses are actually N elements apart in memory. This is illustrated by the memory views at the bottom.<br><br>This is a major issue for sufficiently large matrices. Each access requires reading a new cache line from main memory. By the time the loop wraps around to the first element, chances are the original cache line has already been evicted, forcing us to fetch it again.</p><p>I&#8217;ve illustrated a few iterations for matrix A. For illustrative purposes, I&#8217;m assuming a very small, fully associative cache that can only store 3 lines. Each cache line is 16 bytes, i.e. 2 doubles.</p><p>In the first iteration, A[0][0] is read. It&#8217;s not in the cache, so we get a cache miss and the cache line has to be fetched from main memory. Notice that we fetch a full cache line even when we want only one element from it. 
On modern CPUs cache lines are commonly 64 bytes.</p><p>Next loop A[0][1] is read - again a cache miss.</p><blockquote><p><em>Shame we didn&#8217;t read A[1][0] instead that we already fetched last iteration&#8230; </em></p></blockquote><p>This keeps repeating until we loop back to A[0][0], but unfortunately, by that point, the cache line that contained A[0][0] was already evicted.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8fib!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4085b6a3-d786-44e1-9463-100e7a292de1_1071x3923.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8fib!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4085b6a3-d786-44e1-9463-100e7a292de1_1071x3923.png 424w, https://substackcdn.com/image/fetch/$s_!8fib!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4085b6a3-d786-44e1-9463-100e7a292de1_1071x3923.png 848w, https://substackcdn.com/image/fetch/$s_!8fib!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4085b6a3-d786-44e1-9463-100e7a292de1_1071x3923.png 1272w, https://substackcdn.com/image/fetch/$s_!8fib!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4085b6a3-d786-44e1-9463-100e7a292de1_1071x3923.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8fib!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4085b6a3-d786-44e1-9463-100e7a292de1_1071x3923.png" width="249" 
height="912.0700280112045" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4085b6a3-d786-44e1-9463-100e7a292de1_1071x3923.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:3923,&quot;width&quot;:1071,&quot;resizeWidth&quot;:249,&quot;bytes&quot;:565334,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8fib!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4085b6a3-d786-44e1-9463-100e7a292de1_1071x3923.png 424w, https://substackcdn.com/image/fetch/$s_!8fib!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4085b6a3-d786-44e1-9463-100e7a292de1_1071x3923.png 848w, https://substackcdn.com/image/fetch/$s_!8fib!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4085b6a3-d786-44e1-9463-100e7a292de1_1071x3923.png 1272w, https://substackcdn.com/image/fetch/$s_!8fib!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4085b6a3-d786-44e1-9463-100e7a292de1_1071x3923.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Cache evictions</figcaption></figure></div><p>Try to convince yourself that if we instead iterated over matrix A in column order, the cache hit rate would significantly improve.</p><p></p><div><hr></div><p></p><h2>Loop reordering</h2><p>Previously, we looped over the matrices in <em>r</em>, <em>c</em>, <em>k</em> order. Based on the findings in the last section, we&#8217;d like the inner loop to step through the row index <em>r</em>, i.e. walk down a column, as much as possible, because those are the elements that sit next to each other in column-major memory. 
</p><p>We can just reorder these loops however we want without affecting correctness.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!z3xl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7a7abf1-12c6-4bf0-93cc-83410819dd10_2920x1796.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!z3xl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7a7abf1-12c6-4bf0-93cc-83410819dd10_2920x1796.png 424w, https://substackcdn.com/image/fetch/$s_!z3xl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7a7abf1-12c6-4bf0-93cc-83410819dd10_2920x1796.png 848w, https://substackcdn.com/image/fetch/$s_!z3xl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7a7abf1-12c6-4bf0-93cc-83410819dd10_2920x1796.png 1272w, https://substackcdn.com/image/fetch/$s_!z3xl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7a7abf1-12c6-4bf0-93cc-83410819dd10_2920x1796.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!z3xl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7a7abf1-12c6-4bf0-93cc-83410819dd10_2920x1796.png" width="665" height="409.2307692307692" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d7a7abf1-12c6-4bf0-93cc-83410819dd10_2920x1796.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:896,&quot;width&quot;:1456,&quot;resizeWidth&quot;:665,&quot;bytes&quot;:1425320,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!z3xl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7a7abf1-12c6-4bf0-93cc-83410819dd10_2920x1796.png 424w, https://substackcdn.com/image/fetch/$s_!z3xl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7a7abf1-12c6-4bf0-93cc-83410819dd10_2920x1796.png 848w, https://substackcdn.com/image/fetch/$s_!z3xl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7a7abf1-12c6-4bf0-93cc-83410819dd10_2920x1796.png 1272w, https://substackcdn.com/image/fetch/$s_!z3xl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7a7abf1-12c6-4bf0-93cc-83410819dd10_2920x1796.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">We can swap loops at will since there are no computational dependencies</figcaption></figure></div><p>This presents us with two compelling loop orders: <em>c</em>,<em> k</em>,<em> r</em> and <em>k</em>,<em> c</em>,<em> r</em>. </p><p>I expect <em>c</em>, <em>k</em>,<em> r</em> to perform better since k is used as the row index in B, but intuition is often misleading, so let&#8217;s measure.<br><br><em>c</em>, <em>k</em>, <em>r</em>:</p><pre><code>michal@michal-lg:~/code/cpu_matmul$ ./matmul 4096
Elapsed execution time: 20.784827 sec; N: 4096, __TYPE__: double</code></pre><p><em>k</em>, <em>c</em>, <em>r</em>:</p><pre><code>michal@michal-lg:~/code/cpu_matmul$ ./matmul 4096
Elapsed execution time: 51.014005 sec; N: 4096, __TYPE__: double</code></pre><p>Just changing the loop order improved performance by <strong>~20x</strong>.</p><p>Re-running <code>perf stat</code> shows a <strong>~50x</strong> reduction in L3 cache misses.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RNGL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bc87371-8f34-4960-b123-2f39f15c5876_3680x2072.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RNGL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bc87371-8f34-4960-b123-2f39f15c5876_3680x2072.png 424w, https://substackcdn.com/image/fetch/$s_!RNGL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bc87371-8f34-4960-b123-2f39f15c5876_3680x2072.png 848w, https://substackcdn.com/image/fetch/$s_!RNGL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bc87371-8f34-4960-b123-2f39f15c5876_3680x2072.png 1272w, https://substackcdn.com/image/fetch/$s_!RNGL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bc87371-8f34-4960-b123-2f39f15c5876_3680x2072.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RNGL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bc87371-8f34-4960-b123-2f39f15c5876_3680x2072.png" width="1456" height="820" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2bc87371-8f34-4960-b123-2f39f15c5876_3680x2072.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1449556,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RNGL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bc87371-8f34-4960-b123-2f39f15c5876_3680x2072.png 424w, https://substackcdn.com/image/fetch/$s_!RNGL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bc87371-8f34-4960-b123-2f39f15c5876_3680x2072.png 848w, https://substackcdn.com/image/fetch/$s_!RNGL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bc87371-8f34-4960-b123-2f39f15c5876_3680x2072.png 1272w, https://substackcdn.com/image/fetch/$s_!RNGL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bc87371-8f34-4960-b123-2f39f15c5876_3680x2072.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">perf stat for reordered loops</figcaption></figure></div><p>Besides a better access pattern, reordering the loops allowed the compiler to vectorize some operations! Let&#8217;s discuss that next.</p><p></p><h3>Inspecting Assembly</h3><p>In this section we&#8217;ll look at how the generated assembly changes with vectorization on and off. </p><h4>Non-vectorized</h4><p>The figure below shows assembly for the inner loop with vectorization disabled. You can see the full function at <a href="https://godbolt.org/z/c6x46veeW">Compiler Explorer</a>. </p><blockquote><p><em>Note: Here I only use compiler flags &#8220;</em>-O3 -fno-vectorize&#8221;</p></blockquote><p>I&#8217;ve annotated the interesting lines. With a bit of squinting, you can hopefully convince yourself that the C code translates logically into the assembly code on the right.</p><blockquote><p><em>I grew up reading left to right so I prefer the AT&amp;T assembly syntax. 
Sorry my Arab friends</em>.</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GYKC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe21665bd-74c7-4e48-ad3c-8f1f806c1a0f_3767x1903.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GYKC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe21665bd-74c7-4e48-ad3c-8f1f806c1a0f_3767x1903.png 424w, https://substackcdn.com/image/fetch/$s_!GYKC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe21665bd-74c7-4e48-ad3c-8f1f806c1a0f_3767x1903.png 848w, https://substackcdn.com/image/fetch/$s_!GYKC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe21665bd-74c7-4e48-ad3c-8f1f806c1a0f_3767x1903.png 1272w, https://substackcdn.com/image/fetch/$s_!GYKC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe21665bd-74c7-4e48-ad3c-8f1f806c1a0f_3767x1903.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GYKC!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe21665bd-74c7-4e48-ad3c-8f1f806c1a0f_3767x1903.png" width="1200" height="606.5934065934066" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e21665bd-74c7-4e48-ad3c-8f1f806c1a0f_3767x1903.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:736,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:2215186,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!GYKC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe21665bd-74c7-4e48-ad3c-8f1f806c1a0f_3767x1903.png 424w, https://substackcdn.com/image/fetch/$s_!GYKC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe21665bd-74c7-4e48-ad3c-8f1f806c1a0f_3767x1903.png 848w, https://substackcdn.com/image/fetch/$s_!GYKC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe21665bd-74c7-4e48-ad3c-8f1f806c1a0f_3767x1903.png 1272w, https://substackcdn.com/image/fetch/$s_!GYKC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe21665bd-74c7-4e48-ad3c-8f1f806c1a0f_3767x1903.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Assembly for non-vectorized matmul</figcaption></figure></div><p>As highlighted, clang applied a loop unrolling optimization with a factor of 2, i.e. on each inner loop iteration, it calculates C[r][c] and C[r+1][c]. </p><p>If it didn&#8217;t do this, 3 out of 7 instructions per inner loop execution would be pure loop overhead: incrementing the counter, comparing it with the loop limit, and jumping. That would be expensive, so unrolling amortizes the overhead across more useful work. 
In fact, when the number of loop iterations is known at compile time, it&#8217;s very common to see completely unrolled loops.</p><blockquote><p><em>Note how the compiler still generated </em>LBB0_5<em> in case the loop has an odd number of iterations.</em></p></blockquote><h4>Vectorized</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DFZK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ab4e67a-4962-4282-978d-f234faad7a3f_669x373.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DFZK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ab4e67a-4962-4282-978d-f234faad7a3f_669x373.jpeg 424w, https://substackcdn.com/image/fetch/$s_!DFZK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ab4e67a-4962-4282-978d-f234faad7a3f_669x373.jpeg 848w, https://substackcdn.com/image/fetch/$s_!DFZK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ab4e67a-4962-4282-978d-f234faad7a3f_669x373.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!DFZK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ab4e67a-4962-4282-978d-f234faad7a3f_669x373.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DFZK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ab4e67a-4962-4282-978d-f234faad7a3f_669x373.jpeg" width="455" height="253.6846038863976" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5ab4e67a-4962-4282-978d-f234faad7a3f_669x373.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:373,&quot;width&quot;:669,&quot;resizeWidth&quot;:455,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DFZK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ab4e67a-4962-4282-978d-f234faad7a3f_669x373.jpeg 424w, https://substackcdn.com/image/fetch/$s_!DFZK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ab4e67a-4962-4282-978d-f234faad7a3f_669x373.jpeg 848w, https://substackcdn.com/image/fetch/$s_!DFZK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ab4e67a-4962-4282-978d-f234faad7a3f_669x373.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!DFZK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ab4e67a-4962-4282-978d-f234faad7a3f_669x373.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p>Modern CPUs have extra hardware to support SIMD (single instruction, multiple data) instructions. This lets a single CPU instruction operate in parallel on several data elements packed into one wide register.</p><p>I&#8217;m using the following compiler flags <em>&#8220;-O3 -ffast-math -mavx2&#8221;</em> so that the compiler generates vectorized code. Again, feel free to play with the <a href="https://godbolt.org/z/4nbv45jzh">Compiler Explorer</a> code.</p><blockquote><p><em>I added a promise to the compiler that the data will be aligned to 32 bytes in memory. 
Some vectorized instructions require memory alignment or have a faster aligned variant.</em></p></blockquote><p>I&#8217;ve again annotated the assembly.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!v9Gx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F086f3260-222e-4ff0-8fa5-e4edeeb6695b_4479x2312.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!v9Gx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F086f3260-222e-4ff0-8fa5-e4edeeb6695b_4479x2312.png 424w, https://substackcdn.com/image/fetch/$s_!v9Gx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F086f3260-222e-4ff0-8fa5-e4edeeb6695b_4479x2312.png 848w, https://substackcdn.com/image/fetch/$s_!v9Gx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F086f3260-222e-4ff0-8fa5-e4edeeb6695b_4479x2312.png 1272w, https://substackcdn.com/image/fetch/$s_!v9Gx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F086f3260-222e-4ff0-8fa5-e4edeeb6695b_4479x2312.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!v9Gx!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F086f3260-222e-4ff0-8fa5-e4edeeb6695b_4479x2312.png" width="1278" height="660.065934065934" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/086f3260-222e-4ff0-8fa5-e4edeeb6695b_4479x2312.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:752,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1278,&quot;bytes&quot;:2981464,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!v9Gx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F086f3260-222e-4ff0-8fa5-e4edeeb6695b_4479x2312.png 424w, https://substackcdn.com/image/fetch/$s_!v9Gx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F086f3260-222e-4ff0-8fa5-e4edeeb6695b_4479x2312.png 848w, https://substackcdn.com/image/fetch/$s_!v9Gx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F086f3260-222e-4ff0-8fa5-e4edeeb6695b_4479x2312.png 1272w, https://substackcdn.com/image/fetch/$s_!v9Gx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F086f3260-222e-4ff0-8fa5-e4edeeb6695b_4479x2312.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Assembly with SIMD instructions</figcaption></figure></div><p>The general sequence is very similar to before, but now the compiler uses AVX2 SIMD instructions. Availability of instructions depends on your CPU. You can read <code>/proc/cpuinfo</code> or run <code>cpuid</code> to see what your CPU supports. </p><p>Each AVX2 register (<em>ymm</em>) can fit 256 bits, i.e. four doubles, so each <code>vmulpd</code>, <code>vaddpd</code>, and <code>vmovupd</code> processes 4 multiplications, additions, and stores in parallel.<br><br>As you can see, the compiler again did some unrolling. This time with a factor of 4, so each inner loop execution produces 16 writes to C. 
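</p><p>To make the register width concrete, here is a minimal sketch of one 4-wide step written with AVX intrinsics from <code>&lt;immintrin.h&gt;</code>. It assumes the inner loop has the form <code>c[i] += a[i] * b</code>; the helper name <code>scaled_add</code> is hypothetical and not taken from the benchmarked code. When the file is not compiled with AVX enabled, it falls back to a scalar loop with the same 4-wide structure.</p><pre><code class="language-cpp">#include &lt;cassert&gt;
#include &lt;cstddef&gt;
#if defined(__AVX__)
#include &lt;immintrin.h&gt;
#endif

// Hypothetical helper (an illustration, not the article's code):
// computes c[i] += a[i] * b for i in [0, n).
// n must be a multiple of 4 so that every step fills a whole 256-bit register.
void scaled_add(double* c, const double* a, double b, std::size_t n) {
    assert(n % 4 == 0);
#if defined(__AVX__)
    const __m256d vb = _mm256_set1_pd(b);      // broadcast b into all 4 lanes
    for (std::size_t i = 0; i &lt; n; i += 4) {
        __m256d va = _mm256_loadu_pd(a + i);   // load 4 doubles of a
        __m256d vc = _mm256_loadu_pd(c + i);   // load 4 doubles of c
        // 4 multiplications and 4 additions (typically vmulpd + vaddpd)
        vc = _mm256_add_pd(vc, _mm256_mul_pd(va, vb));
        _mm256_storeu_pd(c + i, vc);           // store 4 results (vmovupd)
    }
#else
    // Scalar fallback with the same 4-wide structure, for builds without AVX.
    for (std::size_t i = 0; i &lt; n; i += 4)
        for (std::size_t lane = 0; lane &lt; 4; ++lane)
            c[i + lane] += a[i + lane] * b;
#endif
}
</code></pre><p>Calling <code>scaled_add(c, a, 2.0, 8)</code> updates eight doubles in two 4-wide steps. Unrolling that loop body four times, as the compiler did above, gives the 16 writes per iteration. 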
</p><p>The following diagram illustrates the computational dependencies.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7pgP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7bd4659-2e75-4dee-b884-69765c571e7d_1745x2740.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7pgP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7bd4659-2e75-4dee-b884-69765c571e7d_1745x2740.png 424w, https://substackcdn.com/image/fetch/$s_!7pgP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7bd4659-2e75-4dee-b884-69765c571e7d_1745x2740.png 848w, https://substackcdn.com/image/fetch/$s_!7pgP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7bd4659-2e75-4dee-b884-69765c571e7d_1745x2740.png 1272w, https://substackcdn.com/image/fetch/$s_!7pgP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7bd4659-2e75-4dee-b884-69765c571e7d_1745x2740.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7pgP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7bd4659-2e75-4dee-b884-69765c571e7d_1745x2740.png" width="519" height="814.8585164835165" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b7bd4659-2e75-4dee-b884-69765c571e7d_1745x2740.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2286,&quot;width&quot;:1456,&quot;resizeWidth&quot;:519,&quot;bytes&quot;:589974,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7pgP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7bd4659-2e75-4dee-b884-69765c571e7d_1745x2740.png 424w, https://substackcdn.com/image/fetch/$s_!7pgP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7bd4659-2e75-4dee-b884-69765c571e7d_1745x2740.png 848w, https://substackcdn.com/image/fetch/$s_!7pgP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7bd4659-2e75-4dee-b884-69765c571e7d_1745x2740.png 1272w, https://substackcdn.com/image/fetch/$s_!7pgP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7bd4659-2e75-4dee-b884-69765c571e7d_1745x2740.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Illustration of computational graph with SIMD instructions</figcaption></figure></div><blockquote><p><em>You can try to enable the fused multiply-add option on compiler explorer (-mfma flag) to see that the </em><code>vmulpd</code><em> and </em><code>vaddpd</code><em> instructions get merged into a single fused instruction.</em></p></blockquote><p>Given the 4x increase in parallelism, one would hope for a 4x speedup. Unfortunately, I&#8217;m getting a more modest 1.5x improvement.</p><blockquote><p><em>An important corollary of the AVX2 register size is that we can trade off numerical precision for higher parallelism. This extends to GPU-based ML training/inference, where half-precision fp16 or even fp8 floats are often used.</em></p></blockquote><p><br>Just to briefly recap, we are currently sitting at ~10% of bli_dgemm&#8217;s single-threaded performance. Now might be a good time to make a second coffee and stretch a bit before proceeding. 
</p><p></p><div><hr></div><p></p><h2>Cache utilization</h2><p>In this part we&#8217;ll cover a really cool optimization that&#8217;s unfortunately super unintuitive at first, at least to me.</p><p>Let&#8217;s inspect the access pattern again. I encourage you to step through this yourself by selecting the naive order(jki) option in <a href="https://michalpitr.com/">my matmul visualization tool</a>.</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;21f9820c-0ae9-46ee-b613-a9853471e2f6&quot;,&quot;duration&quot;:null}"></div><p>Notice how the calculation scans through the entire matrix A N times? Ideally we&#8217;d load each element once, do all the computations where it&#8217;s needed, evict it, and never load it again.</p><p>Sticking with the 4x4 example: when A[0][0] is loaded, we only use it once to calculate the partial result C[0][0] += A[0][0] * B[0][0], before moving on to C[1][0] += A[1][0] * B[0][0]. </p><p>Instead of moving on to A[1][0] immediately, we&#8217;d like to use A[0][0] in as many computations as we can. It&#8217;s also used as a dependency for C[0][1], C[0][2], and C[0][3].</p><p>But if we try to reorder the computation to do that, we are somewhat back to a poor strided access pattern into C. Earlier we saw how strided access can cause elements to be evicted from cache before they are needed again when the full matrix doesn&#8217;t fit in cache. <em>Well, what if we could make it fit into cache?</em></p><p>If we can restructure the product of two large matrices into products of smaller matrices, then we can tune the small matrix size so that things fit nicely in cache!</p><h3>Tiling </h3><p>The technique I long-windedly introduced above is known as tiling. 
Let&#8217;s first see tiling in action, then I&#8217;ll show you the code, and then we&#8217;ll do some benchmarking.</p><p>I&#8217;m using tile size (or block size) of 2 and yet again, you can play with it in the <a href="http://michalpitr.com">matrix visualization tool</a>.</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;db23abdd-49c9-4a92-91cd-25bedb291606&quot;,&quot;duration&quot;:null}"></div><p>In the non-tiled version, we scanned through the entire matrix A N times. You can check that every element of A is still accessed exactly the same number of times as before. So what changed? Each tile only needs to be loaded once into cache, since we can tune it to fit.</p><p>In general, we are hoping to reduce the number of cache misses by a factor proportional to the tile size.</p><h3>More loops</h3><p>Below you can see the tiled implementation. It&#8217;s starting to get a bit ugly with the tiling loops, but try to read through it anyways. 
Note that we do exactly the same amount of work as before (aside from loop overhead), just in a different order that better utilizes the cache.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nkH5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23fca929-a1dd-4f36-a6a5-813d407d9149_3260x2792.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nkH5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23fca929-a1dd-4f36-a6a5-813d407d9149_3260x2792.png 424w, https://substackcdn.com/image/fetch/$s_!nkH5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23fca929-a1dd-4f36-a6a5-813d407d9149_3260x2792.png 848w, https://substackcdn.com/image/fetch/$s_!nkH5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23fca929-a1dd-4f36-a6a5-813d407d9149_3260x2792.png 1272w, https://substackcdn.com/image/fetch/$s_!nkH5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23fca929-a1dd-4f36-a6a5-813d407d9149_3260x2792.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nkH5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23fca929-a1dd-4f36-a6a5-813d407d9149_3260x2792.png" width="1456" height="1247" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/23fca929-a1dd-4f36-a6a5-813d407d9149_3260x2792.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1247,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2077529,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nkH5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23fca929-a1dd-4f36-a6a5-813d407d9149_3260x2792.png 424w, https://substackcdn.com/image/fetch/$s_!nkH5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23fca929-a1dd-4f36-a6a5-813d407d9149_3260x2792.png 848w, https://substackcdn.com/image/fetch/$s_!nkH5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23fca929-a1dd-4f36-a6a5-813d407d9149_3260x2792.png 1272w, https://substackcdn.com/image/fetch/$s_!nkH5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23fca929-a1dd-4f36-a6a5-813d407d9149_3260x2792.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Tiled implementation</figcaption></figure></div><p>I selected the tile_size through the best technique known to man - <em>trying a bunch of values and picking the best</em>.</p><ul><li><p>4 - 45s</p></li><li><p>8 - 16s</p></li><li><p>16 - 19s</p></li><li><p>32 - 10s</p></li><li><p>64 - 7.3s</p></li><li><p><strong>128 - 6.6s</strong></p></li><li><p>256 - 9s </p></li></ul><p>We are finally getting to pretty small numbers where sub-second precision is starting to matter. However, there can be decent run-to-run variance. This variance is primarily caused by other processes on my system polluting the CPU cache, process context switches, etc.</p><p>For the different tile sizes above, I ran the measurement a few times for each setting and picked the minimum duration for each. 
This seems counterintuitive, but when noise is the main source of variance, the fastest run is the one with the least noise.</p><p>6.6 seconds is a ~3x improvement, putting us at ~30% of bli_dgemm performance.</p><h3>Tile size analysis</h3><p>Can we explain why the 128 x 128 tile size performs the best on my system? </p><p>When I measure the L3 and L2 cache misses for N=4096 matrices, it turns out that tiled_dgemm performs significantly worse on both counts than our previous implementation. L3 misses are at ~2B compared to ~500M for naive_ordered_dgemm.</p><p>The number of L3 reads is lower by about 10B, but I&#8217;m not sure that fully explains the better performance. Suffice it to say I was pretty confused at this point. Inspecting the newly generated assembly shows <a href="https://godbolt.org/z/zT6K4aWrx">very aggressive unrolling</a> in the tiling block. My best guess is that despite the large cache miss count, the CPU is able to hide read latencies by overlapping reads and computation thanks to the aggressive unrolling.</p><p>Still, the fact that the number of cache misses went up is surprising to me. Let&#8217;s explore that next.</p><p></p><div><hr></div><p></p><h2>Cache hardware</h2><p>Until now we were assuming a fully associative cache - one where a cache line can be stored anywhere. Being able to store a cache line anywhere sounds convenient, but the flip side is that to check if a given cache line is stored, all entries need to be checked.</p><p>Instead, caches are commonly partitioned into sets of <em>k</em> cache lines. 
Then, to look up a cache line, it&#8217;s sufficient to search through the k lines of the set that its address maps to.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9Bg8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F575f94d3-408c-4a9f-b09b-919b575ec348_1501x1684.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9Bg8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F575f94d3-408c-4a9f-b09b-919b575ec348_1501x1684.png 424w, https://substackcdn.com/image/fetch/$s_!9Bg8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F575f94d3-408c-4a9f-b09b-919b575ec348_1501x1684.png 848w, https://substackcdn.com/image/fetch/$s_!9Bg8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F575f94d3-408c-4a9f-b09b-919b575ec348_1501x1684.png 1272w, https://substackcdn.com/image/fetch/$s_!9Bg8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F575f94d3-408c-4a9f-b09b-919b575ec348_1501x1684.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9Bg8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F575f94d3-408c-4a9f-b09b-919b575ec348_1501x1684.png" width="473" height="530.8255494505495" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/575f94d3-408c-4a9f-b09b-919b575ec348_1501x1684.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1634,&quot;width&quot;:1456,&quot;resizeWidth&quot;:473,&quot;bytes&quot;:900558,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9Bg8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F575f94d3-408c-4a9f-b09b-919b575ec348_1501x1684.png 424w, https://substackcdn.com/image/fetch/$s_!9Bg8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F575f94d3-408c-4a9f-b09b-919b575ec348_1501x1684.png 848w, https://substackcdn.com/image/fetch/$s_!9Bg8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F575f94d3-408c-4a9f-b09b-919b575ec348_1501x1684.png 1272w, https://substackcdn.com/image/fetch/$s_!9Bg8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F575f94d3-408c-4a9f-b09b-919b575ec348_1501x1684.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Cache line mapping to cache sets</figcaption></figure></div><p>Let&#8217;s see how given an address, we determine where in cache it can reside. 
I&#8217;m basing this on my L3 cache specs:  k=16-way associative cache, B=64 byte cache line, M=16 MiB cache size, and w=64 bit words.<br><br>Address resolves to a cache line using the following mapping:</p><ul><li><p>offset -  lg(B=64) = least significant 6 bits</p></li><li><p>set - lg(M/kB) = lg(16MiB/(16*64)) = 14 bits</p></li><li><p>tag - 64 - 14 - 6 = 44 bits </p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!H2W2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ea28cb-b479-4934-85cd-24619a01e0c2_1857x191.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!H2W2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ea28cb-b479-4934-85cd-24619a01e0c2_1857x191.png 424w, https://substackcdn.com/image/fetch/$s_!H2W2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ea28cb-b479-4934-85cd-24619a01e0c2_1857x191.png 848w, https://substackcdn.com/image/fetch/$s_!H2W2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ea28cb-b479-4934-85cd-24619a01e0c2_1857x191.png 1272w, https://substackcdn.com/image/fetch/$s_!H2W2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ea28cb-b479-4934-85cd-24619a01e0c2_1857x191.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!H2W2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ea28cb-b479-4934-85cd-24619a01e0c2_1857x191.png" width="1456" height="150" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/72ea28cb-b479-4934-85cd-24619a01e0c2_1857x191.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:150,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:179649,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!H2W2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ea28cb-b479-4934-85cd-24619a01e0c2_1857x191.png 424w, https://substackcdn.com/image/fetch/$s_!H2W2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ea28cb-b479-4934-85cd-24619a01e0c2_1857x191.png 848w, https://substackcdn.com/image/fetch/$s_!H2W2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ea28cb-b479-4934-85cd-24619a01e0c2_1857x191.png 1272w, https://substackcdn.com/image/fetch/$s_!H2W2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ea28cb-b479-4934-85cd-24619a01e0c2_1857x191.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">How the parts of a memory address map to a cache set</figcaption></figure></div><p>Note that the offset contains the least-significant bits. 
This intuitively makes sense - we want addresses in close proximity to belong to the same cache line.</p><p>So, how does this relate to the increased L3 misses we saw earlier?</p><p>Let&#8217;s focus on what happens when we load a single tile into cache. Suppose that element tile[0][0] is at address <code>x</code>, and that <code>x</code> is aligned to a cache line. Then the element tile[1][0] is at address <code>x+8</code>, tile[2][0] at <code>x+16</code>, etc. 8 consecutive elements will belong to the same cache line. tile[8][0] will have address <code>x+64</code>, which flips the 7th bit (the lowest bit of the set index), so it belongs to a different set. The last element of the first tile column will have address <code>x+127*8</code>, which means that the first column spans 128*8 / 64 = 16 lines in 16 different cache sets.</p><p>When we wrap around to the next column, tile[0][1] has address <code>x+4096*8</code> or <code>x+2^15</code>. Recall that we are dealing with 4096 x 4096 matrices. In general, the address of tile[0][i] will be <code>x + N*sizeof(double)*i</code>, where N is the matrix dimension.</p><p>Notice how for my L3 cache, only the bottom 20 bits are used to map an address to the cache. When two addresses differ only in the top 44 bits, those two addresses will map to the same cache set. Given the earlier formula, we can calculate which columns will start mapping to the same cache set by solving for <code>i</code>.</p><p>x + 2^20 = x + 4096*8*i   &#8594; i = 2^20 / 2^15 = 2^5 = 32.</p><p>So we know that tile[0][0], tile[0][32], tile[0][64], and tile[0][96] all map to the same cache set. Below is a more dramatic visualization. 
Colored blocks within the tile depict cache lines that map to the same cache sets.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_2Qc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa465467e-eec7-41e1-abf0-833a735a6c71_1300x736.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_2Qc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa465467e-eec7-41e1-abf0-833a735a6c71_1300x736.png 424w, https://substackcdn.com/image/fetch/$s_!_2Qc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa465467e-eec7-41e1-abf0-833a735a6c71_1300x736.png 848w, https://substackcdn.com/image/fetch/$s_!_2Qc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa465467e-eec7-41e1-abf0-833a735a6c71_1300x736.png 1272w, https://substackcdn.com/image/fetch/$s_!_2Qc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa465467e-eec7-41e1-abf0-833a735a6c71_1300x736.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_2Qc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa465467e-eec7-41e1-abf0-833a735a6c71_1300x736.png" width="1300" height="736" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a465467e-eec7-41e1-abf0-833a735a6c71_1300x736.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:736,&quot;width&quot;:1300,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:512061,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_2Qc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa465467e-eec7-41e1-abf0-833a735a6c71_1300x736.png 424w, https://substackcdn.com/image/fetch/$s_!_2Qc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa465467e-eec7-41e1-abf0-833a735a6c71_1300x736.png 848w, https://substackcdn.com/image/fetch/$s_!_2Qc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa465467e-eec7-41e1-abf0-833a735a6c71_1300x736.png 1272w, https://substackcdn.com/image/fetch/$s_!_2Qc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa465467e-eec7-41e1-abf0-833a735a6c71_1300x736.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Cache set conflicts</figcaption></figure></div><p>Is this really an issue? After all, we have 16 cache lines per set.</p><p>Well, that&#8217;s true, but we also have 2 other matrices and there are other processes running on my system. Creating unnecessary pressure on a given cache set increases the likelihood of so-called conflict misses. These happen when a line is evicted because its cache set is full, even if other sets still have room.</p><p>This issue would be even more pronounced in L2 and L1 caches, which are smaller and typically have lower associativity.</p><h3>Packing</h3><p>If we lay out a 128 x 128 tile of doubles sequentially in memory, the difference between the first and last address is 2^7*2^7*2^3 = 2^17. With this we don&#8217;t run into the same cache set conflicts and the tile is uniformly distributed across different cache sets.</p><p>This suggests an optimization strategy - copy tiles into arrays and do the tile-tile matrix multiplication over the arrays. 
Our hope is that the better cache hit rate will compensate for the extra copying work.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hS1z!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5bfa321-6fc4-41e4-a0fd-b1e31b90f42f_3296x5220.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hS1z!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5bfa321-6fc4-41e4-a0fd-b1e31b90f42f_3296x5220.png 424w, https://substackcdn.com/image/fetch/$s_!hS1z!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5bfa321-6fc4-41e4-a0fd-b1e31b90f42f_3296x5220.png 848w, https://substackcdn.com/image/fetch/$s_!hS1z!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5bfa321-6fc4-41e4-a0fd-b1e31b90f42f_3296x5220.png 1272w, https://substackcdn.com/image/fetch/$s_!hS1z!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5bfa321-6fc4-41e4-a0fd-b1e31b90f42f_3296x5220.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hS1z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5bfa321-6fc4-41e4-a0fd-b1e31b90f42f_3296x5220.png" width="1456" height="2306" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d5bfa321-6fc4-41e4-a0fd-b1e31b90f42f_3296x5220.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2306,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3858168,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hS1z!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5bfa321-6fc4-41e4-a0fd-b1e31b90f42f_3296x5220.png 424w, https://substackcdn.com/image/fetch/$s_!hS1z!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5bfa321-6fc4-41e4-a0fd-b1e31b90f42f_3296x5220.png 848w, https://substackcdn.com/image/fetch/$s_!hS1z!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5bfa321-6fc4-41e4-a0fd-b1e31b90f42f_3296x5220.png 1272w, https://substackcdn.com/image/fetch/$s_!hS1z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5bfa321-6fc4-41e4-a0fd-b1e31b90f42f_3296x5220.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Matrix multiplication with tiling, packing, and inner loop extracted</figcaption></figure></div><p>Instead of directly operating on matrices, we now copy a tile of A and a tile of B into helper arrays and operate on those. You can see the code and generated assembly at <a href="https://godbolt.org/z/brWYEcnev">Compiler Explorer</a>. I reran experiments for different tile sizes and 64 turned out to be the optimal size.</p><p>I also extracted the inner loops into a separate function, which appears to help the compiler generate more optimized assembly on my hardware. I suspect this has to do with better register allocation. The extraction alone improves runtime from the previous ~6.6s to ~6.2s. Note that the function gets inlined so there&#8217;s no call overhead.</p><p>So how much does this improve performance?</p><p><em>*drum roll*</em> </p><p>We are down to <strong>~4.4s</strong> or around 60% of bli_dgemm&#8217;s performance. 
Relative to the naive implementation, the current solution is over 100x faster.</p><p>As a final sanity check, L3 cache misses did go down by about 4x.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qGiS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e824206-ca20-4eb8-b85e-f4b85c2ef4db_3680x1876.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qGiS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e824206-ca20-4eb8-b85e-f4b85c2ef4db_3680x1876.png 424w, https://substackcdn.com/image/fetch/$s_!qGiS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e824206-ca20-4eb8-b85e-f4b85c2ef4db_3680x1876.png 848w, https://substackcdn.com/image/fetch/$s_!qGiS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e824206-ca20-4eb8-b85e-f4b85c2ef4db_3680x1876.png 1272w, https://substackcdn.com/image/fetch/$s_!qGiS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e824206-ca20-4eb8-b85e-f4b85c2ef4db_3680x1876.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qGiS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e824206-ca20-4eb8-b85e-f4b85c2ef4db_3680x1876.png" width="1456" height="742" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2e824206-ca20-4eb8-b85e-f4b85c2ef4db_3680x1876.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:742,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1773766,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qGiS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e824206-ca20-4eb8-b85e-f4b85c2ef4db_3680x1876.png 424w, https://substackcdn.com/image/fetch/$s_!qGiS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e824206-ca20-4eb8-b85e-f4b85c2ef4db_3680x1876.png 848w, https://substackcdn.com/image/fetch/$s_!qGiS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e824206-ca20-4eb8-b85e-f4b85c2ef4db_3680x1876.png 1272w, https://substackcdn.com/image/fetch/$s_!qGiS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e824206-ca20-4eb8-b85e-f4b85c2ef4db_3680x1876.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>bli_dgemm still runs circles around my implementation when it comes to L2 cache misses and instructions per cycle. That being said, I&#8217;m happy with getting within 2x of bli_dgemm.</p><blockquote><p><em>The comparison isn&#8217;t completely fair. bli_dgemm supports arbitrary matrix sizes, while I&#8217;m only focusing on large, square, power-of-two sized matrices. 
If I supported arbitrary matrix sizes, I&#8217;d need to add extra instructions for boundary checks.</em></p></blockquote><p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Bvug!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa44bac0-7513-44f8-a511-88500823ae72_3680x1876.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Bvug!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa44bac0-7513-44f8-a511-88500823ae72_3680x1876.png 424w, https://substackcdn.com/image/fetch/$s_!Bvug!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa44bac0-7513-44f8-a511-88500823ae72_3680x1876.png 848w, https://substackcdn.com/image/fetch/$s_!Bvug!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa44bac0-7513-44f8-a511-88500823ae72_3680x1876.png 1272w, https://substackcdn.com/image/fetch/$s_!Bvug!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa44bac0-7513-44f8-a511-88500823ae72_3680x1876.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Bvug!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa44bac0-7513-44f8-a511-88500823ae72_3680x1876.png" width="1456" height="742" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aa44bac0-7513-44f8-a511-88500823ae72_3680x1876.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:742,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1752070,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Bvug!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa44bac0-7513-44f8-a511-88500823ae72_3680x1876.png 424w, https://substackcdn.com/image/fetch/$s_!Bvug!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa44bac0-7513-44f8-a511-88500823ae72_3680x1876.png 848w, https://substackcdn.com/image/fetch/$s_!Bvug!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa44bac0-7513-44f8-a511-88500823ae72_3680x1876.png 1272w, https://substackcdn.com/image/fetch/$s_!Bvug!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa44bac0-7513-44f8-a511-88500823ae72_3680x1876.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p></p><h2>Next steps?</h2><p>There are many optimizations we haven&#8217;t attempted yet. For instance, we could add another level of tiling for L2 cache and even for L1 cache. We could also experiment with directly using compiler intrinsics for SIMD instructions. Memory prefetching instructions are also worth experimenting with. There are also recursive algorithms that split matrices into smaller ones instead of using tiling.</p><p>While exciting, this article has gotten a lot longer than I expected, so I&#8217;ll leave those optimizations for a potential follow-up article.</p><p>If you find this domain interesting and would like to learn more, I recommend these resources:</p><p><a href="https://ocw.mit.edu/courses/6-172-performance-engineering-of-software-systems-fall-2018/pages/syllabus/">MIT&#8217;s 6.172</a> on OCW - a fantastic introduction to performance engineering. 
I&#8217;ve been auditing it while writing this article.</p><p><a href="https://siboehm.com/articles/22/Fast-MMM-on-CPU">Simon&#8217;s matrix multiplication</a> article - Simon is a performance engineer at Anthropic. His article covers most of the optimizations I covered here. I like Simon&#8217;s succinct style and gorgeous illustrations.<br><br>OpenBLAS <a href="https://github.com/OpenMathLib/OpenBLAS">repository</a> on GitHub - highly optimized open-source kernels.</p><p></p><div><hr></div><p>I hope I conveyed not just what optimizations exist but also how one might go about discovering them. I find the latter a lot more satisfying.</p><p>If you enjoyed this article, consider subscribing. 
I&#8217;m also somewhat active on <a href="https://www.linkedin.com/in/michal-pitr/">LinkedIn</a>, so consider connecting with me there.</p>]]></content:encoded></item><item><title><![CDATA[Linux container from scratch]]></title><description><![CDATA[Let's 
build a minimal container step-by-step in a terminal]]></description><link>https://michalpitr.substack.com/p/linux-container-from-scratch</link><guid isPermaLink="false">https://michalpitr.substack.com/p/linux-container-from-scratch</guid><dc:creator><![CDATA[Michal Pitr]]></dc:creator><pubDate>Sat, 07 Dec 2024 17:59:48 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/83b1135a-5a37-4906-867e-d524af8aae2b_1792x1024.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I recently built a Docker clone from scratch in Go. This made me wonder - how hard would it be to do the same step-by-step in a terminal? Let&#8217;s find out!</p><h2>Safety warning</h2><p><em>If you do decide to follow along, I&#8217;d highly recommend <strong>setting up a Linux virtual machine</strong>. I&#8217;ll be running a bunch of privileged commands and I would like to avoid unintentionally nuking my readers&#8217; systems.</em></p><p>With the warning out of the way, let&#8217;s get into it!</p><h2>Container filesystem</h2><p>I&#8217;ll keep this section brief. For a deeper explanation of container filesystems, especially overlayFS, check out my <a href="https://michalpitr.substack.com/p/primer-on-linux-container-filesystems">previous post</a>. In essence, we create a directory structure for our container, download the Alpine minirootfs, and mount it with overlayFS.</p><pre><code># create folder structure in a temporary directory
mkdir -p /tmp/container-1/{lower,upper,work,merged}

cd /tmp/container-1

# download alpine minirootfs
wget https://dl-cdn.alpinelinux.org/alpine/v3.20/releases/x86_64/alpine-minirootfs-3.20.3-x86_64.tar.gz

tar -xzf alpine-minirootfs-3.20.3-x86_64.tar.gz -C lower

# mount overlayFS, our container root will be in /tmp/container-1/merged
sudo mount -t overlay overlay -o lowerdir=lower,upperdir=upper,workdir=work merged</code></pre><p>After we run this, the directory should look like this:</p><pre><code>michal@michal-lg:/tmp/container-1$ ls
alpine-minirootfs-3.20.3-x86_64.tar.gz  lower  merged  upper  work</code></pre><p>The container itself will use /tmp/container-1/merged as the root of its filesystem:</p><pre><code>michal@michal-lg:/tmp/container-1/merged$ ls
bin  etc   lib    mnt  proc  run   srv  tmp  var
dev  home  media  opt  root  sbin  sys  usr</code></pre><h2>Control groups</h2><p>Let&#8217;s restrict the resource consumption of this container to, say, 100m CPU (a tenth of one core) and 500 MiB of memory.</p><p>Setting up cgroups is super easy:</p><pre><code># make a new cgroup slice and a child cgroup for our container
sudo mkdir -p /sys/fs/cgroup/toydocker.slice/container-1

cd /sys/fs/cgroup/toydocker.slice/

# enable modifying cpu and memory for the child cgroup
sudo -- sh -c 'echo "+memory +cpu" &gt; cgroup.subtree_control'

cd container-1

# set max cpu usage to 10%
sudo -- sh -c 'echo "10000 100000" &gt; cpu.max'

# set memory limit to 500 MiB
sudo -- sh -c 'echo "500M" &gt; memory.max'

# Disable swap
sudo -- sh -c 'echo "0" &gt; memory.swap.max'</code></pre><p>The <code>cpu.max</code> syntax is a bit unusual: the first value is the quota and the second is the period, both in microseconds. Here, out of every 100 000 microsecond period, this cgroup can consume 10 000 microseconds of CPU time. If we instead wanted to limit the cgroup to 2 CPUs, it would be 200 000 out of 100 000.</p><p>Interestingly, the <code>cpu.max</code> rule doesn&#8217;t restrict the process to a single physical core. So on a 4-core machine, it&#8217;s fine if a process uses 2500 microseconds on each of cores 0, 1, 2, and 3, since the total is 10 000. To limit which physical cores a process may use, <a href="https://man7.org/linux/man-pages/man7/cpuset.7.html">cpusets</a> can be used.</p><p>When we created the cgroup directory, the kernel automatically populated it with interface files holding the default settings.</p><pre><code>michal@michal-lg:/sys/fs/cgroup/toydocker.slice/container-1$ ls
cgroup.controllers      cpu.pressure         memory.numa_stat
cgroup.events           cpu.stat             memory.oom.group
cgroup.freeze           cpu.stat.local       memory.peak
cgroup.kill             cpu.uclamp.max       memory.pressure
cgroup.max.depth        cpu.uclamp.min       memory.reclaim
cgroup.max.descendants  cpu.weight           memory.stat
cgroup.pressure         cpu.weight.nice      memory.swap.current
cgroup.procs            io.pressure          memory.swap.events
cgroup.stat             memory.current       memory.swap.high
cgroup.subtree_control  memory.events        memory.swap.max
cgroup.threads          memory.events.local  memory.swap.peak
cgroup.type             memory.high          memory.zswap.current
cpu.idle                memory.low           memory.zswap.max
cpu.max                 memory.max           memory.zswap.writeback
cpu.max.burst           memory.min</code></pre><p>Let&#8217;s check that the ones we modified took effect:</p><pre><code>michal@michal-lg:/sys/fs/cgroup/toydocker.slice/container-1$ cat cpu.max
10000 100000
michal@michal-lg:/sys/fs/cgroup/toydocker.slice/container-1$ cat memory.max
524288000
michal@michal-lg:/sys/fs/cgroup/toydocker.slice/container-1$ cat memory.swap.max
0</code></pre><p>Looks good. Next, let&#8217;s see how we can put a process into the cgroup and further isolate it via namespaces.</p><h2>Namespaces</h2><p>Let&#8217;s first answer the <strong>why</strong> of namespaces, then we can see how they are used.</p><p>If cgroups are the main mechanism for restricting resource usage, namespaces are the main mechanism for isolating the resources themselves.</p><p>Let&#8217;s take filesystem mounts as an example. When we mount a new filesystem on the host, it&#8217;s visible to all processes. We need to be aware of what other mounts exist on the system to avoid clashes. With a mount namespace, a process can mount and unmount filesystems as it wishes without affecting any process outside of this namespace.</p><p>The same idea extends to other resources: networking, inter-process communication, process IDs, users, etc.</p><p>With motivation out of the way, let&#8217;s see it in action.</p><pre><code># enter interactive root
sudo -i

# Add current process to cgroup
echo $$ &gt; /sys/fs/cgroup/toydocker.slice/container-1/cgroup.procs

# Create new namespaces
unshare \
    --uts \
    --pid \
    --mount \
    --mount-proc \
    --net \
    --ipc \
    --cgroup \
    --fork \
    /bin/bash</code></pre><p>This piece of code is a little arcane, mostly because I want to keep everything in a single terminal. </p><p>First, we enter an interactive root shell. This is because I need to run the next two commands from the same shell and both with root privileges.</p><pre><code>michal@michal-lg:~$ # enter interactive root
sudo -i
[sudo] password for michal: 
root@michal-lg:~# </code></pre><p>The second command adds the current shell process to the cgroup we created earlier. Any children of this process will also automatically be added to the cgroup.</p><pre><code>root@michal-lg:~# echo $$
28156
root@michal-lg:~# echo $$ &gt; /sys/fs/cgroup/toydocker.slice/container-1/cgroup.procs</code></pre><p>Once we do this, the current shell is part of the cgroup, and all the CPU and memory restrictions we set up earlier apply to it.</p><p>Next, <code>unshare</code> creates the new namespaces, forks, and runs a bash shell inside them. You can learn more about the <a href="https://man7.org/linux/man-pages/man1/unshare.1.html">unshare command in the man pages</a>.</p><pre><code>root@michal-lg:~# unshare \
    --uts \
    --pid \
    --mount \
    --mount-proc \
    --net \
    --ipc \
    --cgroup \
    --fork \
    /bin/bash
root@michal-lg:~# </code></pre><p>This looks unremarkable, but we have essentially created a container through cgroup and namespace isolation. Let&#8217;s test that the UTS namespace is working correctly by changing the hostname and seeing that it doesn&#8217;t change on the host.</p><p>Container terminal:</p><pre><code>root@michal-lg:~# hostname
michal-lg
root@michal-lg:~# hostname mycontainer
root@michal-lg:~# hostname
mycontainer
root@michal-lg:~# </code></pre><p>Host terminal:</p><pre><code>michal@michal-lg:~$ hostname
michal-lg</code></pre><p>Since we also created a <code>pid</code> namespace, <code>/bin/bash</code> should now have a process ID of 1. Let&#8217;s verify from the container:</p><pre><code>root@michal-lg:~# ps
    PID TTY          TIME CMD
      1 pts/1    00:00:00 bash
     32 pts/1    00:00:00 ps</code></pre><p>And let&#8217;s see what the real process id is from the host&#8217;s perspective.</p><pre><code>michal@michal-lg:~$ ps -ef | grep -i /bin/bash
root        8952    8932  0 16:10 pts/1    00:00:00 unshare --uts --pid --mount --mount-proc --net --ipc --cgroup --fork /bin/bash
<strong>root        8953    8952  0 16:10 pts/1    00:00:00 /bin/bash</strong></code></pre><p>There are some setup steps that a container runtime would do at this point before launching the user&#8217;s application. Let&#8217;s go through those next.</p><h2>Container-side setup</h2><p>First, we isolate the container from the host filesystem by changing the root using the <a href="https://man7.org/linux/man-pages/man2/pivot_root.2.html">pivot_root</a> command.</p><p><code>pivot_root</code> is a safer alternative to <code>chroot /tmp/container-1/merged</code>, and container runtimes use it to avoid breakout exploits. Security is not my expertise, so I&#8217;ll link to <a href="https://tbhaxor.com/pivot-root-vs-chroot-for-containers/">this article</a> explaining how these exploits work and how <code>pivot_root</code> prevents them.</p><pre><code>root@michal-lg:~# cd /tmp/container-1/merged
mount --make-rprivate /
mkdir old_root
pivot_root . old_root
umount -l /old_root
rm -rf /old_root
root@michal-lg:/tmp/container-1/merged# </code></pre><p>Making the root mount private stops mount changes made inside the container from propagating to the host&#8217;s mount table, which could otherwise be used for exploits.</p><p>In my terminal, I need to run <code>cd ..</code> to refresh the shell&#8217;s state after we deleted the old root. Since the old root is gone, command paths that resolved on the host, such as <code>/usr/bin/ls</code>, no longer exist.</p><p>But since the new root is the former <code>/tmp/container-1/merged</code> directory, and this filesystem is based on the Alpine minirootfs, we have basic utilities in the <code>bin</code> directory.</p><pre><code>root@michal-lg:/tmp/container-1/merged# cd ..
root@michal-lg:/# ls
bash: /usr/bin/ls: No such file or directory
root@michal-lg:/# /bin/ls
bin    dev    etc    home   lib    media  mnt    opt    proc   root   run    sbin   srv    sys    tmp    usr    var</code></pre><p>Let&#8217;s also set up the basic devices we&#8217;ll need later and mount some useful virtual filesystems:</p><pre><code>mknod -m 666 dev/null c 1 3
mknod -m 666 dev/zero c 1 5
mknod -m 666 dev/tty c 5 0</code></pre><pre><code>/bin/mkdir -p dev/{pts,shm}
/bin/mount -t devpts devpts dev/pts
/bin/mount -t tmpfs tmpfs dev/shm
/bin/mount -t sysfs sysfs sys/
/bin/mount -t tmpfs tmpfs run/
/bin/mount -t proc proc proc/</code></pre><p>For instance, if we didn&#8217;t mount <code>proc</code>, we wouldn&#8217;t have access to process information, and commands that depend on reading it would fail:</p><pre><code>root@michal-lg:/# top
top: no process info in /proc</code></pre><p>After the mount, things work correctly again.</p><pre><code>Mem: 7560280K used, 8661696K free, 161756K shrd, 135464K buff, 2364264K cached
CPU:   0% usr   0% sys   0% nic  98% idle   0% io   0% irq   0% sirq
Load average: 0.30 0.38 0.37 1/1233 64
  PID  PPID USER     STAT   VSZ %VSZ CPU %CPU COMMAND
    1     0 root     S    12896   0%   6   0% /bin/bash
   64     1 root     R     1624   0%   9   0% top</code></pre><p>At this point, we could configure networking, export environment variables, etc. For our minimal purposes, we are done and it&#8217;s time to launch the user&#8217;s application!</p><p>Let&#8217;s suppose the user wanted to run a simple interactive shell. We can launch it like this:</p><pre><code>exec /bin/busybox sh</code></pre><p>I use busybox since it provides a minimal shell and ships in the Alpine minirootfs. Using <code>exec</code> replaces the bash process with the new one, since we don&#8217;t need to keep the old shell around.</p><pre><code>root@michal-lg:/# exec /bin/busybox sh
/ # ls
bin    dev    etc    home   lib    media  mnt    opt    proc   root   run    sbin   srv    sys    tmp    usr    var
/ # </code></pre><p>Right now, we are roughly where we&#8217;d be if we ran the following Docker command (setting <code>--memory-swap</code> equal to <code>--memory</code> is how Docker disables swap):</p><pre><code>michal@michal-lg:~$ docker run -it --cpus="0.1" --memory="500m" --memory-swap="500m" --entrypoint /bin/sh --rm alpine
/ # ls
bin    dev    etc    home   lib    media  mnt    opt    proc   root   run    sbin   srv    sys    tmp    usr    var
/ # </code></pre><h2>Using the container</h2><p>As a final step, let&#8217;s check that the cgroup limits we set earlier are actually working.</p><p>First, I&#8217;ll run a CPU-intensive task that would normally use 100% of a single CPU core:</p><pre><code>/ # while true; do true; done</code></pre><p>Then I open a terminal on the host to see the real CPU utilization for this process. First, I find the process ID:</p><pre><code>michal@michal-lg:~$ ps -ef | grep -i busybox
root        8953    8952  0 16:10 pts/1    00:00:07 /bin/busybox sh</code></pre><p>Then I use <code>top</code> to verify that CPU usage doesn&#8217;t exceed 10%.</p><pre><code>michal@michal-lg:~$ top -p 8953
PID USER      PR  NI    VIRT    RES    SHR S  <strong>%CPU</strong>  %MEM     TIME+ COMMAND                                                                                                                           
   8953 root      20   0    1696   1024    896 R  <strong>10.0</strong>   0.0   0:15.25 busybox    </code></pre><p>Similarly, for memory, I run <code>tail</code> to keep reading from <code>/dev/zero</code>. <code>tail</code> reads into an in-memory buffer that will shortly exceed our 500 MiB memory limit, at which point the kernel&#8217;s OOM killer, triggered by the cgroup memory controller, will kill the process.</p><pre><code>/ # tail /dev/zero
Killed</code></pre><p>We can now exit the container and clean up by unmounting the root directory <code>/tmp/container-1/merged</code>:</p><pre><code>michal@michal-lg:/tmp/container-1$ sudo umount merged</code></pre><p>And that&#8217;s it! We&#8217;ve created a container from scratch in a terminal.</p><h2>Conclusion</h2><p>The main takeaway should be that containers aren&#8217;t magic. They are not virtual machines. They are an awesome feature baked into the Linux kernel for isolating processes. They achieve this isolation through cgroups and namespaces.</p><p>You can see the full list of commands on my <a href="https://github.com/MichalPitr/toy-docker/blob/main/docker_bash.md">GitHub</a>.</p><div><hr></div><p>I hope you learned something new! If you did, consider subscribing! I&#8217;m also always happy to connect with readers on <a href="https://www.linkedin.com/in/michal-pitr-a7156b127/">LinkedIn</a> and <a href="https://bsky.app/profile/mptr.bsky.social">BlueSky</a>.</p><p>If you enjoyed the post, chances are you&#8217;ll enjoy my other writing too! 
All my posts are backed by a substantial deep-dive into the given problem space.</p><p>Perhaps check out my deep-dive into SQLite storage format, my implementation of MapReduce from scratch, or my introduction to CUDA programming?</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;313eaad6-45d0-4888-beec-c9b6fa1cff04&quot;,&quot;caption&quot;:&quot;Recently I&#8217;ve been implementing a subset of SQLite (the world&#8217;s most used database, btw) from scratch in Go. I&#8217;ll share what I&#8217;ve learned about how SQLite stores data on disk which will help us understand key database concepts. Thanks for reading Michal&#8217;s Substack! Subscribe for free to receive new posts and support my work.&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;How does SQLite store data?&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:28235731,&quot;name&quot;:&quot;Michal Pitr&quot;,&quot;bio&quot;:&quot;I write deep dives into software engineering topics.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/73a6e774-038e-424d-afee-ce5041c3e7e0_1080x1080.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-03-17T16:50:48.826Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc850055-defb-4bdb-8da8-779e45b482f5_997x387.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://michalpitr.substack.com/p/how-does-sqlite-store-data&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:142692526,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:20,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;From 
Scratch&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe5ab36b-c782-420e-ac35-c7599a1f77ad_976x976.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;a7045b0e-7bc5-4aeb-becb-31a20500d530&quot;,&quot;caption&quot;:&quot;Over the last couple of weeks, I&#8217;ve been building MapReduce from scratch.&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;MapReduce from Scratch&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:28235731,&quot;name&quot;:&quot;Michal Pitr&quot;,&quot;bio&quot;:&quot;I write deep dives into software engineering topics.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/73a6e774-038e-424d-afee-ce5041c3e7e0_1080x1080.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-04-28T21:33:39.100Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1d3a7a5-e2cc-4778-bafe-03f87e6a6884_1600x671.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://michalpitr.substack.com/p/mapreduce-from-scratch&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:144104758,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:13,&quot;comment_count&quot;:4,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;From 
Scratch&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe5ab36b-c782-420e-ac35-c7599a1f77ad_976x976.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;6c48da5e-56a0-4945-8efe-0d88d3728c5b&quot;,&quot;caption&quot;:&quot;One of the great joys of software engineering is dispelling magic. I&#8217;ve written code that executed on a GPU using frameworks like PyTorch or TensorFlow, but I never understood the &#8220;how&#8221;. It&#8217;s time to dispel the magic of GPU programming and learn how it works under the hood.&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;GPU Programming&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:28235731,&quot;name&quot;:&quot;Michal Pitr&quot;,&quot;bio&quot;:&quot;I write deep dives into software engineering 
topics.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/73a6e774-038e-424d-afee-ce5041c3e7e0_1080x1080.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-05-04T19:25:04.223Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60744dcf-3477-4aa5-9cdb-f020a0ba874c_2472x888.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://michalpitr.substack.com/p/gpu-programming&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:144305968,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:6,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;From Scratch&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe5ab36b-c782-420e-ac35-c7599a1f77ad_976x976.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div>]]></content:encoded></item><item><title><![CDATA[Primer on Linux container filesystems]]></title><description><![CDATA[Building a container filesystem by hand]]></description><link>https://michalpitr.substack.com/p/primer-on-linux-container-filesystems</link><guid isPermaLink="false">https://michalpitr.substack.com/p/primer-on-linux-container-filesystems</guid><dc:creator><![CDATA[Michal Pitr]]></dc:creator><pubDate>Sat, 16 Nov 2024 16:55:24 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b7675de-c957-4720-a480-89698f8afdda_2310x842.png" length="0" 
type="image/jpeg"/><content:encoded><![CDATA[<p>I spent the weekend building a toy Docker clone. One question I had going into this project was how each container gets its own filesystem. Let&#8217;s first reverse-engineer what Docker does and then replicate it ourselves.</p><p>I&#8217;ll begin by starting a shell in a Docker container using the Alpine image.</p><pre><code>michal@michal-lg:~$ docker run -it --entrypoint /bin/sh --rm --name "alpine-container" alpine
/ # ls
bin    dev    etc    home   lib    media  mnt    opt    proc   root   run    sbin   srv    sys    tmp    usr    var
/ # hostname
a7cbf0aea1ad
/ # cd home &amp;&amp; ls</code></pre><p>We are in a separate filesystem and it&#8217;s pretty empty. Let&#8217;s make a file.</p><pre><code>/ # echo -e "Hello there\nGeneral Kenobi" &gt; /home/hello_there.txt
/ # cat /home/hello_there.txt
Hello there
General Kenobi</code></pre><p>What do you think, can we access this file from the host?<br>&#8230;</p><p>&#8230;</p><p>&#8230;<br>We can - let&#8217;s find it!<br><br>Docker stores its data under <code>/var/lib/docker</code> by default, so we can start looking there from a second terminal:</p><pre><code>root@michal-lg:/var/lib/docker# find -name hello_there.txt
./overlay2/1557145fe40a1595d090eeafa72c39a7b54cca4791ae9e3ffafabff06466125c/diff/home/hello_there.txt
./overlay2/1557145fe40a1595d090eeafa72c39a7b54cca4791ae9e3ffafabff06466125c/merged/home/hello_there.txt
root@michal-lg:/var/lib/docker# </code></pre><p>Curiously, we found the file twice, in two different directories. Let&#8217;s see what else is in the <code>diff</code> and <code>merged</code> directories.</p><pre><code>root@michal-lg:/var/lib/docker/.../diff# ls
home  root
root@michal-lg:/var/lib/docker/.../merged# ls
bin  dev  etc  home  lib  media  mnt  opt  proc  root  run  sbin  srv  sys  tmp  usr  var</code></pre><p>The <code>diff</code> directory only contains an empty <code>root</code> directory and a <code>home</code> directory with the file we created. The contents of <code>merged</code> exactly match the container&#8217;s filesystem.</p><h3>Overlayfs</h3><p>Docker uses the overlayfs filesystem. Overlayfs lets us combine two file trees, &#8220;lower&#8221; and &#8220;upper&#8221;, into a combined view, &#8220;merged&#8221;. Docker calls the &#8220;upper&#8221; file tree &#8220;diff&#8221;, which is perhaps more fitting, and I&#8217;ll refer to it as such.</p><p>The use of union filesystems like overlayfs comes from an interesting observation: we often want to run multiple containers on a single host, and chances are these containers share the same lower layer, be it Alpine, Ubuntu, or a more specialized one like golang.</p><p>By making the lower layer read-only, multiple containers can share it. Changes are only written to the upper layer. 
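</p><p>To make the lookup rule concrete, here is a minimal shell sketch of how a path is resolved in the merged view: the upper (<code>diff</code>) tree is checked first, and the lower tree is only a fallback. The <code>resolve_merged</code> helper and its arguments are hypothetical names used for illustration, not something Docker or overlayfs provides, and the sketch deliberately ignores directory merging and deletions.</p><pre><code># Hypothetical helper: print where a path in the merged view really lives.
# Simplified on purpose: no directory merging, no handling of deletions.
resolve_merged() {
    lower=$1    # read-only lower tree, e.g. the image contents
    upper=$2    # writable upper tree, which Docker calls "diff"
    path=$3     # path relative to the root of the merged view
    if [ -e "$upper/$path" ]; then
        printf '%s\n' "$upper/$path"    # the upper layer wins
    elif [ -e "$lower/$path" ]; then
        printf '%s\n' "$lower/$path"    # otherwise fall back to the lower layer
    else
        return 1                        # in neither tree, so not in the merged view
    fi
}</code></pre><p>For example, with a lower tree containing <code>bin/sh</code> and a diff tree containing <code>home/hello_there.txt</code>, <code>resolve_merged lower diff home/hello_there.txt</code> prints the path under <code>diff</code>, while <code>bin/sh</code> resolves to the lower tree.</p><p>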
Let&#8217;s illustrate what happened when I earlier created <code>hello_there.txt</code>. I&#8217;m not expanding all folders to avoid clutter.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KExE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c299658-4d92-47ed-b49a-d0438ab31efb_2309x667.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KExE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c299658-4d92-47ed-b49a-d0438ab31efb_2309x667.png 424w, https://substackcdn.com/image/fetch/$s_!KExE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c299658-4d92-47ed-b49a-d0438ab31efb_2309x667.png 848w, https://substackcdn.com/image/fetch/$s_!KExE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c299658-4d92-47ed-b49a-d0438ab31efb_2309x667.png 1272w, https://substackcdn.com/image/fetch/$s_!KExE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c299658-4d92-47ed-b49a-d0438ab31efb_2309x667.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KExE!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c299658-4d92-47ed-b49a-d0438ab31efb_2309x667.png" width="1200" height="346.97802197802196" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6c299658-4d92-47ed-b49a-d0438ab31efb_2309x667.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:421,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:133042,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KExE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c299658-4d92-47ed-b49a-d0438ab31efb_2309x667.png 424w, https://substackcdn.com/image/fetch/$s_!KExE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c299658-4d92-47ed-b49a-d0438ab31efb_2309x667.png 848w, https://substackcdn.com/image/fetch/$s_!KExE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c299658-4d92-47ed-b49a-d0438ab31efb_2309x667.png 1272w, https://substackcdn.com/image/fetch/$s_!KExE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c299658-4d92-47ed-b49a-d0438ab31efb_2309x667.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>When we then created <code>hello_there.txt</code> in <code>/home</code>, it was written to <code>diff</code>, and overlayfs presented it in the combined view in <code>merged</code>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bwX0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b7675de-c957-4720-a480-89698f8afdda_2310x842.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bwX0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b7675de-c957-4720-a480-89698f8afdda_2310x842.png 424w, 
https://substackcdn.com/image/fetch/$s_!bwX0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b7675de-c957-4720-a480-89698f8afdda_2310x842.png 848w, https://substackcdn.com/image/fetch/$s_!bwX0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b7675de-c957-4720-a480-89698f8afdda_2310x842.png 1272w, https://substackcdn.com/image/fetch/$s_!bwX0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b7675de-c957-4720-a480-89698f8afdda_2310x842.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bwX0!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b7675de-c957-4720-a480-89698f8afdda_2310x842.png" width="1200" height="437.6373626373626" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8b7675de-c957-4720-a480-89698f8afdda_2310x842.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:531,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:159847,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!bwX0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b7675de-c957-4720-a480-89698f8afdda_2310x842.png 424w, 
https://substackcdn.com/image/fetch/$s_!bwX0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b7675de-c957-4720-a480-89698f8afdda_2310x842.png 848w, https://substackcdn.com/image/fetch/$s_!bwX0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b7675de-c957-4720-a480-89698f8afdda_2310x842.png 1272w, https://substackcdn.com/image/fetch/$s_!bwX0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b7675de-c957-4720-a480-89698f8afdda_2310x842.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>What if we modify something in the lower layer? Let&#8217;s rename <code>/bin/echo</code> to <code>/bin/echo.old</code>.</p><pre><code>/ # cd bin
/bin # mv echo echo.old
/bin # ls
...
chgrp          <strong>echo.old</strong>       gzip           ln             mount          printenv       setserial      umount
...</code></pre><p>As mentioned, the lower layer is read-only so the only place where the file is actually modified is in <code>diff</code>. </p><pre><code>root@michal-lg:/var/lib/docker/overlay2/.../diff/bin# ls -l
total 0
c--------- 1 root root 0, 0 Nov 11 23:06 echo
lrwxrwxrwx 1 root root   12 Sep  6 13:34 echo.old -&gt; /bin/busybox</code></pre><p>There are two files! One for the no-longer-existing <code>echo</code> and one for <code>echo.old</code>. Overlayfs uses special <a href="https://docs.kernel.org/filesystems/overlayfs.html#whiteouts-and-opaque-directories">whiteout files</a> to record the deletion of a file: a whiteout is a character device with device number 0/0, which is exactly what the <code>c---------</code> entry with <code>0, 0</code> is. When overlayfs sees this file, it knows not to include <code>echo</code> in the merged view.<br><br>The second file is much less interesting: it&#8217;s the renamed echo, which turns out to be just a symbolic link to busybox.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Tp23!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2904c5c3-9ee9-4dc8-a58a-31c4ee0f915a_2524x904.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Tp23!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2904c5c3-9ee9-4dc8-a58a-31c4ee0f915a_2524x904.png 424w, https://substackcdn.com/image/fetch/$s_!Tp23!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2904c5c3-9ee9-4dc8-a58a-31c4ee0f915a_2524x904.png 848w, https://substackcdn.com/image/fetch/$s_!Tp23!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2904c5c3-9ee9-4dc8-a58a-31c4ee0f915a_2524x904.png 1272w, https://substackcdn.com/image/fetch/$s_!Tp23!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2904c5c3-9ee9-4dc8-a58a-31c4ee0f915a_2524x904.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Tp23!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2904c5c3-9ee9-4dc8-a58a-31c4ee0f915a_2524x904.png" width="1200" height="429.3956043956044" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2904c5c3-9ee9-4dc8-a58a-31c4ee0f915a_2524x904.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:521,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:187121,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Tp23!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2904c5c3-9ee9-4dc8-a58a-31c4ee0f915a_2524x904.png 424w, https://substackcdn.com/image/fetch/$s_!Tp23!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2904c5c3-9ee9-4dc8-a58a-31c4ee0f915a_2524x904.png 848w, https://substackcdn.com/image/fetch/$s_!Tp23!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2904c5c3-9ee9-4dc8-a58a-31c4ee0f915a_2524x904.png 1272w, https://substackcdn.com/image/fetch/$s_!Tp23!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2904c5c3-9ee9-4dc8-a58a-31c4ee0f915a_2524x904.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" 
type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>Creating container filesystem</h3><p>Finally, let&#8217;s see how Docker uses Overlayfs under the hood to create a new filesystem.</p><p>First, let&#8217;s create temporary directories in <code>/tmp/</code> where we&#8217;ll set up a container filesystem manually.</p><pre><code>michal@michal-lg:/tmp$ mkdir -p /tmp/container-demo/{diff,merged,work}
michal@michal-lg:/tmp$ ls container-demo/
diff&nbsp; merged&nbsp; work</code></pre><p>We&#8217;ve seen <code>diff</code> and <code>merged</code> before. <code>work</code> is used by overlayfs as an internal scratchpad, and we don&#8217;t need to care about it.</p><p>Next, download the Alpine minirootfs for your CPU architecture and extract it to <code>/tmp/container-demo/</code>. I&#8217;ll rename the extracted folder to &#8220;<code>alpine</code>&#8221;.</p><pre><code>michal@michal-lg:/tmp/container-demo$ ls
alpine&nbsp; diff&nbsp; merged&nbsp; work</code></pre><p>Now that everything is set up, we can <a href="https://man7.org/linux/man-pages/man8/mount.8.html">mount the overlay filesystem</a>.</p><pre><code>michal@michal-lg:/tmp/container-demo$ sudo mount -t overlay overlay -o lowerdir=alpine,upperdir=diff,workdir=work merged</code></pre><p>Now if we list the contents of <code>merged</code>, we&#8217;ll see the Alpine filesystem:</p><pre><code>michal@michal-lg:/tmp/container-demo/merged$ ls
bin&nbsp; etc &nbsp; lib&nbsp; &nbsp; mnt&nbsp; proc&nbsp; run &nbsp; srv&nbsp; tmp&nbsp; var
dev&nbsp; home&nbsp; media&nbsp; opt&nbsp; root&nbsp; sbin&nbsp; sys&nbsp; usr</code></pre><p>And if we create a file in there, it will be written to the <code>diff</code> folder.</p><pre><code>michal@michal-lg:/tmp/container-demo/merged$ echo hello &gt; hello.txt
michal@michal-lg:/tmp/container-demo/merged$ ls
bin&nbsp; etc&nbsp; &nbsp; &nbsp; &nbsp; home&nbsp; media&nbsp; opt &nbsp; root&nbsp; sbin&nbsp; sys&nbsp; usr
dev&nbsp; hello.txt&nbsp; lib &nbsp; mnt&nbsp; &nbsp; proc&nbsp; run &nbsp; srv &nbsp; tmp&nbsp; var
michal@michal-lg:/tmp/container-demo/merged$ ls ../diff/
hello.txt</code></pre><p>As a final point, we can create a new shell process and set <code>merged</code> as its root directory. This is how Linux containers are restricted to seeing only their own filesystem. Any process spawned by this shell inherits the root, so it is also confined to this filesystem.</p><pre><code>michal@michal-lg:/tmp/container-demo$ sudo chroot merged /bin/sh
/ # ls
bin&nbsp; &nbsp; &nbsp; &nbsp; hello.txt&nbsp; media&nbsp; &nbsp; &nbsp; proc &nbsp; &nbsp; &nbsp; sbin &nbsp; &nbsp; &nbsp; tmp
dev&nbsp; &nbsp; &nbsp; &nbsp; home &nbsp; &nbsp; &nbsp; mnt&nbsp; &nbsp; &nbsp; &nbsp; root &nbsp; &nbsp; &nbsp; srv&nbsp; &nbsp; &nbsp; &nbsp; usr
etc&nbsp; &nbsp; &nbsp; &nbsp; lib&nbsp; &nbsp; &nbsp; &nbsp; opt&nbsp; &nbsp; &nbsp; &nbsp; run&nbsp; &nbsp; &nbsp; &nbsp; sys&nbsp; &nbsp; &nbsp; &nbsp; var
/ # cd ..
/ #&nbsp;</code></pre><p>A full implementation would take advantage of namespaces for additional isolation. You can learn more about those in the Linux <a href="https://man7.org/linux/man-pages/man7/namespaces.7.html">man-pages</a>.</p><h3>Conclusion</h3><p>Containers are a black box for the vast majority of engineers. You should now have a solid understanding of how things work under the hood!</p><p>If you are interested in a more complete implementation of Docker from scratch, consider checking out my <a href="https://github.com/MichalPitr/toy-docker">Golang Docker clone</a> that&#8217;s around 200 lines of code.</p><div><hr></div><p><br>I hope you learned something new! If you did, consider subscribing and/or following me on <a href="https://www.linkedin.com/in/michal-pitr-a7156b127/">LinkedIn</a>.</p><p>You might also enjoy some of my other posts linked below. 
Perhaps my deep-dive into SQLite storage format, my implementation of MapReduce from scratch, or my series building an ML inference engine from scratch?</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;b6757444-53bf-46b8-9992-7881358f5f55&quot;,&quot;caption&quot;:&quot;Recently I&#8217;ve been implementing a subset of SQLite (the world&#8217;s most used database, btw) from scratch in Go. I&#8217;ll share what I&#8217;ve learned about how SQLite stores data on disk which will help us understand key database concepts. Thanks for reading Michal&#8217;s Substack! 
Subscribe for free to receive new posts and support my work.&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;md&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;How does SQLite store data?&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:28235731,&quot;name&quot;:&quot;Michal Pitr&quot;,&quot;bio&quot;:&quot;I write deep dives into software engineering topics.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/73a6e774-038e-424d-afee-ce5041c3e7e0_1080x1080.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-03-17T16:50:48.826Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc850055-defb-4bdb-8da8-779e45b482f5_997x387.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://michalpitr.substack.com/p/how-does-sqlite-store-data&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:142692526,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:19,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Michal&#8217;s Deep Dives&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe5ab36b-c782-420e-ac35-c7599a1f77ad_976x976.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;8607921e-addc-4d6c-bd17-c6ce830196e0&quot;,&quot;caption&quot;:&quot;Over the last couple of weeks, I&#8217;ve been building MapReduce from 
scratch.&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;md&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;MapReduce from Scratch&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:28235731,&quot;name&quot;:&quot;Michal Pitr&quot;,&quot;bio&quot;:&quot;I write deep dives into software engineering topics.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/73a6e774-038e-424d-afee-ce5041c3e7e0_1080x1080.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-04-28T21:33:39.100Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1d3a7a5-e2cc-4778-bafe-03f87e6a6884_1600x671.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://michalpitr.substack.com/p/mapreduce-from-scratch&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:144104758,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:13,&quot;comment_count&quot;:4,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Michal&#8217;s Deep Dives&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe5ab36b-c782-420e-ac35-c7599a1f77ad_976x976.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;471681ba-7139-442f-a0fa-bbc5f286b5db&quot;,&quot;caption&quot;:&quot;I like to keep things practical. Let&#8217;s train a simple neural network, save the model, and write an inference engine that can execute inputs against the model. 
Sounds like a fun time to me!&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;md&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Build Your Own Inference Engine: From Scratch to \&quot;7\&quot;&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:28235731,&quot;name&quot;:&quot;Michal Pitr&quot;,&quot;bio&quot;:&quot;I write deep dives into software engineering topics.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/73a6e774-038e-424d-afee-ce5041c3e7e0_1080x1080.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-08-04T15:27:57.810Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e6e7b0d-48b1-4dc0-886c-96b2d844f2dc_1340x1106.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://michalpitr.substack.com/p/build-your-own-inference-engine-from&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:147338023,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:13,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Michal&#8217;s Deep Dives&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe5ab36b-c782-420e-ac35-c7599a1f77ad_976x976.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div>]]></content:encoded></item><item><title><![CDATA[Inference Engine: Accelerating with CUDA]]></title><description><![CDATA[Still crunching numbers, but faster.]]></description><link>https://michalpitr.substack.com/p/inference-engine-accelerating-with</link><guid 
isPermaLink="false">https://michalpitr.substack.com/p/inference-engine-accelerating-with</guid><dc:creator><![CDATA[Michal Pitr]]></dc:creator><pubDate>Sun, 15 Sep 2024 19:15:04 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ZW_P!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3080b7c-61f7-4753-bb22-de867070bcb4_1391x835.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>At the end of the <a href="https://michalpitr.substack.com/p/inference-engine-optimizing-performance">last post in this series</a>, we saw that my engine&#8217;s main bottleneck was matrix multiplication. Let&#8217;s for once use the GPU in my laptop for something other than League of Legends and accelerate the engine! By doing so, we&#8217;ll discover a couple of optimization techniques!</p><h2>Baseline</h2><p>In this post, we&#8217;ll focus on optimizing throughput: the number of inference requests processed per second. Tradeoffs between throughput and latency are pretty common. In ML applications, we often care about maximizing throughput while staying within overall latency tolerances.</p><p>First things first, we need to set up the benchmark. I&#8217;ll be using the MNIST neural network from <a href="https://michalpitr.substack.com/p/build-your-own-inference-engine-from">part 1</a> of this series. 
We&#8217;ll time how long it takes to process 10000 images and calculate the throughput.</p><p>CPU baseline: 430&#956;s per inference or ~2300 inferences per second.</p><p>As for hardware, I am using AMD Ryzen 5 5600H and an 80W mobile RTX 3060 with 6GB of VRAM.</p><h2>Adding CUDA execution provider</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZW_P!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3080b7c-61f7-4753-bb22-de867070bcb4_1391x835.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZW_P!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3080b7c-61f7-4753-bb22-de867070bcb4_1391x835.png 424w, https://substackcdn.com/image/fetch/$s_!ZW_P!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3080b7c-61f7-4753-bb22-de867070bcb4_1391x835.png 848w, https://substackcdn.com/image/fetch/$s_!ZW_P!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3080b7c-61f7-4753-bb22-de867070bcb4_1391x835.png 1272w, https://substackcdn.com/image/fetch/$s_!ZW_P!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3080b7c-61f7-4753-bb22-de867070bcb4_1391x835.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZW_P!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3080b7c-61f7-4753-bb22-de867070bcb4_1391x835.png" width="1391" height="835" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e3080b7c-61f7-4753-bb22-de867070bcb4_1391x835.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:835,&quot;width&quot;:1391,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:326381,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZW_P!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3080b7c-61f7-4753-bb22-de867070bcb4_1391x835.png 424w, https://substackcdn.com/image/fetch/$s_!ZW_P!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3080b7c-61f7-4753-bb22-de867070bcb4_1391x835.png 848w, https://substackcdn.com/image/fetch/$s_!ZW_P!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3080b7c-61f7-4753-bb22-de867070bcb4_1391x835.png 1272w, https://substackcdn.com/image/fetch/$s_!ZW_P!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3080b7c-61f7-4753-bb22-de867070bcb4_1391x835.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Added to compensate for the lack of diagrams in this article. Let&#8217;s not discuss how long this took to draw&#8230;</figcaption></figure></div><p></p><h3>Architecture</h3><p>My engine supports only these operations: GEMM, ReLU, Add, and flatten. Our goal is to make these also executable on a GPU. I went through multiple iterations, starting with a basic approach to validate ideas</p><pre><code>if (use_gpu) {
    gemm_cuda(...);
} else {
    gemm_cpu(...);
}</code></pre><p>and later introducing abstractions to make it easier to work with. Eventually, I settled on an approach heavily inspired by ONNX Runtime.<br><br>Shared inference functionality is encapsulated in an <code>inference_session</code> object. It loads the model, sets its input, and iterates over the topologically sorted computational graph. Nodes of the graph are executed via plugins called execution providers.</p><p>Execution providers implement the mathematical operations, such as matmul or ReLU, needed to execute each node. This way, each operation&#8217;s implementation can be optimized for the hardware the provider targets.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oF7C!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe15bd375-f8c1-4868-8e14-b88909e20cd2_1738x556.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oF7C!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe15bd375-f8c1-4868-8e14-b88909e20cd2_1738x556.png 424w, https://substackcdn.com/image/fetch/$s_!oF7C!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe15bd375-f8c1-4868-8e14-b88909e20cd2_1738x556.png 848w, https://substackcdn.com/image/fetch/$s_!oF7C!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe15bd375-f8c1-4868-8e14-b88909e20cd2_1738x556.png 1272w, https://substackcdn.com/image/fetch/$s_!oF7C!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe15bd375-f8c1-4868-8e14-b88909e20cd2_1738x556.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!oF7C!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe15bd375-f8c1-4868-8e14-b88909e20cd2_1738x556.png" width="728" height="233" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e15bd375-f8c1-4868-8e14-b88909e20cd2_1738x556.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:466,&quot;width&quot;:1456,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:125371,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!oF7C!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe15bd375-f8c1-4868-8e14-b88909e20cd2_1738x556.png 424w, https://substackcdn.com/image/fetch/$s_!oF7C!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe15bd375-f8c1-4868-8e14-b88909e20cd2_1738x556.png 848w, https://substackcdn.com/image/fetch/$s_!oF7C!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe15bd375-f8c1-4868-8e14-b88909e20cd2_1738x556.png 1272w, https://substackcdn.com/image/fetch/$s_!oF7C!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe15bd375-f8c1-4868-8e14-b88909e20cd2_1738x556.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">The class definition for base execution 
provider.</figcaption></figure></div><p>All operations are then implemented in each provider.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wMHl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8e2a960-96dd-4925-9161-c54759109638_1738x966.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wMHl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8e2a960-96dd-4925-9161-c54759109638_1738x966.png 424w, https://substackcdn.com/image/fetch/$s_!wMHl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8e2a960-96dd-4925-9161-c54759109638_1738x966.png 848w, https://substackcdn.com/image/fetch/$s_!wMHl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8e2a960-96dd-4925-9161-c54759109638_1738x966.png 1272w, https://substackcdn.com/image/fetch/$s_!wMHl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8e2a960-96dd-4925-9161-c54759109638_1738x966.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wMHl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8e2a960-96dd-4925-9161-c54759109638_1738x966.png" width="1456" height="809" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c8e2a960-96dd-4925-9161-c54759109638_1738x966.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:809,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:245996,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!wMHl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8e2a960-96dd-4925-9161-c54759109638_1738x966.png 424w, https://substackcdn.com/image/fetch/$s_!wMHl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8e2a960-96dd-4925-9161-c54759109638_1738x966.png 848w, https://substackcdn.com/image/fetch/$s_!wMHl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8e2a960-96dd-4925-9161-c54759109638_1738x966.png 1272w, https://substackcdn.com/image/fetch/$s_!wMHl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8e2a960-96dd-4925-9161-c54759109638_1738x966.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Definition of a CUDA execution provider.</figcaption></figure></div><p>Configuration is done via a yaml config file, inspired by Nvidia&#8217;s Triton Inference Server.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tx6F!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0dc0de16-92a1-4e32-8509-e974e9b8634e_1738x742.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tx6F!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0dc0de16-92a1-4e32-8509-e974e9b8634e_1738x742.png 424w, 
https://substackcdn.com/image/fetch/$s_!tx6F!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0dc0de16-92a1-4e32-8509-e974e9b8634e_1738x742.png 848w, https://substackcdn.com/image/fetch/$s_!tx6F!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0dc0de16-92a1-4e32-8509-e974e9b8634e_1738x742.png 1272w, https://substackcdn.com/image/fetch/$s_!tx6F!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0dc0de16-92a1-4e32-8509-e974e9b8634e_1738x742.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tx6F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0dc0de16-92a1-4e32-8509-e974e9b8634e_1738x742.png" width="1456" height="622" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0dc0de16-92a1-4e32-8509-e974e9b8634e_1738x742.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:622,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:144581,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tx6F!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0dc0de16-92a1-4e32-8509-e974e9b8634e_1738x742.png 424w, 
https://substackcdn.com/image/fetch/$s_!tx6F!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0dc0de16-92a1-4e32-8509-e974e9b8634e_1738x742.png 848w, https://substackcdn.com/image/fetch/$s_!tx6F!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0dc0de16-92a1-4e32-8509-e974e9b8634e_1738x742.png 1272w, https://substackcdn.com/image/fetch/$s_!tx6F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0dc0de16-92a1-4e32-8509-e974e9b8634e_1738x742.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>To see the full implementation, feel free to check out the <a href="https://github.com/MichalPitr/inference_engine">GitHub repository</a>.</p><h3>Gotta go slow to go fast</h3><p>My first attempt at adding CUDA-based matrix multiplication looked something like this. 
This is pretty common in CUDA samples seen online but has a huge issue!</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hv5g!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef320991-227f-47ab-aed8-9ac5e39054a4_1738x1300.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hv5g!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef320991-227f-47ab-aed8-9ac5e39054a4_1738x1300.png 424w, https://substackcdn.com/image/fetch/$s_!hv5g!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef320991-227f-47ab-aed8-9ac5e39054a4_1738x1300.png 848w, https://substackcdn.com/image/fetch/$s_!hv5g!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef320991-227f-47ab-aed8-9ac5e39054a4_1738x1300.png 1272w, https://substackcdn.com/image/fetch/$s_!hv5g!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef320991-227f-47ab-aed8-9ac5e39054a4_1738x1300.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hv5g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef320991-227f-47ab-aed8-9ac5e39054a4_1738x1300.png" width="728" height="544.5" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ef320991-227f-47ab-aed8-9ac5e39054a4_1738x1300.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:1089,&quot;width&quot;:1456,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:359266,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hv5g!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef320991-227f-47ab-aed8-9ac5e39054a4_1738x1300.png 424w, https://substackcdn.com/image/fetch/$s_!hv5g!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef320991-227f-47ab-aed8-9ac5e39054a4_1738x1300.png 848w, https://substackcdn.com/image/fetch/$s_!hv5g!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef320991-227f-47ab-aed8-9ac5e39054a4_1738x1300.png 1272w, https://substackcdn.com/image/fetch/$s_!hv5g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef320991-227f-47ab-aed8-9ac5e39054a4_1738x1300.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>So, how fast is this? Previously, we had a throughput of about 2300 inferences per second on the CPU. <em><strong>*Drum roll*</strong></em> <strong>470</strong> <strong>inferences per second: ~5 times slower</strong>.</p><p>What went wrong?</p><p>For every call, this implementation has to allocate memory on the GPU, transfer data from the CPU to the GPU, run the kernel, transfer the result back to the CPU, and finally free the GPU memory. These calls have significant overhead, which hurts most for cheap operations like ReLU and add.</p><p>Until now, I was using a batch size of 1 for inference. We can amortize some of these overheads by using larger batch sizes. 
The chart below shows how the throughput changes as we increase the batch size from 1 to 128 for the CPU and the naive CUDA provider.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5G9t!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa035690a-68e9-405e-b126-750b58f9e992_1600x595.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5G9t!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa035690a-68e9-405e-b126-750b58f9e992_1600x595.png 424w, https://substackcdn.com/image/fetch/$s_!5G9t!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa035690a-68e9-405e-b126-750b58f9e992_1600x595.png 848w, https://substackcdn.com/image/fetch/$s_!5G9t!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa035690a-68e9-405e-b126-750b58f9e992_1600x595.png 1272w, https://substackcdn.com/image/fetch/$s_!5G9t!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa035690a-68e9-405e-b126-750b58f9e992_1600x595.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5G9t!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa035690a-68e9-405e-b126-750b58f9e992_1600x595.png" width="1200" height="445.8791208791209" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a035690a-68e9-405e-b126-750b58f9e992_1600x595.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:541,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5G9t!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa035690a-68e9-405e-b126-750b58f9e992_1600x595.png 424w, https://substackcdn.com/image/fetch/$s_!5G9t!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa035690a-68e9-405e-b126-750b58f9e992_1600x595.png 848w, https://substackcdn.com/image/fetch/$s_!5G9t!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa035690a-68e9-405e-b126-750b58f9e992_1600x595.png 1272w, https://substackcdn.com/image/fetch/$s_!5G9t!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa035690a-68e9-405e-b126-750b58f9e992_1600x595.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The CPU provider&#8217;s throughput does not scale with batch size. The CUDA provider&#8217;s throughput is about 5x lower than the CPU&#8217;s at a batch size of 1 and grows to 11x higher at a batch size of 128.</p><p>This is encouraging! Let&#8217;s revisit the problem with data transfer overhead and see if we can do anything about performance at smaller batch sizes.</p><h3>Optimizing memory transfers</h3><p>We know that all of the model&#8217;s weights are fixed. Once loaded from disk, we can move them to the GPU and keep them there. For inputs and outputs, we have to tolerate some memory transfers at the start and end of each inference call.</p><p>What we cannot optimize right now is the creation of intermediate tensors during the inference loop. These store an operation&#8217;s output when we cannot reuse one of the input tensors as the output directly. 
Every intermediate tensor costs a cudaMalloc call when it is created and a cudaFree call when it is destroyed, and both are relatively expensive, as we&#8217;ve seen. We&#8217;ll see what we can do about this in the next section!</p><p>After a few changes to my tensor class and rewriting my CUDA operations to assume data is already on the GPU, here&#8217;s our new implementation benchmarked against the previous versions.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kpM_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe93be8c1-dbff-4335-9e55-1b60681455c0_1600x595.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kpM_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe93be8c1-dbff-4335-9e55-1b60681455c0_1600x595.png 424w, https://substackcdn.com/image/fetch/$s_!kpM_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe93be8c1-dbff-4335-9e55-1b60681455c0_1600x595.png 848w, https://substackcdn.com/image/fetch/$s_!kpM_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe93be8c1-dbff-4335-9e55-1b60681455c0_1600x595.png 1272w, https://substackcdn.com/image/fetch/$s_!kpM_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe93be8c1-dbff-4335-9e55-1b60681455c0_1600x595.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!kpM_!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe93be8c1-dbff-4335-9e55-1b60681455c0_1600x595.png" width="1200" height="445.8791208791209" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e93be8c1-dbff-4335-9e55-1b60681455c0_1600x595.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:541,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kpM_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe93be8c1-dbff-4335-9e55-1b60681455c0_1600x595.png 424w, https://substackcdn.com/image/fetch/$s_!kpM_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe93be8c1-dbff-4335-9e55-1b60681455c0_1600x595.png 848w, https://substackcdn.com/image/fetch/$s_!kpM_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe93be8c1-dbff-4335-9e55-1b60681455c0_1600x595.png 1272w, https://substackcdn.com/image/fetch/$s_!kpM_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe93be8c1-dbff-4335-9e55-1b60681455c0_1600x595.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" 
class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Keeping tensors on the GPU as much as possible increased throughput by between 4x and 12x. The benefit is more pronounced at smaller batch sizes, where the overhead from CUDA calls makes up a larger share of the run time. Nice!</p><h2>Optimizing intermediate tensor allocations</h2><p>I noticed that intermediate tensor creation and destruction was an issue when I profiled the engine with perf and saw that a lot of the execution time was spent in cudaMalloc and cudaFree calls.</p><p>If making multiple cudaMalloc and cudaFree calls is expensive, what if we pre-allocate a large chunk of GPU memory and reuse it? Newly created tensors can get a slice of memory from this memory pool and release it back to the pool when they are destroyed. 
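</p><p>The pooling idea can be sketched in a few lines of C++. This is only a minimal illustration under stated assumptions, not the engine&#8217;s actual allocator: BlockPool and its methods are hypothetical names, blocks are fixed-size, and std::malloc stands in for the single cudaMalloc call so that the sketch runs on the host.</p><pre><code>#include &lt;cstddef&gt;
#include &lt;cstdlib&gt;
#include &lt;new&gt;
#include &lt;vector&gt;

// Hypothetical fixed-size block pool (not the engine's real allocator).
// The backing buffer is allocated exactly once. On the GPU that one call
// would be cudaMalloc; std::malloc stands in so the sketch runs on the host.
class BlockPool {
public:
    BlockPool(std::size_t block_size, std::size_t block_count)
        : block_size_(block_size),
          base_(static_cast&lt;char*&gt;(std::malloc(block_size * block_count))) {
        if (base_ == nullptr) throw std::bad_alloc();
        // Initially every block is free.
        for (std::size_t i = 0; i &lt; block_count; ++i) {
            free_blocks_.push_back(base_ + i * block_size);
        }
    }
    ~BlockPool() { std::free(base_); }
    BlockPool(const BlockPool&amp;) = delete;
    BlockPool&amp; operator=(const BlockPool&amp;) = delete;

    // Hands out a free block, or nullptr if the request does not fit in
    // one block or the pool is exhausted.
    void* allocate(std::size_t bytes) {
        if (bytes &gt; block_size_ || free_blocks_.empty()) return nullptr;
        void* block = free_blocks_.back();
        free_blocks_.pop_back();
        return block;
    }

    // Returns a block to the pool without touching the underlying allocator.
    void deallocate(void* block) {
        free_blocks_.push_back(static_cast&lt;char*&gt;(block));
    }

    std::size_t free_count() const { return free_blocks_.size(); }

private:
    std::size_t block_size_;
    char* base_;
    std::vector&lt;char*&gt; free_blocks_;
};</code></pre><p>In this sketch, allocate and deallocate only pop and push a pointer, so their cost does not depend on the underlying allocator. A real pool would also have to handle variable tensor sizes and alignment. 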
This way we can completely avoid expensive cudaMalloc and cudaFree calls in the inference loop.</p><p>To make this work, tensors accept an optional allocator argument in their constructor. The allocator can internally use a memory pool without the tensor needing to know about it. Each execution provider can implement its own allocator or use the default CPU allocator, a thin wrapper around malloc and free.</p><p>As long as we pre-allocate enough memory, inference never has to call cudaMalloc or cudaFree. Let&#8217;s see how it affects performance.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!k952!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d545e97-c46f-4885-b9f4-09cc8812fe8e_1600x595.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!k952!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d545e97-c46f-4885-b9f4-09cc8812fe8e_1600x595.png 424w, https://substackcdn.com/image/fetch/$s_!k952!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d545e97-c46f-4885-b9f4-09cc8812fe8e_1600x595.png 848w, https://substackcdn.com/image/fetch/$s_!k952!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d545e97-c46f-4885-b9f4-09cc8812fe8e_1600x595.png 1272w, https://substackcdn.com/image/fetch/$s_!k952!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d545e97-c46f-4885-b9f4-09cc8812fe8e_1600x595.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!k952!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d545e97-c46f-4885-b9f4-09cc8812fe8e_1600x595.png" width="1200" height="445.8791208791209" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7d545e97-c46f-4885-b9f4-09cc8812fe8e_1600x595.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:541,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!k952!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d545e97-c46f-4885-b9f4-09cc8812fe8e_1600x595.png 424w, https://substackcdn.com/image/fetch/$s_!k952!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d545e97-c46f-4885-b9f4-09cc8812fe8e_1600x595.png 848w, https://substackcdn.com/image/fetch/$s_!k952!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d545e97-c46f-4885-b9f4-09cc8812fe8e_1600x595.png 1272w, https://substackcdn.com/image/fetch/$s_!k952!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d545e97-c46f-4885-b9f4-09cc8812fe8e_1600x595.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" 
class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Another 2x to 1.2x increase in throughput! The optimization again has more impact on runs with smaller batch sizes.</p><h2>Results</h2><p>This was the last optimization I&#8217;ve written, so let&#8217;s see what the overall improvement is.</p><pre><code><code>+------------+------------------+----------+
|            |     Throughput   |          |
+------------+------+-----------+ Increase |
| batch size | CPU  | CUDA_pool |          |
+============+======+===========+==========+
|          1 | 2315 |     12048 |        5 |
+------------+------+-----------+----------+
|          2 | 2345 |     23529 |       10 |
+------------+------+-----------+----------+
|          4 | 2306 |     36900 |       16 |
+------------+------+-----------+----------+
|          8 | 2360 |     45454 |       19 |
+------------+------+-----------+----------+
|         16 | 2309 |     60975 |       26 |
+------------+------+-----------+----------+
|         32 | 2328 |     90909 |       39 |
+------------+------+-----------+----------+
|         64 | 2338 |    120481 |       52 |
+------------+------+-----------+----------+
|        128 | 2329 |    144927 |       62 |
+------------+------+-----------+----------+</code></code></pre><p>Pretty nice! Even at a batch size of 1, we are at 5x the CPU performance. Larger batch sizes yield higher throughput up to 60x!<br><br>Using batch sizes beyond 128 doesn&#8217;t seem to provide much benefit.</p><p>If you are curious, the current main bottleneck is the moving of results back to the CPU memory. In a future post, I&#8217;d like to explore the impact of pageable and pinned memory on this, possibly using async cuda operations where viable, and maybe adding graph optimizations to fuse kernels such as matmul followed by ReLU.</p><p>Stay tuned! </p><p></p>]]></content:encoded></item><item><title><![CDATA[Inference 
Engine: Optimizing Performance]]></title><description><![CDATA[How I made my inference engine 15x faster]]></description><link>https://michalpitr.substack.com/p/inference-engine-optimizing-performance</link><guid isPermaLink="false">https://michalpitr.substack.com/p/inference-engine-optimizing-performance</guid><dc:creator><![CDATA[Michal Pitr]]></dc:creator><pubDate>Sat, 17 Aug 2024 21:28:18 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!1XVg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02518318-ad67-490e-8286-12455de1ee4f_2490x1413.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Today I wanted to add graph optimizations to my <a href="https://michalpitr.substack.com/p/build-your-own-inference-engine-from">inference engine</a>, hoping for maybe a 5-10% performance improvement. Instead, I accidentally found a critical bottleneck! <br><br>If you haven&#8217;t already, you can read about how I wrote an inference engine from scratch here.</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;09ca682b-dc23-4b40-bd80-3562ee1aae0b&quot;,&quot;caption&quot;:&quot;I like to keep things practical. Let&#8217;s train a simple neural network, save the model, and write an inference engine that can execute inputs against the model. 
Sounds like a fun time to me!&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Build Your Own Inference Engine: From Scratch to \&quot;7\&quot;&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:28235731,&quot;name&quot;:&quot;Michal Pitr&quot;,&quot;bio&quot;:&quot;I write deep dives into software engineering topics.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/73a6e774-038e-424d-afee-ce5041c3e7e0_1080x1080.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-08-04T15:27:57.810Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e6e7b0d-48b1-4dc0-886c-96b2d844f2dc_1340x1106.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://michalpitr.substack.com/p/build-your-own-inference-engine-from&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:147338023,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:8,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Michal&#8217;s Substack&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe5ab36b-c782-420e-ac35-c7599a1f77ad_976x976.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><h2>Baseline benchmarking</h2><p>Improving performance is an empirical endeavor. 
I&#8217;ll be using a simple benchmark where the engine executes 500 sequential inference requests.</p><p>I&#8217;ll be compiling the binary in debug mode throughout this post so that the profiler has access to all debug symbols. Benchmarks should normally be run on a release build, since a debug build typically disables compiler optimizations and inflates the absolute timings, but the extra readability will serve us well here.</p><p>We can time the execution with <code>time</code>.</p><pre><code>time /home/michal/code/inference_engine/build/src/engine_exe /home/michal/code/inference_engine/models/mnist_ffn_complex.onnx 

real    0m15.194s
user    0m14.500s
sys     0m0.694s</code></pre><h2>Profiling</h2><p>I&#8217;ll be using perf - a powerful profiler baked into the Linux kernel. To visualize perf&#8217;s report, I&#8217;ll use an OSS perf GUI <a href="https://github.com/KDAB/hotspot">Hotspot</a>. 
This will give us access to nice flame charts.</p><p>I&#8217;m looking for major bottlenecks so a sampling analysis will work just fine for me.</p><pre><code>perf record -o /home/michal/perf.data --call-graph dwarf --aio --sample-cpu /home/michal/code/inference_engine/build/src/engine_exe /home/michal/code/inference_engine/models/mnist_ffn_complex.onnx</code></pre><p></p><p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1XVg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02518318-ad67-490e-8286-12455de1ee4f_2490x1413.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1XVg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02518318-ad67-490e-8286-12455de1ee4f_2490x1413.png 424w, https://substackcdn.com/image/fetch/$s_!1XVg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02518318-ad67-490e-8286-12455de1ee4f_2490x1413.png 848w, https://substackcdn.com/image/fetch/$s_!1XVg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02518318-ad67-490e-8286-12455de1ee4f_2490x1413.png 1272w, https://substackcdn.com/image/fetch/$s_!1XVg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02518318-ad67-490e-8286-12455de1ee4f_2490x1413.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1XVg!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02518318-ad67-490e-8286-12455de1ee4f_2490x1413.png" width="1200" 
height="680.7692307692307" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/02518318-ad67-490e-8286-12455de1ee4f_2490x1413.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:826,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:321472,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1XVg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02518318-ad67-490e-8286-12455de1ee4f_2490x1413.png 424w, https://substackcdn.com/image/fetch/$s_!1XVg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02518318-ad67-490e-8286-12455de1ee4f_2490x1413.png 848w, https://substackcdn.com/image/fetch/$s_!1XVg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02518318-ad67-490e-8286-12455de1ee4f_2490x1413.png 1272w, https://substackcdn.com/image/fetch/$s_!1XVg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02518318-ad67-490e-8286-12455de1ee4f_2490x1413.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Baseline results as a flame chart in Hotspot</figcaption></figure></div><p>The flame chart shows us where the code spends the most CPU cycles. The core function in my engine is the <code>InferenceEngine::infer</code> call that I&#8217;ve zoomed in on. It iterates over the computational graph, loads inputs, and evaluates every node.</p><p>A closer look at the flame chart reveals something concerning. Less than 40% of cycles are spent evaluating nodes. <code>InferenceEngine::prepareNodeInputs</code> is taking up a significant amount of time. 
There are many Tensor memory allocation calls and copy calls.</p><p>When I zoom in on the <code>InferenceEngine::evaluateNode</code> call, the story gets worse.</p><p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Jpss!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9addfa8-2f06-4d59-a79a-4000abbdb001_2430x714.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Jpss!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9addfa8-2f06-4d59-a79a-4000abbdb001_2430x714.png 424w, https://substackcdn.com/image/fetch/$s_!Jpss!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9addfa8-2f06-4d59-a79a-4000abbdb001_2430x714.png 848w, https://substackcdn.com/image/fetch/$s_!Jpss!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9addfa8-2f06-4d59-a79a-4000abbdb001_2430x714.png 1272w, https://substackcdn.com/image/fetch/$s_!Jpss!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9addfa8-2f06-4d59-a79a-4000abbdb001_2430x714.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Jpss!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9addfa8-2f06-4d59-a79a-4000abbdb001_2430x714.png" width="1200" height="352.74725274725273" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d9addfa8-2f06-4d59-a79a-4000abbdb001_2430x714.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:428,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:151799,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Jpss!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9addfa8-2f06-4d59-a79a-4000abbdb001_2430x714.png 424w, https://substackcdn.com/image/fetch/$s_!Jpss!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9addfa8-2f06-4d59-a79a-4000abbdb001_2430x714.png 848w, https://substackcdn.com/image/fetch/$s_!Jpss!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9addfa8-2f06-4d59-a79a-4000abbdb001_2430x714.png 1272w, https://substackcdn.com/image/fetch/$s_!Jpss!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9addfa8-2f06-4d59-a79a-4000abbdb001_2430x714.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Baseline results - zoomed in on evaluateNode</figcaption></figure></div><p>It only spends around 20% of its execution time doing useful work. 
The rest is spent on copying and destroying Tensors.</p><p>Let&#8217;s see if we can optimize this.</p><h2>Optimizing</h2><p>Let&#8217;s look at the <code>InferenceEngine::prepareNodeInputs</code> method.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!abns!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f3e1b98-5a9c-4467-b099-bfd91ab13cee_1430x624.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!abns!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f3e1b98-5a9c-4467-b099-bfd91ab13cee_1430x624.png 424w, https://substackcdn.com/image/fetch/$s_!abns!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f3e1b98-5a9c-4467-b099-bfd91ab13cee_1430x624.png 848w, https://substackcdn.com/image/fetch/$s_!abns!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f3e1b98-5a9c-4467-b099-bfd91ab13cee_1430x624.png 1272w, https://substackcdn.com/image/fetch/$s_!abns!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f3e1b98-5a9c-4467-b099-bfd91ab13cee_1430x624.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!abns!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f3e1b98-5a9c-4467-b099-bfd91ab13cee_1430x624.png" width="1430" height="624" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5f3e1b98-5a9c-4467-b099-bfd91ab13cee_1430x624.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:624,&quot;width&quot;:1430,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:127206,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!abns!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f3e1b98-5a9c-4467-b099-bfd91ab13cee_1430x624.png 424w, https://substackcdn.com/image/fetch/$s_!abns!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f3e1b98-5a9c-4467-b099-bfd91ab13cee_1430x624.png 848w, https://substackcdn.com/image/fetch/$s_!abns!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f3e1b98-5a9c-4467-b099-bfd91ab13cee_1430x624.png 1272w, https://substackcdn.com/image/fetch/$s_!abns!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f3e1b98-5a9c-4467-b099-bfd91ab13cee_1430x624.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Inefficient prepareNodeInputs function</figcaption></figure></div><p>Here we take elements from a map and push them into a vector. It looks innocent enough, but it creates a copy of each Tensor. Let&#8217;s rework this to instead return a vector of raw pointers to Tensors. 
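</p><p>To make the difference concrete, here is a minimal, self-contained sketch of the two approaches. It is not the engine&#8217;s actual code: the <code>Tensor</code> struct (with a copy counter added purely for illustration), the <code>TensorMap</code> alias, and both function names are hypothetical stand-ins, and the sketch assumes C++17.</p><pre><code>#include &lt;cstddef&gt;
#include &lt;string&gt;
#include &lt;unordered_map&gt;
#include &lt;vector&gt;

// Hypothetical stand-in for the engine's Tensor. It counts deep copies
// so the cost of each approach is visible.
struct Tensor {
    std::vector&lt;float&gt; data;
    static inline std::size_t copies = 0;  // C++17 inline static member

    Tensor() = default;
    explicit Tensor(std::size_t n) : data(n, 0.0f) {}
    Tensor(const Tensor&amp; other) : data(other.data) { ++copies; }
    Tensor(Tensor&amp;&amp;) noexcept = default;
    Tensor&amp; operator=(const Tensor&amp;) = default;
    Tensor&amp; operator=(Tensor&amp;&amp;) noexcept = default;
};

// Assumed storage: tensors keyed by name.
using TensorMap = std::unordered_map&lt;std::string, Tensor&gt;;

// Copying version: every push_back deep-copies the tensor's buffer,
// and each name is looked up twice.
std::vector&lt;Tensor&gt; prepareInputsByValue(const std::vector&lt;std::string&gt;&amp; names,
                                         const TensorMap&amp; tensors) {
    std::vector&lt;Tensor&gt; inputs;
    for (const auto&amp; name : names) {
        if (tensors.count(name) != 0) {          // first lookup
            inputs.push_back(tensors.at(name));  // second lookup + copy
        }
    }
    return inputs;
}

// Pointer version: one lookup per name, no Tensor copies, and the
// vector is sized up front. Each pointer stays valid only while its
// map entry is not erased and the map itself is alive.
std::vector&lt;const Tensor*&gt; prepareInputsByPointer(const std::vector&lt;std::string&gt;&amp; names,
                                                  const TensorMap&amp; tensors) {
    std::vector&lt;const Tensor*&gt; inputs;
    inputs.reserve(names.size());
    for (const auto&amp; name : names) {
        auto it = tensors.find(name);
        if (it != tensors.end()) {
            inputs.push_back(&amp;it-&gt;second);
        }
    }
    return inputs;
}</code></pre><p>In this sketch, fetching two existing inputs by value bumps <code>Tensor::copies</code> by two, while the pointer version leaves it at zero. The trade-off is lifetime: the caller must not keep the pointers around after the map entry they point to is gone.</p><p>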
While doing so, we can also slightly optimize the map access and pre-allocate a vector of appropriate size.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tcUc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba62b49-b941-4cbb-9ecf-ebd73bd8bbd5_1498x810.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tcUc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba62b49-b941-4cbb-9ecf-ebd73bd8bbd5_1498x810.png 424w, https://substackcdn.com/image/fetch/$s_!tcUc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba62b49-b941-4cbb-9ecf-ebd73bd8bbd5_1498x810.png 848w, https://substackcdn.com/image/fetch/$s_!tcUc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba62b49-b941-4cbb-9ecf-ebd73bd8bbd5_1498x810.png 1272w, https://substackcdn.com/image/fetch/$s_!tcUc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba62b49-b941-4cbb-9ecf-ebd73bd8bbd5_1498x810.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tcUc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba62b49-b941-4cbb-9ecf-ebd73bd8bbd5_1498x810.png" width="1456" height="787" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5ba62b49-b941-4cbb-9ecf-ebd73bd8bbd5_1498x810.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:787,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:162429,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tcUc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba62b49-b941-4cbb-9ecf-ebd73bd8bbd5_1498x810.png 424w, https://substackcdn.com/image/fetch/$s_!tcUc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba62b49-b941-4cbb-9ecf-ebd73bd8bbd5_1498x810.png 848w, https://substackcdn.com/image/fetch/$s_!tcUc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba62b49-b941-4cbb-9ecf-ebd73bd8bbd5_1498x810.png 1272w, https://substackcdn.com/image/fetch/$s_!tcUc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba62b49-b941-4cbb-9ecf-ebd73bd8bbd5_1498x810.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Optimized prepareNodeInputs function</figcaption></figure></div><p>In the flame chart, we saw a bunch of Tensor copies, so I&#8217;ve also gone ahead and used pointers or references where applicable in the rest of the <code>InferenceEngine::infer</code> method to minimize the number of unnecessary copies.</p><h2>Results</h2><p>To see if our previous changes made a material difference we need to benchmark again:</p><pre><code>real    0m1.053s
user    0m1.040s
sys     0m0.013s</code></pre><p>Wow, roughly 15x faster (15.19s down to 1.05s, about 14.4x to be precise), just like that.</p><p>Let&#8217;s also re-run the profiler to see what proportion of time is spent on useful operations now.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zRPt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ed3e7da-7167-4edb-9976-7afc42ea4467_2430x714.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zRPt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ed3e7da-7167-4edb-9976-7afc42ea4467_2430x714.png 424w, https://substackcdn.com/image/fetch/$s_!zRPt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ed3e7da-7167-4edb-9976-7afc42ea4467_2430x714.png 848w, https://substackcdn.com/image/fetch/$s_!zRPt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ed3e7da-7167-4edb-9976-7afc42ea4467_2430x714.png 1272w, https://substackcdn.com/image/fetch/$s_!zRPt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ed3e7da-7167-4edb-9976-7afc42ea4467_2430x714.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zRPt!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ed3e7da-7167-4edb-9976-7afc42ea4467_2430x714.png" width="1200" height="352.74725274725273" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0ed3e7da-7167-4edb-9976-7afc42ea4467_2430x714.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:428,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:76890,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zRPt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ed3e7da-7167-4edb-9976-7afc42ea4467_2430x714.png 424w, https://substackcdn.com/image/fetch/$s_!zRPt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ed3e7da-7167-4edb-9976-7afc42ea4467_2430x714.png 848w, https://substackcdn.com/image/fetch/$s_!zRPt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ed3e7da-7167-4edb-9976-7afc42ea4467_2430x714.png 1272w, https://substackcdn.com/image/fetch/$s_!zRPt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ed3e7da-7167-4edb-9976-7afc42ea4467_2430x714.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Flame chart with optimizations applied</figcaption></figure></div><p>As we can see, <code>InferenceEngine::evaluateNode</code> now accounts for the vast majority of the <code>infer</code> call. Most of the time is spent on the most complex operation my engine currently supports: general matrix multiplication (GEMM). That&#8217;s great!</p><p>To overcome the performance limitations imposed by my current GEMM implementation, I plan to explore either offloading the computation to a GPU using CUDA or distributing the workload across multiple CPU cores.</p><p>With this in mind, I think I&#8217;ll hold off on graph optimizations as they likely won&#8217;t provide a significant improvement compared to other optimizations.</p><div><hr></div><p><br>I hope you enjoyed this lighter post on performance programming! 
If you did, consider subscribing and/or following me on <a href="https://www.linkedin.com/in/michal-pitr-a7156b127/">LinkedIn</a>!</p><p>You might also enjoy some of my other posts. Maybe check out one of the ones linked below?</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;cd94e5dd-d428-4feb-9226-9328c7affb49&quot;,&quot;caption&quot;:&quot;Recently I&#8217;ve been implementing a subset of SQLite (the world&#8217;s most used database, btw) from scratch in Go. I&#8217;ll share what I&#8217;ve learned about how SQLite stores data on disk which will help us understand key database concepts. Thanks for reading Michal&#8217;s Substack! 
Subscribe for free to receive new posts and support my work.&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;How does SQLite store data?&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:28235731,&quot;name&quot;:&quot;Michal Pitr&quot;,&quot;bio&quot;:&quot;I write deep dives into software engineering topics.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/73a6e774-038e-424d-afee-ce5041c3e7e0_1080x1080.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-03-17T16:50:48.826Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc850055-defb-4bdb-8da8-779e45b482f5_997x387.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://michalpitr.substack.com/p/how-does-sqlite-store-data&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:142692526,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:16,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Michal&#8217;s Substack&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe5ab36b-c782-420e-ac35-c7599a1f77ad_976x976.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;38255e8d-8171-4710-a4d4-e313fffa5fae&quot;,&quot;caption&quot;:&quot;One of the great joys of software engineering is dispelling magic. 
I&#8217;ve written code that executed on a GPU using frameworks like PyTorch or TensorFlow, but I never understood the &#8220;how&#8221;. It&#8217;s time to dispel the magic of GPU programming and learn how it works under the hood.&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;GPU Programming&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:28235731,&quot;name&quot;:&quot;Michal Pitr&quot;,&quot;bio&quot;:&quot;I write deep dives into software engineering topics.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/73a6e774-038e-424d-afee-ce5041c3e7e0_1080x1080.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-05-04T19:25:04.223Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60744dcf-3477-4aa5-9cdb-f020a0ba874c_2472x888.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://michalpitr.substack.com/p/gpu-programming&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:144305968,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:3,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Michal&#8217;s Substack&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe5ab36b-c782-420e-ac35-c7599a1f77ad_976x976.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;f60f5805-48c3-4f2e-9e2c-be3ddbc81f58&quot;,&quot;caption&quot;:&quot;Over the last couple of 
weeks, I&#8217;ve been building MapReduce from scratch.&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;MapReduce from Scratch&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:28235731,&quot;name&quot;:&quot;Michal Pitr&quot;,&quot;bio&quot;:&quot;I write deep dives into software engineering topics.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/73a6e774-038e-424d-afee-ce5041c3e7e0_1080x1080.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-04-28T21:33:39.100Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1d3a7a5-e2cc-4778-bafe-03f87e6a6884_1600x671.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://michalpitr.substack.com/p/mapreduce-from-scratch&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:144104758,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:10,&quot;comment_count&quot;:4,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Michal&#8217;s Substack&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe5ab36b-c782-420e-ac35-c7599a1f77ad_976x976.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p></p>]]></content:encoded></item><item><title><![CDATA[Build Your Own Inference Engine: From Scratch to "7"]]></title><description><![CDATA[Building a C++ Inference Engine from scratch]]></description><link>https://michalpitr.substack.com/p/build-your-own-inference-engine-from</link><guid 
isPermaLink="false">https://michalpitr.substack.com/p/build-your-own-inference-engine-from</guid><dc:creator><![CDATA[Michal Pitr]]></dc:creator><pubDate>Sun, 04 Aug 2024 15:27:57 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e6e7b0d-48b1-4dc0-886c-96b2d844f2dc_1340x1106.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I like to keep things practical. Let&#8217;s train a simple neural network, save the model, and write an inference engine that can execute inputs against the model. Sounds like a fun time to me!</p><h1>Training a model</h1><p>Before we can serve a model, we need to train one. We&#8217;ll be using the model illustrated below.</p><p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xkRs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Facf0a0fa-b3b9-4d1b-935c-b95df6bf6aa5_1836x1660.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xkRs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Facf0a0fa-b3b9-4d1b-935c-b95df6bf6aa5_1836x1660.png 424w, https://substackcdn.com/image/fetch/$s_!xkRs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Facf0a0fa-b3b9-4d1b-935c-b95df6bf6aa5_1836x1660.png 848w, https://substackcdn.com/image/fetch/$s_!xkRs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Facf0a0fa-b3b9-4d1b-935c-b95df6bf6aa5_1836x1660.png 1272w, 
https://substackcdn.com/image/fetch/$s_!xkRs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Facf0a0fa-b3b9-4d1b-935c-b95df6bf6aa5_1836x1660.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xkRs!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Facf0a0fa-b3b9-4d1b-935c-b95df6bf6aa5_1836x1660.png" width="954" height="862.2692307692307" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/acf0a0fa-b3b9-4d1b-935c-b95df6bf6aa5_1836x1660.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:1316,&quot;width&quot;:1456,&quot;resizeWidth&quot;:954,&quot;bytes&quot;:162730,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xkRs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Facf0a0fa-b3b9-4d1b-935c-b95df6bf6aa5_1836x1660.png 424w, https://substackcdn.com/image/fetch/$s_!xkRs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Facf0a0fa-b3b9-4d1b-935c-b95df6bf6aa5_1836x1660.png 848w, https://substackcdn.com/image/fetch/$s_!xkRs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Facf0a0fa-b3b9-4d1b-935c-b95df6bf6aa5_1836x1660.png 1272w, 
https://substackcdn.com/image/fetch/$s_!xkRs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Facf0a0fa-b3b9-4d1b-935c-b95df6bf6aa5_1836x1660.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Model for MNIST digit classification.</figcaption></figure></div><p>This model has some nice features: it&#8217;s easy to train, has a non-trivial topology, and requires only four operations: Flatten, Gemm, ReLU, and Add. Gemm stands for general matrix multiplication. 
I think the others are self-explanatory.</p><p>We want to save the trained model in <a href="https://onnx.ai/index.html">ONNX format</a>. ONNX is an open standard for representing models, designed for interoperability between ML frameworks. Since I probably won&#8217;t be adding support for other model formats, this is a solid default choice.</p><p>You can see all the code related to training the model on <a href="https://github.com/MichalPitr/mnist/blob/main/nn_complex.py">my GitHub</a>.</p><pre><code>(venv) michal@michal-lg:~/code/mnist$ python nn_complex.py 
Net(
  (fc1): Linear(in_features=784, out_features=512, bias=True)
  (fc2_left): Linear(in_features=512, out_features=200, bias=True)
  (fc2_left2): Linear(in_features=200, out_features=100, bias=True)
  (fc2_right): Linear(in_features=512, out_features=100, bias=True)
  (fc3): Linear(in_features=100, out_features=10, bias=True)
)
Epoch 1 - Test loss: 0.0006, Accuracy: 84.77%
Epoch 2 - Test loss: 0.0004, Accuracy: 89.66%
Epoch 3 - Test loss: 0.0003, Accuracy: 90.73%
Epoch 4 - Test loss: 0.0003, Accuracy: 91.91%
Epoch 5 - Test loss: 0.0003, Accuracy: 92.58%
Epoch 6 - Test loss: 0.0002, Accuracy: 93.31%
Epoch 7 - Test loss: 0.0002, Accuracy: 93.59%
Epoch 8 - Test loss: 0.0002, Accuracy: 94.02%
Epoch 9 - Test loss: 0.0002, Accuracy: 94.23%
Epoch 10 - Test loss: 0.0002, Accuracy: 94.65%
Model saved as mnist_ffn_complex.onnx</code></pre><p>With the model trained, let&#8217;s learn more about inference engines.</p><h1>Why inference engines matter</h1><p>Before designing the engine, let&#8217;s discuss why we want one in the first place. Couldn&#8217;t we just reuse the framework we used to train the model?</p><p>As LLMs went mainstream, it became clear that, over the lifetime of a model, serving can be more expensive than training. So it makes sense to have specialized tools optimized specifically for inference.</p><p>Inference servers are software for managing the deployment, lifecycle, and serving-related optimizations of already trained models. Popular inference servers include NVIDIA&#8217;s <a href="https://github.com/triton-inference-server/server?tab=readme-ov-file">Triton Inference Server</a> and Google&#8217;s <a href="https://github.com/tensorflow/serving">TensorFlow Serving</a>.</p><p>Inference servers balance throughput and latency. Throughput is often optimized through dynamic batching: waiting for inference requests to accumulate before handing them off to the inference engine as a single batch. 
This improves hardware utilization at the cost of increased latency for some requests.&nbsp;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Lzhz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e6e7b0d-48b1-4dc0-886c-96b2d844f2dc_1340x1106.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Lzhz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e6e7b0d-48b1-4dc0-886c-96b2d844f2dc_1340x1106.png 424w, https://substackcdn.com/image/fetch/$s_!Lzhz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e6e7b0d-48b1-4dc0-886c-96b2d844f2dc_1340x1106.png 848w, https://substackcdn.com/image/fetch/$s_!Lzhz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e6e7b0d-48b1-4dc0-886c-96b2d844f2dc_1340x1106.png 1272w, https://substackcdn.com/image/fetch/$s_!Lzhz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e6e7b0d-48b1-4dc0-886c-96b2d844f2dc_1340x1106.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Lzhz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e6e7b0d-48b1-4dc0-886c-96b2d844f2dc_1340x1106.png" width="1340" height="1106" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6e6e7b0d-48b1-4dc0-886c-96b2d844f2dc_1340x1106.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1106,&quot;width&quot;:1340,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:239991,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Lzhz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e6e7b0d-48b1-4dc0-886c-96b2d844f2dc_1340x1106.png 424w, https://substackcdn.com/image/fetch/$s_!Lzhz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e6e7b0d-48b1-4dc0-886c-96b2d844f2dc_1340x1106.png 848w, https://substackcdn.com/image/fetch/$s_!Lzhz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e6e7b0d-48b1-4dc0-886c-96b2d844f2dc_1340x1106.png 1272w, https://substackcdn.com/image/fetch/$s_!Lzhz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e6e7b0d-48b1-4dc0-886c-96b2d844f2dc_1340x1106.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Illustration of an inference engine and its components.</figcaption></figure></div><p>The inference engine, the subject of our discussion, then efficiently executes the model with provided inputs. To achieve high performance, inference engines employ a range of optimizations:</p><ul><li><p>Hardware acceleration</p></li><li><p>Efficient memory management</p></li><li><p>Graph optimizations</p></li><li><p>Quantization to reduce numeric precision while maintaining accuracy</p></li></ul><p>I want to implement graph optimizations and GPU acceleration in follow-up posts. Consider subscribing to get an email when I publish the next post. 
</p><p>For now, let&#8217;s stick with CPU inference.</p><h1>Inference engine from scratch</h1><p>Let&#8217;s outline the steps our engine will perform:</p><ul><li><p>Load the model</p></li><li><p>Construct a graph representation of the model</p></li><li><p>Topologically sort nodes</p></li><li><p>Run inference with user inputs</p></li></ul><h2>Loading the model</h2><p>Luckily for us, ONNX models are serialized with Protobuf. This means we can download <a href="https://github.com/ankane/onnxruntime-1/blob/master/onnxruntime/core/protobuf/onnx-ml.proto">onnx-ml.proto</a> and generate a client library for reading ONNX files. This will also be our only external dependency; everything else uses just the standard library.</p><p>Once the model is loaded, we can extract the weights into a Tensor object. 
Tensor is a thin wrapper around std::vector&lt;T&gt; where elements are stored in row-major order.</p><h3>Graph construction</h3><p>In this part, we iterate over all nodes in the ONNX model, extract each into a minimal Node representation, and store them in an adjacency list in a Graph object.</p><p>Nodes simply define the operation and list the names of input and output tensors.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!d7Bk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F268539f3-e5de-41f7-8956-9252ee8c8509_800x235.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!d7Bk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F268539f3-e5de-41f7-8956-9252ee8c8509_800x235.png 424w, https://substackcdn.com/image/fetch/$s_!d7Bk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F268539f3-e5de-41f7-8956-9252ee8c8509_800x235.png 848w, https://substackcdn.com/image/fetch/$s_!d7Bk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F268539f3-e5de-41f7-8956-9252ee8c8509_800x235.png 1272w, https://substackcdn.com/image/fetch/$s_!d7Bk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F268539f3-e5de-41f7-8956-9252ee8c8509_800x235.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!d7Bk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F268539f3-e5de-41f7-8956-9252ee8c8509_800x235.png" width="547" height="160.68125" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/268539f3-e5de-41f7-8956-9252ee8c8509_800x235.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:235,&quot;width&quot;:800,&quot;resizeWidth&quot;:547,&quot;bytes&quot;:55467,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!d7Bk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F268539f3-e5de-41f7-8956-9252ee8c8509_800x235.png 424w, https://substackcdn.com/image/fetch/$s_!d7Bk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F268539f3-e5de-41f7-8956-9252ee8c8509_800x235.png 848w, https://substackcdn.com/image/fetch/$s_!d7Bk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F268539f3-e5de-41f7-8956-9252ee8c8509_800x235.png 1272w, https://substackcdn.com/image/fetch/$s_!d7Bk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F268539f3-e5de-41f7-8956-9252ee8c8509_800x235.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p></p><p>When we construct the graph, we add nodes one by one. When a new node is added, we compare input and output tensor names to find its relatives: an existing node is a parent of the new node if one of its outputs is among the new node&#8217;s inputs, and a child if one of its inputs is among the new node&#8217;s outputs.</p><p>Now that we have a graph, we have to figure out how to execute it.</p><h3>Topological sorting</h3><p>I&#8217;ll assume that our graphs are acyclic, i.e. 
we are dealing with DAGs. Still, we need to execute nodes in an order that guarantees all of a node&#8217;s inputs have been computed before the node itself runs.<br><br>If we didn&#8217;t pay special attention to this, we would likely run into a scenario like the one illustrated in the animation below. There, we try to execute &#8220;Add&#8221; before the results of the left branch are ready.</p><p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!d5Pe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F282a3534-40e2-47ce-9c54-05034907a8da_543x1234.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!d5Pe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F282a3534-40e2-47ce-9c54-05034907a8da_543x1234.gif 424w, https://substackcdn.com/image/fetch/$s_!d5Pe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F282a3534-40e2-47ce-9c54-05034907a8da_543x1234.gif 848w, https://substackcdn.com/image/fetch/$s_!d5Pe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F282a3534-40e2-47ce-9c54-05034907a8da_543x1234.gif 1272w, https://substackcdn.com/image/fetch/$s_!d5Pe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F282a3534-40e2-47ce-9c54-05034907a8da_543x1234.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!d5Pe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F282a3534-40e2-47ce-9c54-05034907a8da_543x1234.gif" width="355" 
height="806.7587476979742" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/282a3534-40e2-47ce-9c54-05034907a8da_543x1234.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1234,&quot;width&quot;:543,&quot;resizeWidth&quot;:355,&quot;bytes&quot;:107095,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!d5Pe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F282a3534-40e2-47ce-9c54-05034907a8da_543x1234.gif 424w, https://substackcdn.com/image/fetch/$s_!d5Pe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F282a3534-40e2-47ce-9c54-05034907a8da_543x1234.gif 848w, https://substackcdn.com/image/fetch/$s_!d5Pe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F282a3534-40e2-47ce-9c54-05034907a8da_543x1234.gif 1272w, https://substackcdn.com/image/fetch/$s_!d5Pe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F282a3534-40e2-47ce-9c54-05034907a8da_543x1234.gif 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Trying to execute &#8220;Add&#8221; before the output of the left ReLU is ready.</figcaption></figure></div><p></p><p>Instead, we want something closer to this.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ERIX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F780d1ac6-73e5-43e8-b7f3-438584551ea1_543x1234.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ERIX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F780d1ac6-73e5-43e8-b7f3-438584551ea1_543x1234.gif 424w, 
https://substackcdn.com/image/fetch/$s_!ERIX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F780d1ac6-73e5-43e8-b7f3-438584551ea1_543x1234.gif 848w, https://substackcdn.com/image/fetch/$s_!ERIX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F780d1ac6-73e5-43e8-b7f3-438584551ea1_543x1234.gif 1272w, https://substackcdn.com/image/fetch/$s_!ERIX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F780d1ac6-73e5-43e8-b7f3-438584551ea1_543x1234.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ERIX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F780d1ac6-73e5-43e8-b7f3-438584551ea1_543x1234.gif" width="353" height="802.2136279926335" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/780d1ac6-73e5-43e8-b7f3-438584551ea1_543x1234.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1234,&quot;width&quot;:543,&quot;resizeWidth&quot;:353,&quot;bytes&quot;:220449,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ERIX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F780d1ac6-73e5-43e8-b7f3-438584551ea1_543x1234.gif 424w, 
https://substackcdn.com/image/fetch/$s_!ERIX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F780d1ac6-73e5-43e8-b7f3-438584551ea1_543x1234.gif 848w, https://substackcdn.com/image/fetch/$s_!ERIX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F780d1ac6-73e5-43e8-b7f3-438584551ea1_543x1234.gif 1272w, https://substackcdn.com/image/fetch/$s_!ERIX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F780d1ac6-73e5-43e8-b7f3-438584551ea1_543x1234.gif 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Executing the model in topological order.</figcaption></figure></div><p>If you are already wondering whether we could execute those two branches in parallel, you are on the right track! We won&#8217;t go into that in this post as I haven&#8217;t implemented that logic yet, but it&#8217;s something I&#8217;d like to explore down the line.</p><p>So how do we get this order? We can use a trusty topological sort, which can be concisely implemented with depth-first search. Since we assume the model is static, it&#8217;s sufficient to compute the order once when the model first loads.</p><p>For those interested in seeing this implemented, you can read it <a href="https://github.com/MichalPitr/inference_engine/blob/main/src/graph.cpp#L67-L131">here</a>.</p><h3>Inference</h3><p>We&#8217;ve loaded the model, extracted its weights, constructed a graph, and sorted the graph&#8217;s nodes. We are all set for inference. I&#8217;ll skip input loading as it&#8217;s not particularly interesting. For our MNIST example, every input is a Tensor&lt;uint8&gt;(28, 28) representing a grayscale image.</p><p>The infer() call iterates over the topologically sorted nodes. For each node, it does the following:</p><ul><li><p>Read the input names used by the node.</p></li><li><p>Read the inputs from a Tensor store into a vector of inputs.</p></li><li><p>Based on the node&#8217;s operation type, read additional input information, such as whether a matrix is transposed in Gemm.</p></li><li><p>Pass the inputs to the corresponding operator function.</p></li><li><p>Save the output to the Tensor store, or print it if it&#8217;s the final result.</p></li></ul><p>Here&#8217;s part of the method to give you the main idea. You can see the full source code on <a href="https://github.com/MichalPitr/inference_engine/blob/main/src/inference_engine.cpp">GitHub</a>. 
Each case simply prepares inputs and passes them to the operator function.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hKbL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F261c51c2-23f8-42bc-bd5a-fb32038c446a_1738x1710.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hKbL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F261c51c2-23f8-42bc-bd5a-fb32038c446a_1738x1710.png 424w, https://substackcdn.com/image/fetch/$s_!hKbL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F261c51c2-23f8-42bc-bd5a-fb32038c446a_1738x1710.png 848w, https://substackcdn.com/image/fetch/$s_!hKbL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F261c51c2-23f8-42bc-bd5a-fb32038c446a_1738x1710.png 1272w, https://substackcdn.com/image/fetch/$s_!hKbL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F261c51c2-23f8-42bc-bd5a-fb32038c446a_1738x1710.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hKbL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F261c51c2-23f8-42bc-bd5a-fb32038c446a_1738x1710.png" width="727" height="715.5157967032967" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/261c51c2-23f8-42bc-bd5a-fb32038c446a_1738x1710.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1433,&quot;width&quot;:1456,&quot;resizeWidth&quot;:727,&quot;bytes&quot;:390143,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hKbL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F261c51c2-23f8-42bc-bd5a-fb32038c446a_1738x1710.png 424w, https://substackcdn.com/image/fetch/$s_!hKbL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F261c51c2-23f8-42bc-bd5a-fb32038c446a_1738x1710.png 848w, https://substackcdn.com/image/fetch/$s_!hKbL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F261c51c2-23f8-42bc-bd5a-fb32038c446a_1738x1710.png 1272w, https://substackcdn.com/image/fetch/$s_!hKbL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F261c51c2-23f8-42bc-bd5a-fb32038c446a_1738x1710.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Code listing of part of the infer method.</figcaption></figure></div><p>The operator functions are implemented in <a href="https://github.com/MichalPitr/inference_engine/blob/main/src/operators.cpp">operators.cpp</a>. Since I decided to support only four operations, implementing them from scratch isn&#8217;t too bad. We are giving up some performance here, but I&#8217;d like to explore C++ profiling tooling in a follow-up post anyway. 
Especially around things like memory access and cache locality.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Sqqw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7399b9-1825-4f1f-9133-c7ad3a54ca3a_84x84.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Sqqw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7399b9-1825-4f1f-9133-c7ad3a54ca3a_84x84.png 424w, https://substackcdn.com/image/fetch/$s_!Sqqw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7399b9-1825-4f1f-9133-c7ad3a54ca3a_84x84.png 848w, https://substackcdn.com/image/fetch/$s_!Sqqw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7399b9-1825-4f1f-9133-c7ad3a54ca3a_84x84.png 1272w, https://substackcdn.com/image/fetch/$s_!Sqqw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7399b9-1825-4f1f-9133-c7ad3a54ca3a_84x84.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Sqqw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7399b9-1825-4f1f-9133-c7ad3a54ca3a_84x84.png" width="200" height="200" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4a7399b9-1825-4f1f-9133-c7ad3a54ca3a_84x84.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:84,&quot;width&quot;:84,&quot;resizeWidth&quot;:200,&quot;bytes&quot;:6952,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Sqqw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7399b9-1825-4f1f-9133-c7ad3a54ca3a_84x84.png 424w, https://substackcdn.com/image/fetch/$s_!Sqqw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7399b9-1825-4f1f-9133-c7ad3a54ca3a_84x84.png 848w, https://substackcdn.com/image/fetch/$s_!Sqqw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7399b9-1825-4f1f-9133-c7ad3a54ca3a_84x84.png 1272w, https://substackcdn.com/image/fetch/$s_!Sqqw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7399b9-1825-4f1f-9133-c7ad3a54ca3a_84x84.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Upscaled image of number 7</figcaption></figure></div><p>And that&#8217;s pretty much it. 
Let&#8217;s run the inference engine on this image of number 7.</p><pre><code>michal@michal-lg:~/code/inference_engine$ /home/michal/code/inference_engine/build/src/engine_exe /home/michal/code/inference_engine/models/mnist_ffn_complex.onnx /home/michal/code/inference_engine/inputs/image_0.ubyte
Out: Tensor((1, 10)[[407.129, -1327.89, 827.717, 1137.59, -1497.12, -73.3868, -2284.66, 2266.74, 1.9645, 475.585]])</code></pre><p>To determine the model&#8217;s prediction, we take the argmax of the output. Here the output at index 7 (0-indexed) is the largest, so the model correctly predicts that the image is a 7!</p><p>Not bad for ~2000 lines of C++.</p><p>Thanks for reading! We&#8217;ve covered how inference engines work. Now that we have a minimal engine, I want to extend it. Let me know in the comments which improvements you&#8217;d like to see!</p><div><hr></div><p>I hope you enjoyed this deep dive into inference engines! If you did, consider subscribing and/or following me on <a href="https://www.linkedin.com/in/michal-pitr-a7156b127/">LinkedIn</a>.</p><p>You might also enjoy some of my other posts. 
Maybe my deep dive into the <a href="https://michalpitr.substack.com/p/how-does-sqlite-store-data">SQLite storage format</a> or my implementation of <a href="https://michalpitr.substack.com/p/mapreduce-from-scratch">MapReduce from scratch</a>?</p>]]></content:encoded></item><item><title><![CDATA[GPU Programming]]></title><description><![CDATA[Writing code for massively parallel processors]]></description><link>https://michalpitr.substack.com/p/gpu-programming</link><guid isPermaLink="false">https://michalpitr.substack.com/p/gpu-programming</guid><dc:creator><![CDATA[Michal Pitr]]></dc:creator><pubDate>Sat, 04 May 2024 19:25:04 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ksVf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60744dcf-3477-4aa5-9cdb-f020a0ba874c_2472x888.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>One of the great joys of software engineering is dispelling magic. I&#8217;ve written code that executed on a GPU using frameworks like PyTorch or TensorFlow, but I never understood the &#8220;how&#8221;. It&#8217;s time to dispel the magic of GPU programming and learn how it works under the hood.</p>
<h2>C CUDA basics</h2><p>C CUDA is Nvidia&#8217;s extension of ANSI C. For the most part, it is the same as C with some added syntax and built-in functions. C CUDA gives us control over which parts of our code run on the CPU and which run on the GPU. Code that runs on the CPU is called host code, and code that runs on the GPU is called device code. For historical reasons, procedures that run on the GPU are called kernels.</p><p>Instead of focusing on CUDA itself, let&#8217;s write a simple program that blurs an image. I&#8217;ll try to fill in the details as needed.</p><h2>Blurring images</h2><p>We want to write code to blur an image on a GPU.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ksVf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60744dcf-3477-4aa5-9cdb-f020a0ba874c_2472x888.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ksVf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60744dcf-3477-4aa5-9cdb-f020a0ba874c_2472x888.png 424w, https://substackcdn.com/image/fetch/$s_!ksVf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60744dcf-3477-4aa5-9cdb-f020a0ba874c_2472x888.png 848w, 
https://substackcdn.com/image/fetch/$s_!ksVf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60744dcf-3477-4aa5-9cdb-f020a0ba874c_2472x888.png 1272w, https://substackcdn.com/image/fetch/$s_!ksVf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60744dcf-3477-4aa5-9cdb-f020a0ba874c_2472x888.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ksVf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60744dcf-3477-4aa5-9cdb-f020a0ba874c_2472x888.png" width="1456" height="523" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/60744dcf-3477-4aa5-9cdb-f020a0ba874c_2472x888.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:523,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3355965,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ksVf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60744dcf-3477-4aa5-9cdb-f020a0ba874c_2472x888.png 424w, https://substackcdn.com/image/fetch/$s_!ksVf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60744dcf-3477-4aa5-9cdb-f020a0ba874c_2472x888.png 848w, 
https://substackcdn.com/image/fetch/$s_!ksVf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60744dcf-3477-4aa5-9cdb-f020a0ba874c_2472x888.png 1272w, https://substackcdn.com/image/fetch/$s_!ksVf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60744dcf-3477-4aa5-9cdb-f020a0ba874c_2472x888.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Blurring an image with a GPU</figcaption></figure></div><p>Here&#8217;s roughly what our code needs to 
do:</p><ul><li><p>Load the image in the host code</p></li><li><p>Allocate memory on the GPU</p></li><li><p>Copy the input image to the GPU</p></li><li><p>Blur the image with a kernel</p></li><li><p>Copy the output image back to the CPU</p></li><li><p>Save the output image to disk</p></li></ul><p>First, we need to know how an image is represented in memory and how to blur it. An RGB image is usually thought of as a 3-dimensional array of shape (height, width, channels). In memory, it&#8217;s usually stored as a flat array in row-major order, so the three channel values of each pixel sit next to each other. Our GPU code will assume this format.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!--Nz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F124766e9-d68d-4638-a87d-3ee650fe8327_1600x339.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!--Nz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F124766e9-d68d-4638-a87d-3ee650fe8327_1600x339.png 424w, https://substackcdn.com/image/fetch/$s_!--Nz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F124766e9-d68d-4638-a87d-3ee650fe8327_1600x339.png 848w, https://substackcdn.com/image/fetch/$s_!--Nz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F124766e9-d68d-4638-a87d-3ee650fe8327_1600x339.png 1272w, https://substackcdn.com/image/fetch/$s_!--Nz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F124766e9-d68d-4638-a87d-3ee650fe8327_1600x339.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!--Nz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F124766e9-d68d-4638-a87d-3ee650fe8327_1600x339.png" width="1456" height="308" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/124766e9-d68d-4638-a87d-3ee650fe8327_1600x339.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:308,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!--Nz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F124766e9-d68d-4638-a87d-3ee650fe8327_1600x339.png 424w, https://substackcdn.com/image/fetch/$s_!--Nz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F124766e9-d68d-4638-a87d-3ee650fe8327_1600x339.png 848w, https://substackcdn.com/image/fetch/$s_!--Nz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F124766e9-d68d-4638-a87d-3ee650fe8327_1600x339.png 1272w, https://substackcdn.com/image/fetch/$s_!--Nz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F124766e9-d68d-4638-a87d-3ee650fe8327_1600x339.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">RGB image represented in row-major order</figcaption></figure></div><p>To access the (n, row, col) pixel in 
a 3-channel image, we can use the following expression, where n is the channel index (0, 1, or 2).</p><pre><code>i = (row*WIDTH + col)*3 + n</code></pre><p>To blur an image, we calculate the value of each output pixel as the average of the input pixels in its neighborhood (the pixel itself and those surrounding it) and write the result into the output image.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2dOp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5448b7b4-f384-48d8-8acb-f8b9ffa28bd9_2155x752.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2dOp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5448b7b4-f384-48d8-8acb-f8b9ffa28bd9_2155x752.png 424w, https://substackcdn.com/image/fetch/$s_!2dOp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5448b7b4-f384-48d8-8acb-f8b9ffa28bd9_2155x752.png 848w, https://substackcdn.com/image/fetch/$s_!2dOp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5448b7b4-f384-48d8-8acb-f8b9ffa28bd9_2155x752.png 1272w, https://substackcdn.com/image/fetch/$s_!2dOp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5448b7b4-f384-48d8-8acb-f8b9ffa28bd9_2155x752.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2dOp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5448b7b4-f384-48d8-8acb-f8b9ffa28bd9_2155x752.png" width="1456" height="508" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5448b7b4-f384-48d8-8acb-f8b9ffa28bd9_2155x752.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:508,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:137516,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2dOp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5448b7b4-f384-48d8-8acb-f8b9ffa28bd9_2155x752.png 424w, https://substackcdn.com/image/fetch/$s_!2dOp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5448b7b4-f384-48d8-8acb-f8b9ffa28bd9_2155x752.png 848w, https://substackcdn.com/image/fetch/$s_!2dOp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5448b7b4-f384-48d8-8acb-f8b9ffa28bd9_2155x752.png 1272w, https://substackcdn.com/image/fetch/$s_!2dOp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5448b7b4-f384-48d8-8acb-f8b9ffa28bd9_2155x752.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image blurring with a 3x3 blur filter</figcaption></figure></div><p>Can we parallelize this? Of course! Each output pixel depends only on the input image and not on any other output pixel. If we had a processor with width*height cores, we could process every output pixel in parallel. It turns out that&#8217;s pretty much what GPUs are!</p><h2>Writing a kernel</h2><p>Let&#8217;s finally write our kernel. It will closely follow the 2D example above, but generalize it to 3-channel (RGB) images.</p><p>When a kernel is launched, many threads execute it in parallel, each running its own instance of the kernel code. Each thread will compute the RGB channels for a single pixel in the image.</p><p>To tell each thread which pixel to compute, CUDA provides the built-in variables blockIdx, blockDim, and threadIdx inside the kernel. 
We use these to determine which pixel a given thread should process.</p><p>Once we know which pixel we are processing, we iterate over the neighboring pixels and accumulate their red, green, and blue values in pixVarR, pixVarG, and pixVarB. We also count the number of pixels we&#8217;ve iterated over, to handle cases where the blur-radius reaches beyond the edges of the image.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!01gH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d3246d4-f6e3-409d-b86c-afa998fda546_1622x754.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!01gH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d3246d4-f6e3-409d-b86c-afa998fda546_1622x754.png 424w, https://substackcdn.com/image/fetch/$s_!01gH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d3246d4-f6e3-409d-b86c-afa998fda546_1622x754.png 848w, https://substackcdn.com/image/fetch/$s_!01gH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d3246d4-f6e3-409d-b86c-afa998fda546_1622x754.png 1272w, https://substackcdn.com/image/fetch/$s_!01gH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d3246d4-f6e3-409d-b86c-afa998fda546_1622x754.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!01gH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d3246d4-f6e3-409d-b86c-afa998fda546_1622x754.png" width="558" 
height="259.4546703296703" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1d3246d4-f6e3-409d-b86c-afa998fda546_1622x754.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:677,&quot;width&quot;:1456,&quot;resizeWidth&quot;:558,&quot;bytes&quot;:93962,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!01gH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d3246d4-f6e3-409d-b86c-afa998fda546_1622x754.png 424w, https://substackcdn.com/image/fetch/$s_!01gH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d3246d4-f6e3-409d-b86c-afa998fda546_1622x754.png 848w, https://substackcdn.com/image/fetch/$s_!01gH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d3246d4-f6e3-409d-b86c-afa998fda546_1622x754.png 1272w, https://substackcdn.com/image/fetch/$s_!01gH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d3246d4-f6e3-409d-b86c-afa998fda546_1622x754.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Applying the blur filter in edge-cases</figcaption></figure></div><p>Note that the coordinate calculation might feel unnatural since the image is flattened as discussed earlier.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!o0WV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ecc1921-0240-43c7-8345-b925ab5adc2a_1780x1412.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!o0WV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ecc1921-0240-43c7-8345-b925ab5adc2a_1780x1412.png 424w, 
https://substackcdn.com/image/fetch/$s_!o0WV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ecc1921-0240-43c7-8345-b925ab5adc2a_1780x1412.png 848w, https://substackcdn.com/image/fetch/$s_!o0WV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ecc1921-0240-43c7-8345-b925ab5adc2a_1780x1412.png 1272w, https://substackcdn.com/image/fetch/$s_!o0WV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ecc1921-0240-43c7-8345-b925ab5adc2a_1780x1412.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!o0WV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ecc1921-0240-43c7-8345-b925ab5adc2a_1780x1412.png" width="728" height="577.5" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0ecc1921-0240-43c7-8345-b925ab5adc2a_1780x1412.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:1155,&quot;width&quot;:1456,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:335255,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!o0WV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ecc1921-0240-43c7-8345-b925ab5adc2a_1780x1412.png 424w, 
https://substackcdn.com/image/fetch/$s_!o0WV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ecc1921-0240-43c7-8345-b925ab5adc2a_1780x1412.png 848w, https://substackcdn.com/image/fetch/$s_!o0WV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ecc1921-0240-43c7-8345-b925ab5adc2a_1780x1412.png 1272w, https://substackcdn.com/image/fetch/$s_!o0WV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ecc1921-0240-43c7-8345-b925ab5adc2a_1780x1412.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Blur kernel source code</figcaption></figure></div><p>You might notice the special __global__ qualifier in front of the kernel definition. This is how we specify that a function is a kernel and should be compiled to run on the GPU. It also tells the CUDA C compiler (NVCC) that this is device code, where the built-in blockIdx, blockDim, and threadIdx variables are available.</p><p>Now that we have the kernel, let&#8217;s briefly write the main function to set things up and run it. The main function closely follows the setup steps outlined earlier. The cuda-prefixed functions come from the CUDA runtime API, whose header NVCC includes automatically. They mostly mirror standard C library functions, providing similar functionality for GPU memory (for example, cudaMalloc and cudaMemcpy correspond to malloc and memcpy).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eUGJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24675370-6885-46e7-a440-a1fd494db5a2_2048x2046.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eUGJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24675370-6885-46e7-a440-a1fd494db5a2_2048x2046.png 424w, https://substackcdn.com/image/fetch/$s_!eUGJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24675370-6885-46e7-a440-a1fd494db5a2_2048x2046.png 848w, https://substackcdn.com/image/fetch/$s_!eUGJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24675370-6885-46e7-a440-a1fd494db5a2_2048x2046.png 1272w, 
https://substackcdn.com/image/fetch/$s_!eUGJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24675370-6885-46e7-a440-a1fd494db5a2_2048x2046.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eUGJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24675370-6885-46e7-a440-a1fd494db5a2_2048x2046.png" width="1456" height="1455" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/24675370-6885-46e7-a440-a1fd494db5a2_2048x2046.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1455,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:550793,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!eUGJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24675370-6885-46e7-a440-a1fd494db5a2_2048x2046.png 424w, https://substackcdn.com/image/fetch/$s_!eUGJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24675370-6885-46e7-a440-a1fd494db5a2_2048x2046.png 848w, https://substackcdn.com/image/fetch/$s_!eUGJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24675370-6885-46e7-a440-a1fd494db5a2_2048x2046.png 1272w, 
https://substackcdn.com/image/fetch/$s_!eUGJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24675370-6885-46e7-a440-a1fd494db5a2_2048x2046.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Main function source code. 
Main sets up the GPU to run the blur kernel.</figcaption></figure></div><p>The interesting part is the kernel launch.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uME_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98b262d9-d717-4eb8-9379-644ae6dc63ce_1494x518.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uME_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98b262d9-d717-4eb8-9379-644ae6dc63ce_1494x518.png 424w, https://substackcdn.com/image/fetch/$s_!uME_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98b262d9-d717-4eb8-9379-644ae6dc63ce_1494x518.png 848w, https://substackcdn.com/image/fetch/$s_!uME_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98b262d9-d717-4eb8-9379-644ae6dc63ce_1494x518.png 1272w, https://substackcdn.com/image/fetch/$s_!uME_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98b262d9-d717-4eb8-9379-644ae6dc63ce_1494x518.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uME_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98b262d9-d717-4eb8-9379-644ae6dc63ce_1494x518.png" width="1456" height="505" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/98b262d9-d717-4eb8-9379-644ae6dc63ce_1494x518.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:505,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:121276,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!uME_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98b262d9-d717-4eb8-9379-644ae6dc63ce_1494x518.png 424w, https://substackcdn.com/image/fetch/$s_!uME_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98b262d9-d717-4eb8-9379-644ae6dc63ce_1494x518.png 848w, https://substackcdn.com/image/fetch/$s_!uME_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98b262d9-d717-4eb8-9379-644ae6dc63ce_1494x518.png 1272w, https://substackcdn.com/image/fetch/$s_!uME_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98b262d9-d717-4eb8-9379-644ae6dc63ce_1494x518.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>We designed our kernel to process all channels for a given pixel with a single thread.</p><p>To make it concrete, our lenna.png is 512 x 512 pixels, so we need 512 * 512 = 262,144 threads to process the whole image. My GPU, however, only has 1920 CUDA cores. That&#8217;s fine: GPUs switch between threads far more cheaply than CPUs do, so launching more threads than there are physical cores is actually desirable. While some threads wait on memory, others can run, which maximizes throughput.</p><p>To launch that many threads, we use the &lt;&lt;&lt;dimGrid, dimBlock&gt;&gt;&gt; syntax. Each argument is a dim3 struct with 3 fields {x, y, z}. The configuration generally follows the shape of the input data. Since each thread corresponds to a single pixel, a natural division is to use 2 dimensions. We can group threads into 16x16 blocks, meaning each block will process a 16x16 patch of the image. Since we don&#8217;t need the z-axis, we leave it at 1.</p><p>The grid dimension tells us how many blocks per dimension to create. 
Since we need to cover the whole image using 16x16 patches, we need width/16 blocks in the <em>x</em> direction and height/16 blocks in the <em>y</em> direction. If the image dimensions don&#8217;t divide evenly by 16, we round up. For our 512x512 image, that works out to 32 blocks in each direction, or 1024 blocks of 256 threads each.</p><p>This rounding means that we might spawn some extra blocks in which only some of the threads have a pixel to process. To make sure the leftover threads behave correctly, we added the conditional check in our kernel. Only threads that have a corresponding pixel do any work!</p><p>Oof, that&#8217;s a lot of low-level details!</p><p>So, why 16x16 blocks? In our case, it is pretty arbitrary. We could&#8217;ve used 8x8 or 32x32 blocks. There&#8217;s a hard limit on the number of threads per block, which is usually 1024 on recent cards, so 32x32 is the largest square block we could use. As far as I understand, how threads are organized into blocks can affect performance thanks to memory locality.</p><p>Finally, let&#8217;s compile and run our code. I&#8217;m using BLUR_SIZE=31 for an extra blurry effect.</p><p>We can compile this with NVCC and run the resulting binary to produce the blurred image.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2odI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc20e4d37-a08b-4a19-aad9-831df1ed260a_1158x370.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2odI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc20e4d37-a08b-4a19-aad9-831df1ed260a_1158x370.png 424w, https://substackcdn.com/image/fetch/$s_!2odI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc20e4d37-a08b-4a19-aad9-831df1ed260a_1158x370.png 848w, 
https://substackcdn.com/image/fetch/$s_!2odI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc20e4d37-a08b-4a19-aad9-831df1ed260a_1158x370.png 1272w, https://substackcdn.com/image/fetch/$s_!2odI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc20e4d37-a08b-4a19-aad9-831df1ed260a_1158x370.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2odI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc20e4d37-a08b-4a19-aad9-831df1ed260a_1158x370.png" width="548" height="175.09499136442142" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c20e4d37-a08b-4a19-aad9-831df1ed260a_1158x370.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:370,&quot;width&quot;:1158,&quot;resizeWidth&quot;:548,&quot;bytes&quot;:57041,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2odI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc20e4d37-a08b-4a19-aad9-831df1ed260a_1158x370.png 424w, https://substackcdn.com/image/fetch/$s_!2odI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc20e4d37-a08b-4a19-aad9-831df1ed260a_1158x370.png 848w, 
https://substackcdn.com/image/fetch/$s_!2odI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc20e4d37-a08b-4a19-aad9-831df1ed260a_1158x370.png 1272w, https://substackcdn.com/image/fetch/$s_!2odI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc20e4d37-a08b-4a19-aad9-831df1ed260a_1158x370.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Compilation command</figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!drTc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7280cf8-1c95-4f77-b0ab-b9fbd0a41ff1_982x982.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!drTc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7280cf8-1c95-4f77-b0ab-b9fbd0a41ff1_982x982.png 424w, https://substackcdn.com/image/fetch/$s_!drTc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7280cf8-1c95-4f77-b0ab-b9fbd0a41ff1_982x982.png 848w, https://substackcdn.com/image/fetch/$s_!drTc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7280cf8-1c95-4f77-b0ab-b9fbd0a41ff1_982x982.png 1272w, https://substackcdn.com/image/fetch/$s_!drTc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7280cf8-1c95-4f77-b0ab-b9fbd0a41ff1_982x982.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!drTc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7280cf8-1c95-4f77-b0ab-b9fbd0a41ff1_982x982.png" width="324" height="324" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e7280cf8-1c95-4f77-b0ab-b9fbd0a41ff1_982x982.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:982,&quot;width&quot;:982,&quot;resizeWidth&quot;:324,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!drTc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7280cf8-1c95-4f77-b0ab-b9fbd0a41ff1_982x982.png 424w, https://substackcdn.com/image/fetch/$s_!drTc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7280cf8-1c95-4f77-b0ab-b9fbd0a41ff1_982x982.png 848w, https://substackcdn.com/image/fetch/$s_!drTc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7280cf8-1c95-4f77-b0ab-b9fbd0a41ff1_982x982.png 1272w, https://substackcdn.com/image/fetch/$s_!drTc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7280cf8-1c95-4f77-b0ab-b9fbd0a41ff1_982x982.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Blurred Lenna with BLUR_SIZE=31. You can learn more about the history of Lenna at <a href="http://lenna.org">lenna.org</a></figcaption></figure></div><p>You might wonder if we can run code directly on the GPU without any involvement from the CPU. As far as I know, that&#8217;s not possible. Our executable makes calls to the CUDA runtime API, which in turn communicates with the GPU driver. 
However, it&#8217;s possible to chain kernels so that as much of the computation as possible stays on the GPU without CPU intervention.</p><p>If you would like to learn more about this area, consider checking out the book Programming Massively Parallel Processors and the official CUDA C++ Programming Guide from NVIDIA.</p><div><hr></div><p>Thanks for reading! If you enjoyed this write-up, you might enjoy my previous one where I explain <a href="https://michalpitr.substack.com/p/mapreduce-from-scratch">how MapReduce works by building it from scratch</a>!</p><p>Researching and writing these articles takes a lot of time and effort. To ensure you don&#8217;t miss the next one, consider subscribing or following me on <a href="https://www.linkedin.com/in/michal-pitr-a7156b127/">LinkedIn</a>.</p>]]></content:encoded></item><item><title><![CDATA[MapReduce from Scratch]]></title><description><![CDATA[Building a distributed computing framework, step by step.]]></description><link>https://michalpitr.substack.com/p/mapreduce-from-scratch</link><guid isPermaLink="false">https://michalpitr.substack.com/p/mapreduce-from-scratch</guid><dc:creator><![CDATA[Michal Pitr]]></dc:creator><pubDate>Sun, 28 Apr 2024 21:33:39 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1d3a7a5-e2cc-4778-bafe-03f87e6a6884_1600x671.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Over the last couple of weeks, I&#8217;ve been building MapReduce from scratch.</p><p>This will be a long article: we&#8217;ll understand the need for distributed computing, rediscover why MapReduce is a natural way to model many problems, build our own version, see how the individual parts fit together, and solve a real problem with it!</p><h2>Motivating the problem</h2><p>Suppose we want to count word frequencies in a massive dataset. We tried processing it on a single machine but found it would take over a month. What can we do?<br><br>The first thought should be to get a faster machine, but one might not exist or might be too expensive. Instead, let&#8217;s see how we can distribute the problem across N commodity machines. 
We want an easy way to look up the frequency of any word with a simple query, e.g. a grep over a file.<br><br>Let&#8217;s start by splitting the dataset into N partitions and computing word frequencies for each subset using a different machine. This already gives us roughly an Nx speedup, minus some fixed overhead. To combine the results, we can add a final machine that takes these N partial result files and sums the frequencies of corresponding words.<br><br>This is not too far from the core idea behind MapReduce. The above process has two main steps: first, we <em>map</em> words to their frequency in a subset of input data and then <em>reduce</em> the intermediate results to obtain the final answer.</p><p>Note that this is pretty general. Imagine we have a large dataset of photos that we want to classify: we could do the image classification task as a map operation and then group images with the same class in the reduce phase.<br><br>Another observation is that the map part is usually the more expensive phase of the two, so generally, we will have more mappers than reducers.</p><p>Having hopefully convinced you that MapReduce is a reasonable idea, let&#8217;s see how the MapReduce paper solves the word frequency problem.</p>
<h2>Using MapReduce</h2><p>Below is the WordCounter MapReduce program, copied from the <a href="https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf">paper</a>. Let&#8217;s see how it works. Later, when we implement our version, we&#8217;ll aim to keep the usage semantics the same.</p><pre><code>#include "mapreduce/mapreduce.h"

// User&#8217;s map function
class WordCounter : public Mapper
{
public:
    virtual void Map(const MapInput &amp;input)
    {
        const string &amp;text = input.value();
        const int n = text.size();
        for (int i = 0; i &lt; n;)
        {
            // Skip past leading whitespace
            while ((i &lt; n) &amp;&amp; isspace(text[i]))
                i++;
            // Find word end
            int start = i;
            while ((i &lt; n) &amp;&amp; !isspace(text[i]))
                i++;
            if (start &lt; i)
                Emit(text.substr(start, i - start), "1");
        }
    }
};
REGISTER_MAPPER(WordCounter);

// User&#8217;s reduce function
class Adder : public Reducer
{
    virtual void Reduce(ReduceInput *input)
    {
        // Iterate over all entries with the
        // same key and add the values
        int64 value = 0;
        while (!input-&gt;done())
        {
            value += StringToInt(input-&gt;value());
            input-&gt;NextValue();
        }
        // Emit sum for input-&gt;key()
        Emit(IntToString(value));
    }
};
REGISTER_REDUCER(Adder);

int main(int argc, char **argv)
{
    ParseCommandLineFlags(argc, argv);
    MapReduceSpecification spec;
    
    // Store list of input files into "spec"
    for (int i = 1; i &lt; argc; i++)
    {
        MapReduceInput *input = spec.add_input();
        input-&gt;set_format("text");
        input-&gt;set_filepattern(argv[i]);
        input-&gt;set_mapper_class("WordCounter");
    }
    
    MapReduceOutput *out = spec.output();
    out-&gt;set_filebase("/gfs/test/freq");
    out-&gt;set_num_tasks(100);
    out-&gt;set_format("text");
    out-&gt;set_reducer_class("Adder");
    
    // Tuning parameters: use at most 2000
    // machines and 100 MB of memory per task
    spec.set_machines(2000);
    spec.set_map_megabytes(100);
    spec.set_reduce_megabytes(100);
    
    // Now run it
    MapReduceResult result;
    if (!MapReduce(spec, &amp;result))
        abort();
    // Done: &#8217;result&#8217; structure contains info
    // about counters, time taken, number of
    // machines used, etc.
    return 0;
}</code></pre><p>The user program has three parts: a map function, a reduce function, and configuration. Most of the heavy lifting is handled by the imported <code>mapreduce</code> library.</p><p>The map function splits the input text into words and emits key-value pairs. For instance, <em>&#8220;the quick brown fox jumps over the lazy dog&#8221;</em> is mapped to these key-value pairs: [&#8220;the&#8221;:1, &#8220;quick&#8221;:1, &#8220;brown&#8221;:1, &#8220;fox&#8221;:1, &#8220;jumps&#8221;:1, &#8220;over&#8221;:1, &#8220;the&#8221;:1, &#8220;lazy&#8221;:1, &#8220;dog&#8221;:1]. Notice that the pair (&#8220;the&#8221;:1) is emitted twice.</p><p>The reduce function operates per key. For instance, given the input [&#8220;the&#8221;:1, &#8220;the&#8221;:1, &#8220;the&#8221;:1] it would emit (&#8220;the&#8221;:3).</p><p>The config handles input-output, formats, and the amount of resources available for the MapReduce job.</p><p>In fewer than 100 lines of code, we can solve the word counting problem by utilizing thousands of machines! 
That&#8217;s pretty neat and shows how powerful an abstraction MapReduce is.</p><h2>Gathering requirements</h2><p>With the paper as our inspiration, let&#8217;s start outlining the requirements for our implementation:</p><ul><li><p>The user imports our MapReduce library, implements Map and Reduce functions, and provides configuration.</p></li><li><p>When a user executes the binary, it is copied to multiple machines where it executes in either Mapper or Reducer mode.</p></li><li><p>Those machines must have access to input data.</p></li><li><p>Outputs of mappers must be available to reducers.</p></li><li><p>The user can access the final results.</p></li></ul><h2>Infrastructure</h2><p>When I started working on this, these requirements presented two main unknowns: <em>how to distribute the binary to other machines and how to make input data available to them.</em></p><p>After some research, I settled on hosting my input data on a networked storage server, and I chose NFS. We can mount the networked directory on every machine and allow the machines to read and write to it. It won&#8217;t be as performant or scalable as Google File System or HDFS, but it will do for our purposes.</p><p>To distribute the binary to multiple machines, execute it, and monitor its status, we could probably get quite far by copying the binary via SCP, SSHing into machines, executing the binary, and then waiting for it to finish. We could even implement health-pinging and progress-reporting.</p><p>However, after twiddling my thumbs for a while, I realized that this sounds an awful lot like cluster orchestration and decided to leverage Kubernetes to make my life easier. We can package the binary into a Docker image and execute it as Kubernetes <a href="https://kubernetes.io/docs/concepts/workloads/controllers/job/">Jobs</a>. Integrating NFS with Kubernetes is easy enough via Persistent Volumes and Persistent Volume Claims. 
Don&#8217;t worry, you don&#8217;t need to understand these concepts, but I mention them for completeness.</p><p>Since I&#8217;m developing this for educational purposes, I use <a href="https://minikube.sigs.k8s.io/docs/start/">minikube</a> to spin up a local virtual cluster on my laptop. I run my NFS server as a docker container outside the cluster and connect it to the cluster via docker networking. The advantage of this setup is that it can be easily migrated to several physical or virtual machines.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Qr92!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa41a20ea-44f8-49c1-be92-e53ab9f17a2b_1304x747.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Qr92!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa41a20ea-44f8-49c1-be92-e53ab9f17a2b_1304x747.png 424w, https://substackcdn.com/image/fetch/$s_!Qr92!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa41a20ea-44f8-49c1-be92-e53ab9f17a2b_1304x747.png 848w, https://substackcdn.com/image/fetch/$s_!Qr92!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa41a20ea-44f8-49c1-be92-e53ab9f17a2b_1304x747.png 1272w, https://substackcdn.com/image/fetch/$s_!Qr92!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa41a20ea-44f8-49c1-be92-e53ab9f17a2b_1304x747.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Qr92!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa41a20ea-44f8-49c1-be92-e53ab9f17a2b_1304x747.png" width="1304" height="747" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a41a20ea-44f8-49c1-be92-e53ab9f17a2b_1304x747.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:747,&quot;width&quot;:1304,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Qr92!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa41a20ea-44f8-49c1-be92-e53ab9f17a2b_1304x747.png 424w, https://substackcdn.com/image/fetch/$s_!Qr92!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa41a20ea-44f8-49c1-be92-e53ab9f17a2b_1304x747.png 848w, https://substackcdn.com/image/fetch/$s_!Qr92!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa41a20ea-44f8-49c1-be92-e53ab9f17a2b_1304x747.png 1272w, https://substackcdn.com/image/fetch/$s_!Qr92!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa41a20ea-44f8-49c1-be92-e53ab9f17a2b_1304x747.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Infrastructure setup</figcaption></figure></div><p>With infrastructure sorted out, let&#8217;s start writing our MapReduce framework!</p><h2>Using my MapReduce</h2><p>First, we&#8217;ll explore how to solve the word-counting problem using my MapReduce implementation. Then, we&#8217;ll deep dive into the implementation to understand how individual components function.</p><p>As promised, I tried to keep the usage semantics similar to the original paper. Compare this MapReduce program using my Go implementation against the C++ example we saw before! </p><pre><code>// main.go

import (
  &#8230;

  "github.com/MichalPitr/map_reduce/pkg/config"
  "github.com/MichalPitr/map_reduce/pkg/interfaces"
  "github.com/MichalPitr/map_reduce/pkg/mapreduce"
)

type WordCounter struct {
  wordRegex *regexp.Regexp
}

func (wc *WordCounter) Map(input interfaces.MapInput, emit func(key, value string)) {
  text := input.Value()
  text = strings.ToLower(text)
  words := wc.wordRegex.FindAllString(text, -1)
  for _, word := range words {
    emit(word, "1")
  }
}

type Adder struct{}

func (a *Adder) Reduce(input interfaces.ReducerInput, emit func(value string)) {
  val := 0
  for !input.Done() {
    num, err := strconv.Atoi(input.Value())
    if err != nil {
      log.Printf("Failed converting input to integer, skipping: %q", input.Value())
      input.NextValue()
      continue
    }
    val += num
    input.NextValue()
  }
  emit(strconv.Itoa(val))
}

func main() {
  cfg := config.SetupJobConfig()
  log.Printf("cfg: %+v", cfg)
  cfg.NumReducers = 2
  cfg.NumMappers = 4

  cfg.Mapper = &amp;WordCounter{wordRegex: regexp.MustCompile(`\b\w+\b`)}
  cfg.Reducer = &amp;Adder{}

  mapreduce.Execute(cfg)
}</code></pre><p>Let&#8217;s spend some time understanding how my solution works behind the scenes. We can reference a handy diagram from the paper illustrating this. </p><p>The user&#8217;s program is executed in three modes, Master, Mapper, and Reducer, on different machines. At a high level, the master handles the entire job orchestration, mappers do the expensive map operation over input files, and reducers combine intermediate results from mappers. In my solution, the mode is determined by a CLI flag, e.g. <code>--mode=master</code> for master mode. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!v45y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96d2f911-95ac-4428-9afe-6d1b0e25340b_818x644.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!v45y!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96d2f911-95ac-4428-9afe-6d1b0e25340b_818x644.png 424w, https://substackcdn.com/image/fetch/$s_!v45y!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96d2f911-95ac-4428-9afe-6d1b0e25340b_818x644.png 848w, https://substackcdn.com/image/fetch/$s_!v45y!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96d2f911-95ac-4428-9afe-6d1b0e25340b_818x644.png 1272w, https://substackcdn.com/image/fetch/$s_!v45y!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96d2f911-95ac-4428-9afe-6d1b0e25340b_818x644.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!v45y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96d2f911-95ac-4428-9afe-6d1b0e25340b_818x644.png" width="728" height="573.1442542787286" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/96d2f911-95ac-4428-9afe-6d1b0e25340b_818x644.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:644,&quot;width&quot;:818,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!v45y!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96d2f911-95ac-4428-9afe-6d1b0e25340b_818x644.png 424w, https://substackcdn.com/image/fetch/$s_!v45y!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96d2f911-95ac-4428-9afe-6d1b0e25340b_818x644.png 848w, https://substackcdn.com/image/fetch/$s_!v45y!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96d2f911-95ac-4428-9afe-6d1b0e25340b_818x644.png 1272w, https://substackcdn.com/image/fetch/$s_!v45y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96d2f911-95ac-4428-9afe-6d1b0e25340b_818x644.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"></figcaption></figure></div><h2>Master</h2><p>Master mode splits input files into subsets, prepares NFS directories, launches mapper jobs with assigned files, and waits for them to finish. Then it repeats the process for reducers. The main master function is simple enough that I can list it here:</p><pre><code>// /pkg/master/master.go

func Run(cfg *config.Config) {
  clientset := createKubernetesClient()
  numNodes := getNumberOfNodes(clientset)
  mustValidateConfig(cfg, numNodes)

  jobId := fmt.Sprintf("job-%s", time.Now().Format("2006-01-02-15-04-05"))
  log.Printf("Running master: %s", jobId)

  mustCreateJobDir(cfg.NfsPath, jobId)
  fileRanges := partitionInputFiles(cfg.InputDir, cfg.NumMappers)

  t0 := time.Now()
  launchMappers(cfg, clientset, jobId, fileRanges)
  waitForJobsToComplete(clientset, jobId, "mapper")
  log.Printf("Mappers took %v to finish", time.Since(t0))

  t1 := time.Now()
  launchReducers(cfg, clientset, jobId)
  waitForJobsToComplete(clientset, jobId, "reducer")
  log.Printf("Reducers took %v to finish", time.Since(t1))
  log.Printf("Total runtime: %v", time.Since(t0))
}</code></pre><p>Let&#8217;s understand the <code>launchMappers</code> function. It creates a Kubernetes job for each mapper. The job spec specifies:</p><ul><li><p>Docker image containing our binary.</p></li><li><p>CLI arguments necessary for the mapper: mapper mode, input/output dirs, and files to process.</p></li><li><p>How to mount the NFS storage.</p></li></ul><p>Wait&#8230; Docker image? We never created one! I&#8217;ll show this in more detail when we finally run our solution, but Kubernetes revolves around containers, so we package our binary into a Docker image that Kubernetes pods can pull and run.</p><p>After <code>launchMappers</code> creates the mapper jobs, Kubernetes assigns them to nodes within the cluster, starts the containers, and monitors their status.</p><p>Now that we can run mappers, it&#8217;s time to learn how they work!</p><h2>Mappers</h2><p>In mapper mode, the binary scans input files line by line and passes each line to the user&#8217;s map function. Map processes the text and emits key-value pairs using the injected emit function.</p><p>In my implementation, the emit function is super simple: it stores the key-value pairs in a map of dynamic lists. A more complete implementation would do some buffering and then flush the buffer to disk.</p><p>When the mapper finishes processing all inputs, it saves the key-value pairs, sorted by key, to intermediate files in the NFS storage, from where reducers will later read them for the final processing. 
Let&#8217;s zoom out here and see how this mapping from intermediate files to reducers works.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nQBk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e7b17a6-a26c-4f89-9611-020b2b68bbd1_1600x648.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nQBk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e7b17a6-a26c-4f89-9611-020b2b68bbd1_1600x648.png 424w, https://substackcdn.com/image/fetch/$s_!nQBk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e7b17a6-a26c-4f89-9611-020b2b68bbd1_1600x648.png 848w, https://substackcdn.com/image/fetch/$s_!nQBk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e7b17a6-a26c-4f89-9611-020b2b68bbd1_1600x648.png 1272w, https://substackcdn.com/image/fetch/$s_!nQBk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e7b17a6-a26c-4f89-9611-020b2b68bbd1_1600x648.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nQBk!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e7b17a6-a26c-4f89-9611-020b2b68bbd1_1600x648.png" width="1020" height="413.3241758241758" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0e7b17a6-a26c-4f89-9611-020b2b68bbd1_1600x648.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:590,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1020,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nQBk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e7b17a6-a26c-4f89-9611-020b2b68bbd1_1600x648.png 424w, https://substackcdn.com/image/fetch/$s_!nQBk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e7b17a6-a26c-4f89-9611-020b2b68bbd1_1600x648.png 848w, https://substackcdn.com/image/fetch/$s_!nQBk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e7b17a6-a26c-4f89-9611-020b2b68bbd1_1600x648.png 1272w, https://substackcdn.com/image/fetch/$s_!nQBk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e7b17a6-a26c-4f89-9611-020b2b68bbd1_1600x648.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>We want to partition the intermediate files by key such that all occurrences of a key are processed by the same reducer. For instance, <em>&#8220;brown&#8221;</em> and <em>&#8220;the&#8221;</em> appear in the intermediate files of both Mapper 1 and Mapper 2. We want to ensure that every occurrence of each word ends up at the same reducer. 
Let&#8217;s see an example.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mcJ1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1d3a7a5-e2cc-4778-bafe-03f87e6a6884_1600x671.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mcJ1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1d3a7a5-e2cc-4778-bafe-03f87e6a6884_1600x671.png 424w, https://substackcdn.com/image/fetch/$s_!mcJ1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1d3a7a5-e2cc-4778-bafe-03f87e6a6884_1600x671.png 848w, https://substackcdn.com/image/fetch/$s_!mcJ1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1d3a7a5-e2cc-4778-bafe-03f87e6a6884_1600x671.png 1272w, https://substackcdn.com/image/fetch/$s_!mcJ1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1d3a7a5-e2cc-4778-bafe-03f87e6a6884_1600x671.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mcJ1!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1d3a7a5-e2cc-4778-bafe-03f87e6a6884_1600x671.png" width="1036" height="434.75" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c1d3a7a5-e2cc-4778-bafe-03f87e6a6884_1600x671.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:611,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1036,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mcJ1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1d3a7a5-e2cc-4778-bafe-03f87e6a6884_1600x671.png 424w, https://substackcdn.com/image/fetch/$s_!mcJ1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1d3a7a5-e2cc-4778-bafe-03f87e6a6884_1600x671.png 848w, https://substackcdn.com/image/fetch/$s_!mcJ1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1d3a7a5-e2cc-4778-bafe-03f87e6a6884_1600x671.png 1272w, https://substackcdn.com/image/fetch/$s_!mcJ1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1d3a7a5-e2cc-4778-bafe-03f87e6a6884_1600x671.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>To achieve this, when we save intermediate results from mappers, we partition the keys by the number of reducers R using the formula</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{hash}(\\text{key}) \\text{ mod} \\text{ R}.&quot;,&quot;id&quot;:&quot;CHOFHMZVSJ&quot;}" data-component-name="LatexBlockToDOM"></div><p>For instance, using FNV hash and R = 2, we get</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;1 \\equiv \\text{FNV}(\\textit{brown}) \\text{ mod } 2 &quot;,&quot;id&quot;:&quot;LIKWPRZPTT&quot;}" data-component-name="LatexBlockToDOM"></div><p>(maths note: this reads as &#8220;1 is congruent to FNV(<em>brown</em>) mod 2&#8221;.)</p><p>Since the result is 1 and partitions are counted from 0, we assign <em>&#8220;brown&#8221;</em> to the 2nd intermediate file, which is to be processed by reducer 2. 
Notice how in the example above, all <em>&#8220;the&#8221;</em>s landed in the blue files, while all the <em>&#8220;brown&#8221;</em>s landed in the red files!</p><p>This concludes the discussion on mappers - next, let&#8217;s see how reducers work.</p><h2>Reducers</h2><p>As highlighted previously, the reducer&#8217;s job is to read key-value pairs from the assigned intermediate files and process them using the user-defined reduce function. </p><p>We can be certain of two things: the key-value pairs in each intermediate file are sorted by key, and if a key <strong>A</strong> is present in one of these intermediate files, we are guaranteed that key <strong>A</strong> is not present in any file assigned to other reducers.</p><p>I tried to be smarter about my reducer implementation and avoid loading all intermediate files into memory. This comes with an interesting algorithmic problem:</p><p>Suppose we are trying to process 3 intermediate files, one key-value pair at a time, without loading everything into memory.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OYLU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4a4c593-bd86-4471-9263-3fff6eadd11e_1076x998.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OYLU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4a4c593-bd86-4471-9263-3fff6eadd11e_1076x998.png 424w, https://substackcdn.com/image/fetch/$s_!OYLU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4a4c593-bd86-4471-9263-3fff6eadd11e_1076x998.png 848w, 
https://substackcdn.com/image/fetch/$s_!OYLU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4a4c593-bd86-4471-9263-3fff6eadd11e_1076x998.png 1272w, https://substackcdn.com/image/fetch/$s_!OYLU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4a4c593-bd86-4471-9263-3fff6eadd11e_1076x998.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OYLU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4a4c593-bd86-4471-9263-3fff6eadd11e_1076x998.png" width="552" height="511.98513011152414" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f4a4c593-bd86-4471-9263-3fff6eadd11e_1076x998.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:998,&quot;width&quot;:1076,&quot;resizeWidth&quot;:552,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!OYLU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4a4c593-bd86-4471-9263-3fff6eadd11e_1076x998.png 424w, https://substackcdn.com/image/fetch/$s_!OYLU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4a4c593-bd86-4471-9263-3fff6eadd11e_1076x998.png 848w, 
https://substackcdn.com/image/fetch/$s_!OYLU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4a4c593-bd86-4471-9263-3fff6eadd11e_1076x998.png 1272w, https://substackcdn.com/image/fetch/$s_!OYLU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4a4c593-bd86-4471-9263-3fff6eadd11e_1076x998.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>We can merge the key-value pairs on the fly using a min-heap! We load the first key-value pair from each file into the heap. 
Whenever we pop from the heap, we get the pair with the smallest key, so we read the next line from the file it came from and push it onto the heap. This gives us a memory-efficient way to read a single sorted stream of key-value pairs while holding only one pair per file in memory! You can find the implementation <a href="https://github.com/MichalPitr/map_reduce/blob/main/pkg/reducer/stream_merger.go">here</a>. </p><p>After processing all intermediate files, the reducer saves results to the NFS storage.</p><p>For instance, the above example would yield: [&#8220;abby&#8221;: 1, &#8220;alice&#8221;: 3, &#8220;bob&#8221;: 2, &#8220;car&#8221;: 2, &#8220;dog&#8221;: 1, &#8230;].</p><p>The MapReduce paper introduces a couple of additional optimizations that I skipped in my implementation. Astute readers can probably come up with some already - for instance, couldn&#8217;t we already do some of the reduction on the mapper?</p><p>Congratulations, you now have a pretty complete understanding of MapReduce! The hard part is over! Let&#8217;s put it together and count the word frequencies in the top <a href="https://www.gutenberg.org/browse/scores/top#authors-last30">100 most popular books on Project Gutenberg</a> using our MapReduce!</p><h2>MapReduce in action</h2><p>I downloaded the 100 most popular books on Project Gutenberg to <code>/mnt/nfs/input/</code> with this <a href="https://github.com/MichalPitr/map_reduce/blob/main/utils/create-dataset.py">script</a>. The book files are labeled <em>book-0</em> to <em>book-99</em>. I also created a 4-node minikube cluster with access to my NFS storage.</p><pre><code>michal@michal-ThinkPad-T490s:/mnt/nfs$ kubectl get nodes
NAME            STATUS   ROLES           AGE    VERSION
multinode       Ready    control-plane   27d    v1.28.3
multinode-m02   Ready    &lt;none&gt;          7d2h   v1.28.3
multinode-m04   Ready    &lt;none&gt;          7d1h   v1.28.3
multinode-m05   Ready    &lt;none&gt;          7d1h   v1.28.3</code></pre><p>I will be reusing the word-frequencies Go main file I showed earlier. First, we build the Docker image from this <a href="https://github.com/MichalPitr/map_reduce/blob/main/Dockerfile">Dockerfile</a> and push it to my registry. The mapper and reducer nodes will pull this image to run our workloads.</p><pre><code>docker build -t michalpitr/mapreduce:latest .
docker push michalpitr/mapreduce:latest</code></pre><p>Now all we need to do is run the master locally!</p><pre><code>go build .
./map_reduce --mode=master --image=michalpitr/mapreduce:latest --input-dir /mnt/nfs/input/ --nfs-path /mnt/nfs/</code></pre><p>Let&#8217;s inspect the execution logs.</p><pre><code>michal@michal-ThinkPad-T490s:~/code/map_reduce$ ./map_reduce --mode=master --image=michalpitr/mapreduce:latest --input-dir /mnt/nfs/input/ --nfs-path /mnt/nfs/
2024/04/28 17:13:53 Running master job-2024-04-28-17-13-53
2024/04/28 17:13:53 Creating mapper-0 for book-0-30
2024/04/28 17:13:53 Creating mapper-1 for book-31-53
2024/04/28 17:13:53 Creating mapper-2 for book-54-76
2024/04/28 17:13:53 Creating mapper-3 for book-77-99
2024/04/28 17:13:53 Waiting for jobs to finish.
2024/04/28 17:14:03 Waiting for jobs to finish.
2024/04/28 17:14:13 Waiting for jobs to finish.
2024/04/28 17:14:23 All jobs completed.
2024/04/28 17:14:23 Mappers took 30.1843218s to finish
2024/04/28 17:14:23 Creating reducer-0
2024/04/28 17:14:23 Creating reducer-1
2024/04/28 17:14:23 Waiting for jobs to finish.
2024/04/28 17:14:33 All jobs completed.
2024/04/28 17:14:33 Reducers took 10.106014703s to finish
2024/04/28 17:14:33 Total runtime: 40.290376921s</code></pre><p>The whole job took about 40s: roughly 30s for the mappers and 10s for the reducers. We aren&#8217;t expecting anything crazy here; after all, I&#8217;m running this on an underpowered laptop and not a <em>real</em> cluster.</p><p>Let&#8217;s walk through this a little. The master creates 4 mappers, each with an assigned book range.</p><p>Using <code>kubectl</code>, we can see that the cluster is creating 4 pods, one per mapper. This is when the cluster pulls the image we pushed earlier!</p><pre><code>michal@michal-ThinkPad-T490s:/mnt/nfs$ kubectl get pods
NAME             READY   STATUS              RESTARTS   AGE
mapper-0-zgz24   0/1     ContainerCreating   0          11s
mapper-1-m824k   0/1     ContainerCreating   0          11s
mapper-2-ctqqt   0/1     ContainerCreating   0          11s
mapper-3-l2ddh   0/1     ContainerCreating   0          11s</code></pre><p>These don&#8217;t take very long to start running and to finish.</p><pre><code>michal@michal-ThinkPad-T490s:/mnt/nfs$ kubectl get pods
NAME             READY   STATUS      RESTARTS   AGE
mapper-0-zgz24   1/1     Running     0          18s
mapper-1-m824k   0/1     Completed   0          18s
mapper-2-ctqqt   0/1     Completed   0          18s
mapper-3-l2ddh   0/1     Completed   0          18s</code></pre><p>The master periodically polls the Kubernetes API server for the status of these jobs. Once they are all finished, it launches the reducers.</p><p>Before we go to reducers, let&#8217;s see the intermediate files produced by one of the mappers:</p><pre><code>michal@michal-ThinkPad-T490s:/mnt/nfs/job-2024-04-28-17-13-53/mapper-3$ ls
partition-0  partition-1</code></pre><p>Inspecting the end of partition-1 shows key-value pairs for <em>zverkov</em>, a character from Dostoevsky&#8217;s <em>Notes from Underground</em>. Each occurrence is emitted as its own pair with a count of 1; summing them up is the reducer&#8217;s job. So far so good!</p><pre><code>michal@michal-ThinkPad-T490s:/mnt/nfs/job-2024-04-28-17-13-53/mapper-3$ tail partition-1
zverkov,1
zverkov,1
zverkov,1
zverkov,1
zverkov,1
&#8230;</code></pre><p>Next, the master launches reducers,</p><pre><code>michal@michal-ThinkPad-T490s:/mnt/nfs$ kubectl get pods
NAME              READY   STATUS              RESTARTS   AGE
&#8230;
reducer-0-4xt58   0/1     ContainerCreating   0          2s
reducer-1-xvhfk   0/1     ContainerCreating   0          2s</code></pre><p>which swiftly finish.</p><pre><code>michal@michal-ThinkPad-T490s:/mnt/nfs$ kubectl get pods
NAME              READY   STATUS      RESTARTS   AGE
&#8230;
reducer-0-4xt58   0/1     Completed   0          7s
reducer-1-xvhfk   0/1     Completed   0          7s</code></pre><p>That&#8217;s it - we just computed word frequencies using our very own MapReduce framework! Let&#8217;s see what we can learn from the results!</p><p>I can see the output files in the root of the job folder:</p><pre><code>michal@michal-ThinkPad-T490s:/mnt/nfs/job-2024-04-28-17-13-53$ ls
&#8230;  reducer-0  reducer-1</code></pre><p>Let&#8217;s use grep to find the frequencies of some words.</p><pre><code>michal@michal-ThinkPad-T490s:/mnt/nfs/job-2024-04-28-17-13-53$ grep -w zverkov reducer-0 reducer-1
reducer-1:zverkov,112

michal@michal-ThinkPad-T490s:/mnt/nfs/job-2024-04-28-17-13-53$ grep -w the reducer-0 reducer-1
reducer-1:the,822175

michal@michal-ThinkPad-T490s:/mnt/nfs/job-2024-04-28-17-13-53$ grep -w mapreduce reducer-0 reducer-1
michal@michal-ThinkPad-T490s:/mnt/nfs/job-2024-04-28-17-13-53$ 

michal@michal-ThinkPad-T490s:/mnt/nfs/job-2024-04-28-17-13-53$ grep -w map reducer-0 reducer-1
reducer-1:map,188

michal@michal-ThinkPad-T490s:/mnt/nfs/job-2024-04-28-17-13-53$ grep -w reduce reducer-0 reducer-1
reducer-0:reduce,123</code></pre><p>Unfortunately, there are no mentions of MapReduce in the top 100 most popular books on Project Gutenberg&#8230; maybe one day?</p><p>One final note, notice how these output files partition the result space by key. When we look for a word, it&#8217;s only present in one file! It&#8217;s almost as if we did something right!</p><p>If you made it this far, you might as well check out the <a href="https://github.com/MichalPitr/map_reduce">GitHub repo</a>. You&#8217;ll already be familiar with most ideas!</p><div><hr></div><p>Congratulations, you made it! If you enjoyed this deep dive, you might enjoy my previous one into <a href="https://michalpitr.substack.com/p/how-does-sqlite-store-data">SQLite storage format internals</a>. Researching and writing these articles takes a lot of time and effort. 
To ensure you don&#8217;t miss the next one, consider subscribing or following me on <a href="https://www.linkedin.com/in/michal-pitr-a7156b127/">LinkedIn</a>.</p><p></p>]]></content:encoded></item><item><title><![CDATA[How does SQLite store data?]]></title><description><![CDATA[What I learned by implementing (parts) of SQLite from scratch.]]></description><link>https://michalpitr.substack.com/p/how-does-sqlite-store-data</link><guid isPermaLink="false">https://michalpitr.substack.com/p/how-does-sqlite-store-data</guid><dc:creator><![CDATA[Michal Pitr]]></dc:creator><pubDate>Sun, 17 Mar 2024 16:50:48 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!xgg8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc850055-defb-4bdb-8da8-779e45b482f5_997x387.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Recently I&#8217;ve been implementing a subset of SQLite (the world&#8217;s most used database, btw) from scratch in Go. I&#8217;ll share what I&#8217;ve learned about how SQLite stores data on disk which will help us understand key database concepts.&nbsp;</p>
<p>Let&#8217;s keep it hands-on and create a new SQLite database, a Users table, and insert a single user.&nbsp;</p><pre><code><code>michal@michal-ThinkPad-T490s:$ sqlite3 mydb.db
SQLite version 3.37.2 2022-01-06 13:25:41
Enter ".help" for usage hints.
sqlite&gt; CREATE TABLE Users (
  Id INT PRIMARY KEY,
  Username VARCHAR(32),  
  Email VARCHAR(255));
sqlite&gt; INSERT INTO Users (Id, Username, Email) VALUES (1, "michal", "michal@example.com");
sqlite&gt; SELECT * FROM Users WHERE Id = 1;
1|michal|michal@example.com
sqlite&gt;&nbsp;</code></code></pre><p>As expected, everything works just fine. Let&#8217;s exit and reopen the database.</p><pre><code><code>sqlite&gt; .exit
$ sqlite3 mydb.db
sqlite&gt; SELECT * FROM Users WHERE Id = 1;
1|michal|michal@example.com</code></code></pre><p>We got our user back. At some point, SQLite wrote our changes to the <code>mydb.db</code> file - SQLite is unusual because it stores all data in a single file. Let&#8217;s inspect this file with a hex editor to see if we can find our data. The first 16 bytes of the file inform us of the SQLite format used.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!m0cW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1779f8b8-94cc-4968-aca8-be86556d0236_796x107.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!m0cW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1779f8b8-94cc-4968-aca8-be86556d0236_796x107.png 424w, https://substackcdn.com/image/fetch/$s_!m0cW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1779f8b8-94cc-4968-aca8-be86556d0236_796x107.png 848w, https://substackcdn.com/image/fetch/$s_!m0cW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1779f8b8-94cc-4968-aca8-be86556d0236_796x107.png 1272w, https://substackcdn.com/image/fetch/$s_!m0cW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1779f8b8-94cc-4968-aca8-be86556d0236_796x107.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!m0cW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1779f8b8-94cc-4968-aca8-be86556d0236_796x107.png" width="796" height="107" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1779f8b8-94cc-4968-aca8-be86556d0236_796x107.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:107,&quot;width&quot;:796,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!m0cW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1779f8b8-94cc-4968-aca8-be86556d0236_796x107.png 424w, https://substackcdn.com/image/fetch/$s_!m0cW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1779f8b8-94cc-4968-aca8-be86556d0236_796x107.png 848w, https://substackcdn.com/image/fetch/$s_!m0cW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1779f8b8-94cc-4968-aca8-be86556d0236_796x107.png 1272w, https://substackcdn.com/image/fetch/$s_!m0cW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1779f8b8-94cc-4968-aca8-be86556d0236_796x107.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a><figcaption class="image-caption">mydb.db&#8217;s first 32 bytes in a hex editor</figcaption></figure></div><p>A bit deeper, we find the schema for the <code>Users</code> table.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!EyCN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25904a73-b81c-4a37-b9a8-ba9c7470f2db_804x292.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EyCN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25904a73-b81c-4a37-b9a8-ba9c7470f2db_804x292.png 424w, https://substackcdn.com/image/fetch/$s_!EyCN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25904a73-b81c-4a37-b9a8-ba9c7470f2db_804x292.png 848w, https://substackcdn.com/image/fetch/$s_!EyCN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25904a73-b81c-4a37-b9a8-ba9c7470f2db_804x292.png 1272w, https://substackcdn.com/image/fetch/$s_!EyCN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25904a73-b81c-4a37-b9a8-ba9c7470f2db_804x292.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EyCN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25904a73-b81c-4a37-b9a8-ba9c7470f2db_804x292.png" width="804" height="292" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/25904a73-b81c-4a37-b9a8-ba9c7470f2db_804x292.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:292,&quot;width&quot;:804,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!EyCN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25904a73-b81c-4a37-b9a8-ba9c7470f2db_804x292.png 424w, https://substackcdn.com/image/fetch/$s_!EyCN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25904a73-b81c-4a37-b9a8-ba9c7470f2db_804x292.png 848w, https://substackcdn.com/image/fetch/$s_!EyCN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25904a73-b81c-4a37-b9a8-ba9c7470f2db_804x292.png 1272w, https://substackcdn.com/image/fetch/$s_!EyCN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25904a73-b81c-4a37-b9a8-ba9c7470f2db_804x292.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a><figcaption class="image-caption">Users database schema in mydb.db</figcaption></figure></div><p>And finally, the row we inserted!</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9lHc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1f8b0bb-ff6e-4167-b028-c10002532581_811x79.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9lHc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1f8b0bb-ff6e-4167-b028-c10002532581_811x79.png 424w, https://substackcdn.com/image/fetch/$s_!9lHc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1f8b0bb-ff6e-4167-b028-c10002532581_811x79.png 848w, 
https://substackcdn.com/image/fetch/$s_!9lHc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1f8b0bb-ff6e-4167-b028-c10002532581_811x79.png 1272w, https://substackcdn.com/image/fetch/$s_!9lHc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1f8b0bb-ff6e-4167-b028-c10002532581_811x79.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9lHc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1f8b0bb-ff6e-4167-b028-c10002532581_811x79.png" width="811" height="79" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d1f8b0bb-ff6e-4167-b028-c10002532581_811x79.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:79,&quot;width&quot;:811,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9lHc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1f8b0bb-ff6e-4167-b028-c10002532581_811x79.png 424w, https://substackcdn.com/image/fetch/$s_!9lHc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1f8b0bb-ff6e-4167-b028-c10002532581_811x79.png 848w, 
https://substackcdn.com/image/fetch/$s_!9lHc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1f8b0bb-ff6e-4167-b028-c10002532581_811x79.png 1272w, https://substackcdn.com/image/fetch/$s_!9lHc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1f8b0bb-ff6e-4167-b028-c10002532581_811x79.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Inserted row in mydb.db</figcaption></figure></div><p>Let&#8217;s see what else we can learn about this file by running <code>stat mydb.db</code>. </p><pre><code>michal@michal-ThinkPad-T490s:$ stat mydb.db 
  File: mydb.db
  Size: 12288           Blocks: 24         IO Block: 4096   regular file
Device: 10302h/66306d   Inode: 8938166     Links: 1</code></pre><p>The file size is 12288 bytes; we&#8217;ll come back to this. We might also notice that the IO Block is 4096 bytes. This also turns out to be the default page size on my machine, but in general, it depends on the OS and file system. </p><p>SQLite stores the database configuration in the first 100 bytes of the root page (the first page of the file). The page size is stored in bytes 16-17 and it is 4096 bytes. The number of pages used by the DB is stored in bytes 28-31 and it is 3. Multiply those two together and we get back the file size: 4096 * 3 = 12288 bytes, exactly what we found with <code>stat</code> earlier!</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qRhr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a5b70ac-b467-4db3-9e5b-4073d9efc86f_995x231.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qRhr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a5b70ac-b467-4db3-9e5b-4073d9efc86f_995x231.png 424w, https://substackcdn.com/image/fetch/$s_!qRhr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a5b70ac-b467-4db3-9e5b-4073d9efc86f_995x231.png 848w, https://substackcdn.com/image/fetch/$s_!qRhr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a5b70ac-b467-4db3-9e5b-4073d9efc86f_995x231.png 1272w, https://substackcdn.com/image/fetch/$s_!qRhr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a5b70ac-b467-4db3-9e5b-4073d9efc86f_995x231.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!qRhr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a5b70ac-b467-4db3-9e5b-4073d9efc86f_995x231.png" width="995" height="231" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3a5b70ac-b467-4db3-9e5b-4073d9efc86f_995x231.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:231,&quot;width&quot;:995,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:44263,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qRhr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a5b70ac-b467-4db3-9e5b-4073d9efc86f_995x231.png 424w, https://substackcdn.com/image/fetch/$s_!qRhr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a5b70ac-b467-4db3-9e5b-4073d9efc86f_995x231.png 848w, https://substackcdn.com/image/fetch/$s_!qRhr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a5b70ac-b467-4db3-9e5b-4073d9efc86f_995x231.png 1272w, https://substackcdn.com/image/fetch/$s_!qRhr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a5b70ac-b467-4db3-9e5b-4073d9efc86f_995x231.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">SQLite root page header</figcaption></figure></div><p>Let&#8217;s recap. 
We can create a table, insert rows, and store them on disk. Then we can read them back from disk.</p><h2><strong>Why do we split the database into pages?</strong></h2><p>I&#8217;ve gone ahead and populated the <code>Users</code> table with 100k entries using the following format: <code>sprintf(&quot;%d, user%d, user%d@example.com&quot;, i, i, i)</code></p><pre><code>sqlite&gt; SELECT COUNT(Id) FROM Users;
100000</code></pre><p>To store those 100k entries, SQLite uses 1361 pages taking up 5.4MB. SQLite internally uses a B+ tree data structure to store data. B+ trees are balanced N-ary trees with a (usually) large number of children per internal node. You can think of them like balanced binary search trees, but with more than 2 children per node, stored values only in leaf nodes, and information on how to traverse the tree in the interior nodes.&nbsp;</p><p>Each node in the tree corresponds to a physical page stored on disk. Inserted rows are stored in leaf nodes only. Index information is stored in interior nodes. I might mix calling them pages and nodes, but they mean the same thing.</p><h3>Leaf pages</h3><p>Let&#8217;s start with the simpler leaf pages. Both leaf and interior nodes share the same header format visualized in the diagram below. I&#8217;ve provided a legend for the fields that we care about.&nbsp;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xgg8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc850055-defb-4bdb-8da8-779e45b482f5_997x387.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xgg8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc850055-defb-4bdb-8da8-779e45b482f5_997x387.png 424w, https://substackcdn.com/image/fetch/$s_!xgg8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc850055-defb-4bdb-8da8-779e45b482f5_997x387.png 848w, 
https://substackcdn.com/image/fetch/$s_!xgg8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc850055-defb-4bdb-8da8-779e45b482f5_997x387.png 1272w, https://substackcdn.com/image/fetch/$s_!xgg8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc850055-defb-4bdb-8da8-779e45b482f5_997x387.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xgg8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc850055-defb-4bdb-8da8-779e45b482f5_997x387.png" width="997" height="387" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dc850055-defb-4bdb-8da8-779e45b482f5_997x387.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:387,&quot;width&quot;:997,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xgg8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc850055-defb-4bdb-8da8-779e45b482f5_997x387.png 424w, https://substackcdn.com/image/fetch/$s_!xgg8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc850055-defb-4bdb-8da8-779e45b482f5_997x387.png 848w, 
https://substackcdn.com/image/fetch/$s_!xgg8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc850055-defb-4bdb-8da8-779e45b482f5_997x387.png 1272w, https://substackcdn.com/image/fetch/$s_!xgg8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc850055-defb-4bdb-8da8-779e45b482f5_997x387.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Leaf/Interior page format</figcaption></figure></div><p>The 0th byte of the page determines the page type - <code>0x0d 
</code>indicates a leaf page. Bytes 3-4 store the number of cells stored in the node. In the leaf page shown below it&#8217;s 123 cells. The entries are stored in reverse order, so we can see that <code>user123</code> is stored first. The next two bytes store the offset at which the first entry starts - 254. Bytes 8-11 are only used by interior pages.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Na35!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25873472-fae2-45ae-85b7-81f02a932ad7_726x636.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Na35!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25873472-fae2-45ae-85b7-81f02a932ad7_726x636.png 424w, https://substackcdn.com/image/fetch/$s_!Na35!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25873472-fae2-45ae-85b7-81f02a932ad7_726x636.png 848w, https://substackcdn.com/image/fetch/$s_!Na35!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25873472-fae2-45ae-85b7-81f02a932ad7_726x636.png 1272w, https://substackcdn.com/image/fetch/$s_!Na35!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25873472-fae2-45ae-85b7-81f02a932ad7_726x636.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Na35!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25873472-fae2-45ae-85b7-81f02a932ad7_726x636.png" width="726" height="636" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/25873472-fae2-45ae-85b7-81f02a932ad7_726x636.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:636,&quot;width&quot;:726,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:224090,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Na35!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25873472-fae2-45ae-85b7-81f02a932ad7_726x636.png 424w, https://substackcdn.com/image/fetch/$s_!Na35!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25873472-fae2-45ae-85b7-81f02a932ad7_726x636.png 848w, https://substackcdn.com/image/fetch/$s_!Na35!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25873472-fae2-45ae-85b7-81f02a932ad7_726x636.png 1272w, https://substackcdn.com/image/fetch/$s_!Na35!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25873472-fae2-45ae-85b7-81f02a932ad7_726x636.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Leaf page in hex editor</figcaption></figure></div><p>Once we get to the first entry, we see an array of rows. When I was implementing my version of SQLite from scratch, I also stored a pointer to the parent page in the leaf page to make the implementation a little simpler, since sometimes you need to access the parent node from a leaf.</p><p>Next, let&#8217;s take a look at the more exciting interior nodes.</p><h3>Interior pages</h3><p>So far, we know how to store data. What&#8217;s missing is a way to efficiently retrieve it. That is precisely the role of interior pages in a B+ tree. Let&#8217;s see how they achieve this.&nbsp;</p><p>Interior pages have the 0th byte set to <code>0x05</code>. 
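</p><p>The header fields described so far can be pulled out with a few slices. This is a minimal Python sketch rather than SQLite&#8217;s own code: the names <code>PageHeader</code> and <code>parse_page_header</code> are made up for illustration, and it assumes <code>page</code> begins at the B-tree page header (on the very first page of the file, the 100-byte database header comes first and would need to be skipped).</p><pre><code>from typing import NamedTuple, Optional

LEAF_TABLE_PAGE = 0x0D      # page type byte of a leaf page
INTERIOR_TABLE_PAGE = 0x05  # page type byte of an interior page


class PageHeader(NamedTuple):
    page_type: int
    cell_count: int
    content_start: int
    right_child: Optional[int]  # only present on interior pages


def parse_page_header(page):
    """Parse the header fields discussed in this post from raw page bytes."""
    page_type = page[0]
    if page_type not in (LEAF_TABLE_PAGE, INTERIOR_TABLE_PAGE):
        raise ValueError(f"page type {page_type:#04x} is not handled in this sketch")
    # Multi-byte integers in the file are stored big-endian.
    cell_count = int.from_bytes(page[3:5], "big")      # bytes 3-4
    content_start = int.from_bytes(page[5:7], "big")   # bytes 5-6
    right_child = None
    if page_type == INTERIOR_TABLE_PAGE:
        # Bytes 8-11 are only meaningful on interior pages.
        right_child = int.from_bytes(page[8:12], "big")
    return PageHeader(page_type, cell_count, content_start, right_child)
</code></pre><p>For the leaf page shown earlier, this sketch would report 123 cells, cell content starting at offset 254, and no right child.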
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xsPL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2561849a-ef50-4eaf-8e17-52d02aa1849d_1142x343.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xsPL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2561849a-ef50-4eaf-8e17-52d02aa1849d_1142x343.png 424w, https://substackcdn.com/image/fetch/$s_!xsPL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2561849a-ef50-4eaf-8e17-52d02aa1849d_1142x343.png 848w, https://substackcdn.com/image/fetch/$s_!xsPL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2561849a-ef50-4eaf-8e17-52d02aa1849d_1142x343.png 1272w, https://substackcdn.com/image/fetch/$s_!xsPL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2561849a-ef50-4eaf-8e17-52d02aa1849d_1142x343.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xsPL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2561849a-ef50-4eaf-8e17-52d02aa1849d_1142x343.png" width="1142" height="343" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2561849a-ef50-4eaf-8e17-52d02aa1849d_1142x343.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:343,&quot;width&quot;:1142,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:70064,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xsPL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2561849a-ef50-4eaf-8e17-52d02aa1849d_1142x343.png 424w, https://substackcdn.com/image/fetch/$s_!xsPL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2561849a-ef50-4eaf-8e17-52d02aa1849d_1142x343.png 848w, https://substackcdn.com/image/fetch/$s_!xsPL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2561849a-ef50-4eaf-8e17-52d02aa1849d_1142x343.png 1272w, https://substackcdn.com/image/fetch/$s_!xsPL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2561849a-ef50-4eaf-8e17-52d02aa1849d_1142x343.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Interior page in hex editor</figcaption></figure></div><p>The header format is the same as for leaf nodes, so the number of cells is in bytes 3-4, here it&#8217;s 2 cells. The start of the cell array is indicated by bytes 5-6 as offset from the start of the page: 4082. 
The right child pointer is stored at bytes 8-11: 1216.</p><p>Let&#8217;s look at the array of cells and try to understand the cell format for interior pages.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6zhI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F351b501c-112b-4c0f-ae51-d36d59002685_797x105.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6zhI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F351b501c-112b-4c0f-ae51-d36d59002685_797x105.png 424w, https://substackcdn.com/image/fetch/$s_!6zhI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F351b501c-112b-4c0f-ae51-d36d59002685_797x105.png 848w, https://substackcdn.com/image/fetch/$s_!6zhI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F351b501c-112b-4c0f-ae51-d36d59002685_797x105.png 1272w, https://substackcdn.com/image/fetch/$s_!6zhI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F351b501c-112b-4c0f-ae51-d36d59002685_797x105.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6zhI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F351b501c-112b-4c0f-ae51-d36d59002685_797x105.png" width="797" height="105" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/351b501c-112b-4c0f-ae51-d36d59002685_797x105.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:105,&quot;width&quot;:797,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6zhI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F351b501c-112b-4c0f-ae51-d36d59002685_797x105.png 424w, https://substackcdn.com/image/fetch/$s_!6zhI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F351b501c-112b-4c0f-ae51-d36d59002685_797x105.png 848w, https://substackcdn.com/image/fetch/$s_!6zhI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F351b501c-112b-4c0f-ae51-d36d59002685_797x105.png 1272w, https://substackcdn.com/image/fetch/$s_!6zhI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F351b501c-112b-4c0f-ae51-d36d59002685_797x105.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Interior page cell array in hex editor</figcaption></figure></div><p>The array starts at <code>0x1ff2</code>. 
Each cell is composed of a uint32 page pointer followed by a varint max primary key.&nbsp;</p><p>Interpreting the big-endian uint32 is straightforward: <code>0x00000267 = 615</code>.</p><p>A varint is a variable-length encoding that represents integers of up to 64 bits using 1 to 9 bytes, so smaller integers take up less space.</p><p>To decode a varint, we need to look at each byte, starting with the most significant one, and write it in binary:</p><pre><code>0x849337
0x84 = 10000100
0x93 = 10010011
0x37 = 00110111</code></pre><p>When we reach a byte with the most significant bit set to 0, we have hit the last byte of the varint, which is how I knew that we only needed these 3 bytes. The stored value is then calculated by concatenating the 7 least significant bits:</p><pre><code>0000100 | 0010011 | 0110111 = 68023 in decimal</code></pre><p>Similarly for the second cell:</p><pre><code>0x00000266 = 614
0x82962C = 0000010 | 0010110 | 0101100 = 35628</code></pre><p>Each varint key tells us that all rows in the subtree behind the accompanying page pointer have a primary key less than or equal to that key. Think of it again as a generalization of binary search trees.</p><p>So what does this mean? If we are looking for a row with a primary key less than or equal to 35628, we should visit page 614. If the key is greater than that but less than or equal to 68023, we should visit page 615. For anything larger, we should follow the right-most pointer to page 1216.</p><p>It&#8217;s giving us directions on how to find the row we are looking for!</p><h2>Looking for user4242</h2><p>Now that we understand interior and leaf pages, let&#8217;s look for the user with <code>Id = 4242</code> to see how this works end to end.</p><p>In the previous section, we looked at the root interior node and found that page 614 holds entries with keys &lt;= 35628, so let&#8217;s look there.</p><p>Page numbers start at 1, so page 614 begins at byte offset (614 - 1) * 4096 = 2510848 in the <code>mydb.db</code> file. It&#8217;s again an interior node, as indicated by the 0th byte. I&#8217;ll just list some of the cell entries, but notice that I can skip most of them. It turns out that SQLite keeps the cells in sorted order, so we can use binary search!</p><pre><code>Page_num, max_key
461, 0000010|0010101|1001101 = 35533
&#8230;
79, 0110000|0111011 = 6203
&#8230;
60, 0100100|0100011 = 4643
&#8230;
58, 0100011|0111011 = 4539
57, 0100010|1010011 = 4435
<strong>56, 0100001|1101011 = 4331</strong>
55, 0100001|0000011 = 4227
&#8230;
4, 123</code></pre><p><code>User4242</code> should be on page 56. Let&#8217;s inspect it and woohoo - it&#8217;s a leaf node! We should be close!</p><p>Notice how the largest entered user is indeed <code>user4331</code> as indicated by the max key stored in the parent interior node.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tozz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2be3a626-ffbd-4f59-a56c-63ebad1a2229_794x569.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tozz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2be3a626-ffbd-4f59-a56c-63ebad1a2229_794x569.png 424w, https://substackcdn.com/image/fetch/$s_!tozz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2be3a626-ffbd-4f59-a56c-63ebad1a2229_794x569.png 848w, https://substackcdn.com/image/fetch/$s_!tozz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2be3a626-ffbd-4f59-a56c-63ebad1a2229_794x569.png 1272w, https://substackcdn.com/image/fetch/$s_!tozz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2be3a626-ffbd-4f59-a56c-63ebad1a2229_794x569.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tozz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2be3a626-ffbd-4f59-a56c-63ebad1a2229_794x569.png" width="794" height="569" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2be3a626-ffbd-4f59-a56c-63ebad1a2229_794x569.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:569,&quot;width&quot;:794,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:247843,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tozz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2be3a626-ffbd-4f59-a56c-63ebad1a2229_794x569.png 424w, https://substackcdn.com/image/fetch/$s_!tozz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2be3a626-ffbd-4f59-a56c-63ebad1a2229_794x569.png 848w, https://substackcdn.com/image/fetch/$s_!tozz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2be3a626-ffbd-4f59-a56c-63ebad1a2229_794x569.png 1272w, https://substackcdn.com/image/fetch/$s_!tozz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2be3a626-ffbd-4f59-a56c-63ebad1a2229_794x569.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Leaf page with largest primary key highlighted</figcaption></figure></div><p>Let&#8217;s go find our <code>user4242</code>. 
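</p><p>The two steps that got us to this leaf, decoding varint keys and picking which child page to follow, can be sketched in a few lines of Python. This is an illustration under assumptions, not SQLite&#8217;s implementation: the function names are hypothetical, and the cells are assumed to sit back to back in the buffer as they do in the hex dump of the root page (a real reader would locate each cell through the page&#8217;s cell pointer array).</p><pre><code>import bisect


def decode_varint(buf, offset=0):
    """Decode one varint; return (value, number_of_bytes_consumed)."""
    value = 0
    for i in range(9):
        b = buf[offset + i]
        if i == 8:
            # A ninth byte, if we get this far, contributes all 8 of its bits.
            return value * 256 + b, 9
        # Top bit says whether more bytes follow, low 7 bits are payload.
        more, low7 = divmod(b, 128)
        value = value * 128 + low7
        if not more:
            return value, i + 1


def parse_interior_cells(buf, count, offset=0):
    """Read `count` cells of (uint32 child page, varint max key)."""
    cells = []
    for _ in range(count):
        child = int.from_bytes(buf[offset:offset + 4], "big")
        max_key, used = decode_varint(buf, offset + 4)
        cells.append((max_key, child))
        offset += 4 + used
    return sorted(cells)  # ascending by max key, ready for binary search


def choose_child(cells, right_child, key):
    """Return the page whose subtree may contain `key`."""
    keys = [max_key for max_key, _ in cells]
    i = bisect.bisect_left(keys, key)  # first cell with max_key not below key
    return right_child if i == len(cells) else cells[i][1]


# The two root-page cells from the hex dump above.
root_cells = parse_interior_cells(bytes.fromhex("000002678493370000026682962c"), 2)
print(choose_child(root_cells, 1216, 4242))
</code></pre><p>With the two root cells from the hex dump, the sketch sends key 4242 to page 614, the same hop we made by hand.</p><p>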
If we scroll a bit down, we finally see the row we wanted!</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LwzT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad7f6648-5267-47d6-b914-73b49f0f4710_799x137.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LwzT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad7f6648-5267-47d6-b914-73b49f0f4710_799x137.png 424w, https://substackcdn.com/image/fetch/$s_!LwzT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad7f6648-5267-47d6-b914-73b49f0f4710_799x137.png 848w, https://substackcdn.com/image/fetch/$s_!LwzT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad7f6648-5267-47d6-b914-73b49f0f4710_799x137.png 1272w, https://substackcdn.com/image/fetch/$s_!LwzT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad7f6648-5267-47d6-b914-73b49f0f4710_799x137.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LwzT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad7f6648-5267-47d6-b914-73b49f0f4710_799x137.png" width="799" height="137" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ad7f6648-5267-47d6-b914-73b49f0f4710_799x137.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:137,&quot;width&quot;:799,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LwzT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad7f6648-5267-47d6-b914-73b49f0f4710_799x137.png 424w, https://substackcdn.com/image/fetch/$s_!LwzT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad7f6648-5267-47d6-b914-73b49f0f4710_799x137.png 848w, https://substackcdn.com/image/fetch/$s_!LwzT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad7f6648-5267-47d6-b914-73b49f0f4710_799x137.png 1272w, https://substackcdn.com/image/fetch/$s_!LwzT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad7f6648-5267-47d6-b914-73b49f0f4710_799x137.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Leaf page cell with user4242 highlighted</figcaption></figure></div><p>This is amazing - we were able to find our row by just visiting 3 pages: the root page, internal page 614, and leaf page 56. This is important because it means SQLite only had to read those 3 pages from the disk to find the row! 
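</p><p>To make the cost of one page access concrete, here is a minimal Python sketch of fetching a page by number. It is an assumption-laden illustration: <code>read_db_header</code> and <code>read_page</code> are hypothetical helper names, <code>f</code> is any file object opened in binary mode, and the header offsets are the ones we looked at earlier (page size in bytes 16-17, page count in bytes 28-31).</p><pre><code>def read_db_header(f):
    """Return (page_size, page_count) from the 100-byte database header."""
    f.seek(0)
    header = f.read(100)
    # Both fields are big-endian. The special page-size value 1
    # (which stands for 65536) is not handled in this sketch.
    page_size = int.from_bytes(header[16:18], "big")
    page_count = int.from_bytes(header[28:32], "big")
    return page_size, page_count


def read_page(f, page_number, page_size):
    """Read exactly one page; page numbers start at 1."""
    f.seek((page_number - 1) * page_size)
    return f.read(page_size)
</code></pre><p>A lookup like ours would call <code>read_page</code> three times: once for the root page, once for page 614, and once for page 56.</p><p>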
Reading a specific page from a file can be implemented efficiently with <code>fseek</code> and reading only <code>PAGE_SIZE</code> bytes.</p><p>Disk IO tends to be the slowest part of a database, especially on spinning hard drives. If we didn&#8217;t have an index (the interior nodes of the B+ tree), we would have to read up to all 1361 pages. Here we used a tiny database of just 100k rows, but this logarithmic scaling property becomes crucial as the database grows.</p><p>If you are wondering how I learned this, I&#8217;ve been implementing a subset of <a href="https://github.com/MichalPitr/db_from_scratch">SQLite from scratch in Go</a>. I started by following <a href="https://cstack.github.io/db_tutorial/">cstack&#8217;s SQLite from scratch blog series</a>. For SQLite implementation details, the official <a href="https://www.sqlite.org/fileformat2.html">SQLite documentation</a> is fantastic.</p><div><hr></div><p>I hope you enjoyed this deep dive into database internals! If you did, consider subscribing and/or following me on <a href="https://www.linkedin.com/in/michal-pitr-a7156b127/">LinkedIn</a>.</p>
]]></content:encoded></item><item><title><![CDATA[Tail Recursion: From Python to OCaml]]></title><description><![CDATA[Practical exploration of TCO]]></description><link>https://michalpitr.substack.com/p/tail-recursion-from-python-to-ocaml-f219ea8ba13a</link><guid isPermaLink="false">https://michalpitr.substack.com/p/tail-recursion-from-python-to-ocaml-f219ea8ba13a</guid><dc:creator><![CDATA[Michal Pitr]]></dc:creator><pubDate>Sun, 08 Oct 2023 19:28:48 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/47610a62-6452-4184-a98e-f21840db277a_800x457.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Gxnm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4a42fcc-c0c7-45fc-b1dd-5f9068d61be9_800x457.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Gxnm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4a42fcc-c0c7-45fc-b1dd-5f9068d61be9_800x457.png 424w, https://substackcdn.com/image/fetch/$s_!Gxnm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4a42fcc-c0c7-45fc-b1dd-5f9068d61be9_800x457.png 848w, 
https://substackcdn.com/image/fetch/$s_!Gxnm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4a42fcc-c0c7-45fc-b1dd-5f9068d61be9_800x457.png 1272w, https://substackcdn.com/image/fetch/$s_!Gxnm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4a42fcc-c0c7-45fc-b1dd-5f9068d61be9_800x457.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Gxnm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4a42fcc-c0c7-45fc-b1dd-5f9068d61be9_800x457.png" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e4a42fcc-c0c7-45fc-b1dd-5f9068d61be9_800x457.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Gxnm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4a42fcc-c0c7-45fc-b1dd-5f9068d61be9_800x457.png 424w, https://substackcdn.com/image/fetch/$s_!Gxnm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4a42fcc-c0c7-45fc-b1dd-5f9068d61be9_800x457.png 848w, 
https://substackcdn.com/image/fetch/$s_!Gxnm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4a42fcc-c0c7-45fc-b1dd-5f9068d61be9_800x457.png 1272w, https://substackcdn.com/image/fetch/$s_!Gxnm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4a42fcc-c0c7-45fc-b1dd-5f9068d61be9_800x457.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a><figcaption class="image-caption">DALL-E 3 thought this image illustrated the article well. It&#8217;s kinda&nbsp;cute.</figcaption></figure></div><p>I&#8217;ve been learning about compilers lately and about how call stacks work. This got me wondering how functional programming languages, where <em>everything</em> is handled recursively, deal with the maximum call stack size.</p><h3>Recursion in&nbsp;Python</h3><p>Let&#8217;s start with something familiar and write a valid, albeit impractical, Python function to sum the numbers from 1 to n.</p><pre><code>def recursive_sum(n: int) -&gt; int:
    if n &lt;= 0:
        return 0
    return n + recursive_sum(n-1)</code></pre><p>It feels <em>weird</em>, but it&#8217;s pretty much exactly how we&#8217;d write this in a functional language like Haskell or OCaml. When I run this with a sufficiently large input (1000 is enough on my machine), I get an error:</p><pre><code><code>RecursionError: maximum recursion depth exceeded.</code></code></pre><p>Every time a function call is made, CPython pushes a new stack frame onto the call stack. Among other things, each frame contains function arguments and local variables. When the function returns, its stack frame is popped from the call stack.</p><p>Here&#8217;s how the pending additions pile up on the call stack for n = 5. Each line adds one frame until the base case is reached, and then the frames return one by one.</p><pre><code>recursive_sum(5)
5 + recursive_sum(4)
5 + (4 + recursive_sum(3))
5 + (4 + (3 + recursive_sum(2)))
5 + (4 + (3 + (2 + recursive_sum(1))))
5 + (4 + (3 + (2 + (1 + recursive_sum(0)))))
5 + (4 + (3 + (2 + (1))))
5 + (4 + (3 + (3)))
5 + (4 + (6))
5 + (10)
15</code></pre><p>So why do we get a <code>RecursionError</code>?</p><p>Python assumes that a function that exceeds the recursion limit (1000 by default in CPython) is incorrect; maybe the user forgot to add a base case. That&#8217;s a good thing, and raising an exception lets users handle it. But in our case, we know that <code>recursive_sum</code> is valid and would terminate if CPython didn&#8217;t stop it.</p><p>The other part is memory usage. Each stack frame takes up memory. It&#8217;s more graceful to let the user handle the exception than for the program to unexpectedly crash because it ran out of memory.</p><p>It turns out there is a way to get pseudo-tail recursion in Python by leveraging exceptions to unwind the stack. Thanks to <a href="https://chrispenner.ca/posts/python-tail-recursion">Chris Penner&#8217;s article</a> for explaining this well.</p><h3>Tail recursion</h3><p>Functional programming languages do something pretty smart for recursive functions whose recursive call is the very last thing they do (a tail call): the current stack frame is reused for the next call.</p><p>First, let&#8217;s rewrite our Python function into a tail-recursive form.</p><pre><code>def tail_recursive_sum(n: int, accumulator: int) -&gt; int:
    if n &lt;= 0:
        return accumulator
    return tail_recursive_sum(n-1, accumulator + n)</code></pre><p>All I did was add an accumulator parameter that carries the running total. On every recursive call, the current <code>n</code> is added to the accumulator, so no work is left to do after the call returns. The initial call passes 0 as the accumulator.</p><p>Let&#8217;s illustrate how this works.</p><pre><code>tail_recursive_sum(5, 0)
| tail_recursive_sum(4, 0 + 5)
| | tail_recursive_sum(3, 5 + 4)
| | | tail_recursive_sum(2, 9 + 3)
| | | | tail_recursive_sum(1, 12 + 2)
| | | | | tail_recursive_sum(0, 14 + 1)
| | | | | 15
| | | | 15
| | | 15
| | 15
| 15
15</code></pre><p>Notice that once the function reaches the base case, it already has the final result; the outer calls just pass it back unchanged. However, Python does not take advantage of this. If we run the function above with n = 1000, we get the familiar <code>maximum recursion depth exceeded.</code></p><p>What we can notice, though, is that the previous stack frames no longer serve any purpose. Compilers that optimize tail recursion take advantage of this fact to reuse the existing stack frame! The call stack then looks something like this:</p><pre><code>tail_recursive_sum(5, 0)
tail_recursive_sum(4, 0 + 5)
tail_recursive_sum(3, 5 + 4)
tail_recursive_sum(2, 9 + 3)
tail_recursive_sum(1, 12 + 2)
tail_recursive_sum(0, 14 + 1)
15</code></pre><p>Next, let&#8217;s see how this works in a language that supports it!</p><h3>Tail recursion in&nbsp;OCaml</h3><p>Here&#8217;s our <code>tail_recursive_sum</code> written in OCaml.</p><p>To explain the syntax, we define a function named <code>tail_recursive_sum</code> with two parameters, <code>n</code> and <code>accumulator</code>. We use the <code>rec</code> keyword to tell the compiler that it&#8217;s a recursive function. Note that there&#8217;s no explicit return statement; the value of the last expression is implicitly returned.</p><pre><code>let rec tail_recursive_sum n accumulator =
    if n &lt;= 0 then accumulator
    else tail_recursive_sum (n-1) (accumulator + n)</code></pre><p>For good measure, let&#8217;s test it with <code>n = 10^9</code> and an initial accumulator of 0, which correctly returns <code>500000000500000000</code>.</p><p>Woohoo! That&#8217;s pretty remarkable. It would work for larger values as well, but we&#8217;d soon need to start worrying about integer overflow.</p><p>There&#8217;s one last loose end to tie up. To verify that this behavior is indeed due to tail recursion optimization and not a fluke, let&#8217;s try using the non-tail recursive form that we used before.</p><pre><code>let rec sum n =
    if n &lt;= 0 then 0
    else n + sum (n-1)</code></pre><p>As we would hope, we start getting <code>Fatal error: exception Stack_overflow </code>errors starting at around <code>n = 10^6.</code> Great!</p><div><hr></div><p>If you enjoyed this shorter write-up, consider subscribing. I write technical deep dives, often about implementing complex software from scratch to properly understand its inner workings.</p>]]></content:encoded></item><item><title><![CDATA[Fast Scanning: Detecting keywords]]></title><description><![CDATA[Hash Tables vs. 
Tries for keyword detection]]></description><link>https://michalpitr.substack.com/p/fast-scanning-detecting-keywords-c58bd64befeb</link><guid isPermaLink="false">https://michalpitr.substack.com/p/fast-scanning-detecting-keywords-c58bd64befeb</guid><dc:creator><![CDATA[Michal Pitr]]></dc:creator><pubDate>Sat, 30 Sep 2023 13:51:33 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/95450be1-f5a0-40de-bdbf-f380bf5eb3ac_133x292.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When scanning a source file, the scanner (or lexer) needs to distinguish reserved keywords, such as <code>var, for, or, and, class</code> from variable names like <code>foo, bar</code>. While working my way through <a href="https://craftinginterpreters.com/">Crafting Interpreters</a>, I stumbled upon a simple optimization that uses tries implemented with a switch statement instead of hashmaps. I wanted to verify if this really is faster.</p><p>I&#8217;ve included relevant code listings below. You can see the full repository here: <a href="https://github.com/MichalPitr/hashmap_vs_trie_c">https://github.com/MichalPitr/hashmap_vs_trie_c</a></p><p>Here&#8217;s the setup. We have a sliding window scanner that reads the source code and emits tokens, which are easier to work with. To illustrate, the source code</p><pre><code><code>var name = "Michael";</code></code></pre><p>would be converted to the following tokens</p><pre><code><code>TOKEN_VAR, TOKEN_IDENTIFIER, TOKEN_EQUAL, TOKEN_STRING, TOKEN_SEMICOLON</code></code></pre><p>Each token would have associated metadata, like start and length, so that we can retrieve the literal string from the source code.</p><p>When the scanner sees a lexeme (a sequence of characters), it decides which token type it is. 
For example, <code>{</code> maps to <code>TOKEN_LEFT_BRACE</code>, and <code>==</code> maps to <code>TOKEN_EQUAL_EQUAL</code>.</p><p>How does it differentiate keywords from identifiers, such as variable names or function names? The rest of this write-up compares two approaches: hash tables and tries.</p><h3>Hash Tables</h3><p>Let&#8217;s start with hash tables (HashMap in Java, dictionary in Python, unordered_map in C++) since they are the intuitive go-to solution. I use a simple hash table with separate chaining, where each bucket is a linked list. When the scanner starts, we can create a hash table where keywords map to their TokenType. Then whenever the scanner finds a lexeme that is either a keyword or an identifier, it looks it up in the hash table.</p><p>The listing below shows a function <code>TokenType identifierTypeUsingHashMap()</code> that returns the TokenType of the current lexeme pointed to by <code>scanner.start</code>. If the lexeme isn&#8217;t a reserved keyword, it returns <code>TOKEN_IDENTIFIER</code>.</p><pre><code>TokenType identifierTypeUsingHashMap() {
    int length = scanner.current - scanner.start;
    TokenType type = hashMapGet(&amp;scanner.keywords, scanner.start, length);
    return type;
}</code></pre><p>Let&#8217;s have a look at <code>TokenType hashMapGet()</code> to understand how it works:</p><ol><li><p>It hashes the key, which requires a full traversal of the string</p></li><li><p>Looks up the corresponding bucket in the hash table</p></li><li><p>Due to possible collision, it checks if the key matches the bucket&#8217;s key. This again requires a full traversal of the string. If there&#8217;s a match, it returns the keyword&#8217;s TokenType.</p></li><li><p>If this is a collision, it checks the next value in the bucket&#8217;s linked list.</p></li><li><p>If it reaches the end of the linked list, we know that the key is an identifier, not a keyword.</p></li></ol><pre><code>typedef struct HashNode {
    char* key;
    int keyLength;
    TokenType value;
    struct HashNode* next;
} HashNode;

typedef struct {
    HashNode* table[HASHMAP_CAPACITY];
} HashMap;

TokenType hashMapGet(HashMap* map, const char* key, int length) {
    uint32_t index = hash(key, length);
    HashNode* node = map-&gt;table[index];
    while (node) {
        if (node-&gt;keyLength == length &amp;&amp; strncmp(node-&gt;key, key, length) == 0) {
            return node-&gt;value; 
        }
        node = node-&gt;next;
    }
    return TOKEN_IDENTIFIER;
}</code></pre><p>This seems reasonably fast, especially if the hash function distributes keys uniformly across the buckets so that chains stay short. We can control this through the load factor (entries/capacity).</p><p>Still, every lookup traverses the string at least once to hash it, and again to compare it on a hit. Consider the lexeme &#8220;formula&#8221;: once the scanner has read &#8220;form&#8221;, it already knows that there is no keyword with the prefix &#8220;form&#8221;. Can we leverage that? That&#8217;s exactly the approach Crafting Interpreters takes.</p><h3>Tries</h3><p>Tries are a type of n-ary tree that can be used to efficiently represent a set of words. Any path from the root to a leaf spells a valid word in the vocabulary. Here&#8217;s an illustration for the following keywords: &#8220;false&#8221;, &#8220;for&#8221;, and &#8220;fun&#8221;.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FUrK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95bb03de-d125-437f-bb60-d920e5149c80_133x292.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FUrK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95bb03de-d125-437f-bb60-d920e5149c80_133x292.png 424w, https://substackcdn.com/image/fetch/$s_!FUrK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95bb03de-d125-437f-bb60-d920e5149c80_133x292.png 848w, https://substackcdn.com/image/fetch/$s_!FUrK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95bb03de-d125-437f-bb60-d920e5149c80_133x292.png 1272w, 
https://substackcdn.com/image/fetch/$s_!FUrK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95bb03de-d125-437f-bb60-d920e5149c80_133x292.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FUrK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95bb03de-d125-437f-bb60-d920e5149c80_133x292.png" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/95bb03de-d125-437f-bb60-d920e5149c80_133x292.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FUrK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95bb03de-d125-437f-bb60-d920e5149c80_133x292.png 424w, https://substackcdn.com/image/fetch/$s_!FUrK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95bb03de-d125-437f-bb60-d920e5149c80_133x292.png 848w, https://substackcdn.com/image/fetch/$s_!FUrK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95bb03de-d125-437f-bb60-d920e5149c80_133x292.png 1272w, 
https://substackcdn.com/image/fetch/$s_!FUrK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95bb03de-d125-437f-bb60-d920e5149c80_133x292.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Trie for keyword&nbsp;lookup</figcaption></figure></div><p>Let&#8217;s represent all keywords with a trie and implement it with a switch-case statement. It&#8217;s pretty simple:</p><ol><li><p>Check if the first letter of the current lexeme is also the first letter of any keyword. This is equivalent to starting from &#8220;Start&#8221; in the tree diagram above. If not, it immediately knows that the lexeme is an identifier without any extra work.</p></li><li><p>If the first character belongs to some keyword, it further checks which keyword it could be or compares the rest of the string if there&#8217;s only one option left.</p></li></ol><pre><code>static TokenType checkKeyword(int start, int length, const char* rest, TokenType type) {
    if (scanner.current - scanner.start == start + length &amp;&amp; 
        memcmp(scanner.start + start, rest, length) == 0) {
        return type;
    } 
    return TOKEN_IDENTIFIER;
}

static TokenType identifierType() {
    switch (scanner.start[0]) {
        case 'a': return checkKeyword(1, 2, "nd", TOKEN_AND);
        case 'c': return checkKeyword(1, 4, "lass", TOKEN_CLASS);
        case 'e': return checkKeyword(1, 3, "lse", TOKEN_ELSE);
        case 'f':
            if (scanner.current - scanner.start &gt; 1) {
                switch (scanner.start[1]) {
                    case 'a': return checkKeyword(2, 3, "lse", TOKEN_FALSE);
                    case 'o': return checkKeyword(2, 1, "r", TOKEN_FOR);
                    case 'u': return checkKeyword(2, 1, "n", TOKEN_FUN);
                }
            }
            break;
        case 'i': return checkKeyword(1, 1, "f", TOKEN_IF);
        case 'n': return checkKeyword(1, 2, "il", TOKEN_NIL);
        case 'o': return checkKeyword(1, 1, "r", TOKEN_OR);
        case 'p': return checkKeyword(1, 4, "rint", TOKEN_PRINT);
        case 'r': return checkKeyword(1, 5, "eturn", TOKEN_RETURN);
        case 's': return checkKeyword(1, 4, "uper", TOKEN_SUPER);
        case 't':
            if (scanner.current - scanner.start &gt; 1) {
                switch(scanner.start[1]) {
                    case 'h': return checkKeyword(2, 2, "is", TOKEN_THIS);
                    case 'r': return checkKeyword(2, 2, "ue", TOKEN_TRUE);
                }
            }
            break;
        case 'v': return checkKeyword(1, 2, "ar", TOKEN_VAR);
        case 'w': return checkKeyword(1, 4, "hile", TOKEN_WHILE);
    }

    return TOKEN_IDENTIFIER;
}</code></pre><p><code>TokenType checkKeyword()</code> first checks that the lexeme has exactly the keyword&#8217;s length and then compares the remaining characters with a call to <code>memcmp</code>.</p><p>This has a nice property: for many identifiers, it can determine that the lexeme is not a keyword by looking at just the first character. In comparison, the hash table approach had to hash the lexeme every time.</p><h3>Glue</h3><p>The driving code doesn&#8217;t do anything useful; it simply loads a source file and scans it. Normally it would hand off the tokens to a parser/compiler, but here I just wanted to focus on the scanner&#8217;s performance.</p><p>It reads the source file into memory and hands it off to the scanner. We time how long it takes the scanner to finish scanning.</p><pre><code>static void runFile(const char* path) {
    char* source = readFile(path);
    initScanner(source);
    Token token;

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &amp;start);

    // Scan the source file. A do-while loop ensures that token is
    // written by advance() before its type is read.
    do {
        advance(&amp;token);
    } while (token.type != TOKEN_EOF);

    clock_gettime(CLOCK_MONOTONIC, &amp;end);
    double elapsed = (end.tv_sec - start.tv_sec) + 
                 ((end.tv_nsec - start.tv_nsec)/1e9);
    printf("%f\n", elapsed);
    free(source);
}

int main(int argc, const char* argv[]) {
    if (argc == 2) {
        runFile(argv[1]);
    } else {
        fprintf(stderr, "Usage: clox [path]\n");
        exit(64);
    }
    
    return 0;
}</code></pre><p>For completeness, the <code>advance()</code> function just scans the next token by calling <code>scanToken()</code>, skipping over any error tokens. <code>scanToken()</code> is another switch-case statement that, given a character (with some lookahead), decides which token it is. <code>identifier()</code> is what decides if a lexeme is a keyword or an identifier!</p><pre><code>Token scanToken() {
    skipWhitespace();
    scanner.start = scanner.current;

    if (isAtEnd()) return makeToken(TOKEN_EOF);

    char c = advance();
    // identifier() uses either the hash map or trie to determine the type
    if (isAlpha(c)) return identifier();
    if (isDigit(c)) return number();

    switch (c) {
        case '(': return makeToken(TOKEN_LEFT_PAREN);
        case ')': return makeToken(TOKEN_RIGHT_PAREN);
        case '{': return makeToken(TOKEN_LEFT_BRACE);
        case '}': return makeToken(TOKEN_RIGHT_BRACE);
        case ';': return makeToken(TOKEN_SEMICOLON);
        case ',': return makeToken(TOKEN_COMMA);
        case '.': return makeToken(TOKEN_DOT);
        case '-': return makeToken(TOKEN_MINUS);
        case '+': return makeToken(TOKEN_PLUS);
        case '*': return makeToken(TOKEN_STAR);
        case '/': return makeToken(TOKEN_SLASH);
        case '!':
            return makeToken(
                match('=') ? TOKEN_BANG_EQUAL : TOKEN_BANG);
        case '=':
            return makeToken(
                match('=') ? TOKEN_EQUAL_EQUAL : TOKEN_EQUAL);
        case '&lt;':
            return makeToken(
                match('=') ? TOKEN_LESS_EQUAL : TOKEN_LESS);
        case '&gt;':
            return makeToken(
                match('=') ? TOKEN_GREATER_EQUAL : TOKEN_GREATER);
        case '"': return string();
    }

    return errorToken("Unexpected character.");
}

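// Token-level advance(). C has no function overloading, so this cannot share
// a file with the scanner's character-level advance() called in scanToken().
static void advance(Token* token) {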
    for (;;) {
        *token = scanToken();
        if (token-&gt;type != TOKEN_ERROR) break;
    }
}</code></pre><h3>Benchmark</h3><p>I compile the scanner with <code>gcc -O3</code> to enable optimizations. I then run the scanner 500 times against each of several source files of varying sizes to measure its performance. I recompile and repeat the runs once for the hashmap implementation and once for the trie implementation.</p><p>Since we are just scanning, I composed a ~1k line source file from different unit tests. I then created several larger versions of this source file by concatenating it <code>size</code> times, e.g. <code>size</code> 128 corresponds to ~128k lines of code.</p><pre><code>import pickle
import statistics
import subprocess
from collections import defaultdict


def execute_command(command):
    result = subprocess.run(command, stdout=subprocess.PIPE)
    return float(result.stdout.strip())

def main():
    results = defaultdict(list)

    for size in [1, 2, 4, 8, 16, 32, 64, 128]:
        for _ in range(500):
            command = ["./a.out", f"code{size}.clox"]
            output = execute_command(command)
            results[size].append(output)

        print(f"size {size}")
        mean = statistics.mean(results[size])
        std_dev = statistics.stdev(results[size])
        print(f"Mean: {mean}")
        print(f"Standard Deviation: {std_dev}")
        print("")

    with open("results_hashmap.pkl", "wb") as file:
        pickle.dump(results, file)
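
# Hedged sketch, not part of the original benchmark: one possible way to
# compare two pickled runs and get the relative speedup per input size.
# The function name and its parameters are assumptions made for illustration;
# only "results_hashmap.pkl" is a file name that appears in this script.
def relative_speedup(baseline_path, candidate_path):
    # Imported locally so this helper works even when pasted on its own.
    import pickle
    import statistics

    with open(baseline_path, "rb") as file:
        baseline = pickle.load(file)
    with open(candidate_path, "rb") as file:
        candidate = pickle.load(file)

    # For every size, the fraction of the baseline's mean time that the
    # candidate saves, e.g. 0.3 means the candidate is 30% faster.
    return {
        size: 1 - statistics.mean(candidate[size]) / statistics.mean(baseline[size])
        for size in baseline
    }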

if __name__ == "__main__":
    main()</code></pre><h3>Results</h3><p>The plot below shows the results. Vertical bars indicate the standard deviation.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kwyg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee3bb6e9-5e62-443a-b7f3-c3b1786852da_800x507.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kwyg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee3bb6e9-5e62-443a-b7f3-c3b1786852da_800x507.png 424w, https://substackcdn.com/image/fetch/$s_!kwyg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee3bb6e9-5e62-443a-b7f3-c3b1786852da_800x507.png 848w, https://substackcdn.com/image/fetch/$s_!kwyg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee3bb6e9-5e62-443a-b7f3-c3b1786852da_800x507.png 1272w, https://substackcdn.com/image/fetch/$s_!kwyg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee3bb6e9-5e62-443a-b7f3-c3b1786852da_800x507.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kwyg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee3bb6e9-5e62-443a-b7f3-c3b1786852da_800x507.png" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ee3bb6e9-5e62-443a-b7f3-c3b1786852da_800x507.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kwyg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee3bb6e9-5e62-443a-b7f3-c3b1786852da_800x507.png 424w, https://substackcdn.com/image/fetch/$s_!kwyg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee3bb6e9-5e62-443a-b7f3-c3b1786852da_800x507.png 848w, https://substackcdn.com/image/fetch/$s_!kwyg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee3bb6e9-5e62-443a-b7f3-c3b1786852da_800x507.png 1272w, https://substackcdn.com/image/fetch/$s_!kwyg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee3bb6e9-5e62-443a-b7f3-c3b1786852da_800x507.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>As expected, both approaches scale linearly with the size of the source file.&nbsp;<br>Tries are on average faster by about 30%. 
That&#8217;s a sizeable difference, especially since modern IDEs re-run the scanner and parser whenever the user makes a change, in order to provide syntax highlighting and type hints.</p><p>In the test source file, I used identifier names of fairly typical length. If I had used very long ones, the trie implementation would show even better results, since it can reject a non-keyword at the first mismatching character instead of traversing the full string, whereas hashing has to read every character.</p><p>Since the trie implementation is so simple, there&#8217;s not much of a reason not to use it.</p><div><hr></div><p>Thanks for reading! If you liked this, consider subscribing and/or following me on <a href="https://www.linkedin.com/in/michal-pitr-a7156b127/">LinkedIn</a>. I like to understand the inner workings of software systems. Whenever I learn something interesting, I compile it into one of these blog posts!</p><p>Want to read another one? Consider checking out one of my other posts!</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;3c58e4d0-de42-4e1e-a99a-c54f3ee01626&quot;,&quot;caption&quot;:&quot;One of the great joys of software engineering is dispelling magic. 
I&#8217;ve written code that executed on a GPU using frameworks like PyTorch or TensorFlow, but I never understood the &#8220;how&#8221;. It&#8217;s time to dispel the magic of GPU programming and learn how it works under the hood.&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;GPU Programming&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:28235731,&quot;name&quot;:&quot;Michal Pitr&quot;,&quot;bio&quot;:&quot;I write deep dives into software engineering topics.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/73a6e774-038e-424d-afee-ce5041c3e7e0_1080x1080.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-05-04T19:25:04.223Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60744dcf-3477-4aa5-9cdb-f020a0ba874c_2472x888.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://michalpitr.substack.com/p/gpu-programming&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:144305968,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:3,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Michal&#8217;s Substack&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe5ab36b-c782-420e-ac35-c7599a1f77ad_976x976.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;ab6e9865-ea36-4389-ba2c-5cd0725f7559&quot;,&quot;caption&quot;:&quot;Over the last couple of 
weeks, I&#8217;ve been building MapReduce from scratch.&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;MapReduce from Scratch&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:28235731,&quot;name&quot;:&quot;Michal Pitr&quot;,&quot;bio&quot;:&quot;I write deep dives into software engineering topics.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/73a6e774-038e-424d-afee-ce5041c3e7e0_1080x1080.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-04-28T21:33:39.100Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1d3a7a5-e2cc-4778-bafe-03f87e6a6884_1600x671.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://michalpitr.substack.com/p/mapreduce-from-scratch&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:144104758,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:10,&quot;comment_count&quot;:4,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Michal&#8217;s Substack&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe5ab36b-c782-420e-ac35-c7599a1f77ad_976x976.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;2bcefcef-8ae0-49e4-963a-1f56a8e46ee3&quot;,&quot;caption&quot;:&quot;I like to keep things practical. Let&#8217;s train a simple neural network, save the model, and write an inference engine that can execute inputs against the model. 
Sounds like a fun time to me!&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Build Your Own Inference Engine: From Scratch to \&quot;7\&quot;&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:28235731,&quot;name&quot;:&quot;Michal Pitr&quot;,&quot;bio&quot;:&quot;I write deep dives into software engineering topics.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/73a6e774-038e-424d-afee-ce5041c3e7e0_1080x1080.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-08-04T15:27:57.810Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e6e7b0d-48b1-4dc0-886c-96b2d844f2dc_1340x1106.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://michalpitr.substack.com/p/build-your-own-inference-engine-from&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:147338023,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:8,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Michal&#8217;s Substack&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe5ab36b-c782-420e-ac35-c7599a1f77ad_976x976.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;3fd9a84d-ea1a-438d-b68f-eb988656ff35&quot;,&quot;caption&quot;:&quot;Recently I&#8217;ve been implementing a subset of SQLite (the world&#8217;s most used database, btw) from scratch in Go. 
I&#8217;ll share what I&#8217;ve learned about how SQLite stores data on disk which will help us understand key database concepts. Thanks for reading Michal&#8217;s Substack! Subscribe for free to receive new posts and support my work.&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;How does SQLite store data?&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:28235731,&quot;name&quot;:&quot;Michal Pitr&quot;,&quot;bio&quot;:&quot;I write deep dives into software engineering topics.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/73a6e774-038e-424d-afee-ce5041c3e7e0_1080x1080.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-03-17T16:50:48.826Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc850055-defb-4bdb-8da8-779e45b482f5_997x387.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://michalpitr.substack.com/p/how-does-sqlite-store-data&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:142692526,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:16,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Michal&#8217;s Substack&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe5ab36b-c782-420e-ac35-c7599a1f77ad_976x976.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div>]]></content:encoded></item></channel></rss>