Adding two integers in CPython

Ignoring cheap operations (arithmetic, logical, bitwise) and some not-so-cheap operations (branching), this is the operational cost of adding two integers in CPython 2.x, where a plain int is still a PyIntObject. (A condensed code sketch of the hot paths follows the list.)

  1. 1 (hot) indirect read: read bytecode buffer. (NEXTOP macro)

  2. 1 jump-table jump: instruction selection (BINARY_ADD in this case).

  3. 2 (hot) indirect reads: pop the evaluation stack. (POP macro, TOP macro)

  4. 2 indirect reads: RTTI check for int type. (PyInt_CheckExact macro)

  5. 2 indirect reads: extract int value. (PyInt_AS_LONG macro)

  6. 1 unsigned long add: the actual addition (maybe).

  7. 1 overflow check. Four paths:

    1. “Fast” path: (cached int object)

      1. 1 range check: determine if value is cached.

      2. 1 (warm) global read: read the cached object.

      3. 1 indirect increment: reference count.

    2. “Slow” path: (not cached; int object ‘free list’ not exhausted)

      1. 1 global read: use free list.

      2. 1 global write: advance free list.

      3. 1 indirect write: set object type. (PyObject_INIT macro)

      4. 1 indirect write: initialize reference count. (_Py_NewReference macro)

      5. 1 indirect write: set integer value.

    3. “Slower” path: (not cached; free list exhausted / not allocated)

      1. 1 memory allocation: allocate new PyIntObject free list.

      2. 1 indirect write: chain new free list to current (full) free list.

      3. 1 global write: set new free list as current free list.

      4. 1000 (hot) indirect writes: link PyIntObjects to each other.

      5. Go to “Slow” path.

    4. “Slowest” path: (addition overflow)

      1. 2 indirect reads: RTTI check.

      2. 1 double-indirect read: read number method table.

      3. 1 triple-indirect read: read addition method from table. (NB_BINOP)

      4. 1 quadruple-indirect call: go to int addition. (int_add)

      5. 2 indirect reads: RTTI check for int type. (CONVERT_TO_LONG)

      6. 2 indirect reads: extract int value.

      7. 1 unsigned long add: the actual addition (again).

      8. 1 overflow check (again).

      9. 1 global read: read PyLong_Type number method table.

      10. 1 indirect read: read long addition method.

      11. 1 double-indirect call: go to long addition. (long_add)

      12. Convert ints to longs: (CONVERT_BINOP macro)

        1. 2 indirect reads: RTTI check for int type.

        2. 2 indirect reads: extract int value (third time). (PyInt_AS_LONG)

        3. Allocate temporary long objects: (in PyLong_FromLong)

          1. Quite extensive, not going into detail.

      13. 1 object allocation: for addition result.

        1. Quite extensive, not going into detail.

      14. 4* indirect reads & writes: arbitrary-precision addition.

      15. 2 indirect decrements: refcount for temporary longs. (Py_DECREF)

      16. 2 indirect function calls: object deallocation. (_Py_Dealloc)

        1. Quite extensive, not going into detail.

  8. 2 indirect decrements: refcount for popped stack values. (Py_DECREF)

  9. (Possible) 2 indirect function calls: object deallocation. (_Py_Dealloc)

    1. (Returns int to free list)

  10. 1 (hot) indirect write: set top of evaluation stack. (SET_TOP macro)
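
To make the above concrete, here is a heavily condensed, self-contained model of steps 4-7, written as standalone code in the spirit of CPython 2's ceval.c and intobject.c. It is a sketch, not the actual source: the names, the cache bounds, and the omitted “slower”/“slowest” paths are simplifications.

#include <cstdio>

namespace model {

// Stand-in for PyIntObject: a refcount, the value, and a free-list link.
// (Real CPython reuses another field as the link; kept separate here for clarity.)
struct IntObj { long refcnt; long value; IntObj* next_free; };

// Step 7.1's cache: CPython 2 preallocates the small ints (roughly [-5, 257)).
constexpr int NSMALLNEG = 5, NSMALLPOS = 257;
IntObj small_ints[NSMALLNEG + NSMALLPOS];
IntObj* free_list = nullptr;   // step 7.2's free list (left empty in this model)

void init_small_ints() {       // done once at interpreter startup in CPython
    for (int i = 0; i < NSMALLNEG + NSMALLPOS; ++i)
        small_ints[i] = { 1, i - NSMALLNEG, nullptr };
}

IntObj* from_long(long v) {
    if (v >= -NSMALLNEG && v < NSMALLPOS) {      // 7.1 "fast": range check,
        IntObj* o = &small_ints[v + NSMALLNEG];  //   global read of the cached object,
        ++o->refcnt;                             //   indirect refcount increment
        return o;
    }
    if (free_list) {                             // 7.2 "slow": pop the free list,
        IntObj* o = free_list;                   //   then initialize the object
        free_list = o->next_free;
        o->refcnt = 1;
        o->value  = v;
        return o;
    }
    return nullptr;  // 7.3 "slower": allocate and link a fresh block (omitted)
}

IntObj* binary_add(IntObj* a, IntObj* b) {
    long x = a->value;                           // steps 4-5: type checks and value
    long y = b->value;                           //   extraction, greatly abridged
    long r = (long)((unsigned long)x + (unsigned long)y);  // step 6: unsigned add
    if (((r ^ x) < 0) && ((r ^ y) < 0))          // step 7: overflow check
        return nullptr;  // 7.4 "slowest": promote to arbitrary precision (omitted)
    return from_long(r);
}

} // namespace model

int main() {
    model::init_small_ints();
    model::IntObj a{1, 40, nullptr}, b{1, 2, nullptr};
    model::IntObj* r = model::binary_add(&a, &b);
    std::printf("%ld\n", r->value);   // 42, served from the small-int cache
}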

Rotate a Square Matrix

(Referring to this leetcode problem.)

Pictorially, my approach is as follows:

Given an NxN (N=6 here) matrix, number its four corners:

1 . . . . 2
. . . . . .
. . . . . .
. . . . . .
. . . . . .
4 . . . . 3

Rotate elements at positions 1, 2, 3, 4 clockwise:

4 . . . . 1
. . . . . .
. . . . . .
. . . . . .
. . . . . .
3 . . . . 2

Next, number the elements immediately clockwise of each corner:

# 1 . . . #
. . . . . 2
. . . . . .
. . . . . .
4 . . . . .
# . . . 3 #

(Elements already dealt with are marked with #.)

Repeat the previous steps N-1 times (in total):

# # 1 . . #        # # # 1 . #
. . . . . #        . . . . . #
. . . . . 2  --->  4 . . . . #
4 . . . . .  --->  # . . . . 2
# . . . . .        # . . . . .
# . . 3 # #        # . 3 # # #
                //
              //
             LL 
# # # # 1 #        # # # # # #
4 . . . . #        # . . . . #
# . . . . #  --->  # . . . . #
# . . . . #  --->  # . . . . #
# . . . . 2        # . . . . #
# 3 # # # #        # # # # # #

Notice how we moved through the ‘outer layer’ of the matrix in a ‘spiral’ shape, rotating its elements as we went.

By applying the same procedure to the inner layers, we arrive at a fully rotated matrix.

The code follows:

#include <tuple>
#include <vector>

using namespace std;

struct Pos { int i, j; };

// declared up front so rotate() can call them before their definitions below
void rotateLayer(vector<vector<int>>& m, int layer, int size, int step);
int& at(vector<vector<int>>& m, Pos p);

void rotate(vector<vector<int>>& m) {
  // for every k-th layer:
  for (int k = 0; k < m.size() / 2; ++k) {
    // the k-th layer is the border of an n x n submatrix:
    int n = m.size() - k * 2;
    // apply our 'spiral' rotating procedure n-1 times:
    for (int s = 0; s < n - 1; ++s) {
      rotateLayer(m, k, n, s);
    }
  }
}

void rotateLayer(vector<vector<int>> &m, int layer, int size, int step) {
  int k = layer;
  int n = size;
  int s = step;

  // top-left corner,  moving right   (j += s)
  Pos p1 { k,             k + s };
  // top-right corner, moving down    (i += s)
  Pos p2 { k + s,         k + n - 1 };
  // bottom-right corner, moving left (j -= s)
  Pos p3 { k + n - 1,     k + n - 1 - s };
  // bottom-left corner,  moving up   (i -= s)
  Pos p4 { k + n - 1 - s, k };
  
  // swap clockwise
  tie(at(m, p1), at(m, p2), at(m, p3), at(m, p4))
    = make_tuple(at(m, p4), at(m, p1), at(m, p2), at(m, p3));
}

int& at(vector<vector<int>>& m, Pos p) {
  return m[p.i][p.j];
}
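
For a quick sanity check, here is a usage sketch (mine, not part of the original solution), assuming it is compiled together with the functions above:

#include <cstdio>

int main() {
    std::vector<std::vector<int>> m {
        {1, 2, 3},
        {4, 5, 6},
        {7, 8, 9},
    };
    rotate(m);
    // expected clockwise rotation:
    //   7 4 1
    //   8 5 2
    //   9 6 3
    for (const auto& row : m) {
        for (int x : row) std::printf("%d ", x);
        std::printf("\n");
    }
}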

Note that this algorithm is not exactly cache-friendly (:

RE: The long arrow operator in C++

I came across Ivan Čukić’s post on the long arrow operator for C++, where he jokingly proposes the use of --->, ----->, etc. for dereferencing nested pointer-like structures.

Like most jokes, there’s a nugget of truth behind it, and I decided to tackle the problem of nested structures with my own crazy C++.

My take allows the user to write the following code:

auto hello1(wrap<vec2> w) {
    return w->x + w->y;
}

auto hello2(wrap<wrap<vec2>> w) {
    return w->x + w->y;
}

auto hello3(wrap<wrap<wrap<vec2>>> w) {
    return w->x + w->y;
}

// ... etc

Essentially, the -> collapses an arbitrary number of wrap applications into a single one. But how?

Here’s how.

template <typename T>
struct wrap {
    auto operator->() {
        if constexpr (is_wrap_v<T>) {  // impl of is_wrap_v below
            return value.operator->();
        } else {
            return &value;
        }
    }

    T value;
};

We make use of C++17’s if constexpr feature to statically determine whether we’ve reached a “pure” value or we need to continue unwrapping.
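
The snippet above leans on is_wrap_v, which the excerpt only mentions in a comment. One way it could be written is the usual partial-specialization trick; this is a sketch, and the linked full code may differ. Note that it needs to be declared before wrap’s operator-> uses it, hence the forward declaration here:

#include <type_traits>

template <typename T>
struct wrap;   // forward declaration of the template above

// Primary template: an arbitrary T is not a wrap.
template <typename T>
struct is_wrap : std::false_type {};

// Partial specialization: any wrap<U> is a wrap.
template <typename U>
struct is_wrap<wrap<U>> : std::true_type {};

template <typename T>
inline constexpr bool is_wrap_v = is_wrap<T>::value;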

The full code can be found here, and thanks to the amazing Compiler Explorer you can try it for yourself here.

One possible generalization is to replace is_wrap_v<T> with a has_operator_arrow_v<T> which returns true if T provides operator->.
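
For the record, here is one possible sketch of such a trait using the detection idiom (the name and exact formulation are mine; it only detects a member operator->):

#include <type_traits>
#include <utility>

// has_operator_arrow_v<T> is true when declval<T&>().operator->() is well-formed.
template <typename T, typename = void>
struct has_operator_arrow : std::false_type {};

template <typename T>
struct has_operator_arrow<T, std::void_t<decltype(std::declval<T&>().operator->())>>
    : std::true_type {};

template <typename T>
inline constexpr bool has_operator_arrow_v = has_operator_arrow<T>::value;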

Modern C++ is fun!

Pascal's triangle in Haskell

(My answer to this Stack Overflow question.)

Start out with the triangle itself:

     1
    1 1
   1 2 1
  1 3 3 1
 1 4 6 4 1
    ...

You should notice that to write down the next row, you must apply this rule: sum the previous row’s adjacent elements, using a 0 for the lonely edge elements. Visually:

    0   1   0
     \+/ \+/
  0   1   1   0
   \+/ \+/ \+/
0   1   2   1   0
 \+/ \+/ \+/ \+/
  1   3   3   1
       ...

Operationally, that looks like this:

For row 0:
[1]  (it's a given; i.e. base case)

For row 1:
[0, 1]   <- row 0 with a zero prepended ([0] ++ row 0)
 +  +
[1, 0]   <- row 0 with a zero appended  (row 0 ++ [0])
 =  =
[1, 1]   <- element-wise addition

For row 2:
[0, 1, 1]
 +  +  +
[1, 1, 0]
 =  =  =
[1, 2, 1]

Generally, for row N:

element-wise addition of:
  [0] ++ row(N-1)
  row(N-1) ++ [0]

Remember that element-wise addition of lists in Haskell is zipWith (+).

Thus we arrive at the following Haskell definition:

pascal 0 = [1]
pascal n = zipWith (+) ([0] ++ pascal (n-1)) (pascal (n-1) ++ [0])

Or in a fashion similar to the famous “lazy fibs”:

pascals = [1] : map (\xs -> zipWith (+) ([0] ++ xs) (xs ++ [0])) pascals
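
(Being an infinite list, pascals is meant to be consumed lazily; for example, take 5 pascals yields the five rows shown at the top.)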

Binary search on the integer number line

Consider a predicate p(x) such that:

  • forall x in [1..k-1]: p(x)
  • forall x in [k..n]: !p(x)

In other words, [1..n] is partitioned by p(x) at k.

In problems such as this one (First Bad Version), we are asked to implement an algorithm analogous to git’s bisect command. Based on the problem statement, it is straightforward to model it as the above partitioning scheme.

Then there are problems, such as this other one (Arranging Coins), where the connection to partitioning is not as obvious, but it’s still there.

A naive approach to finding k is to iterate over [1..n] sequentially and stop when p(x) no longer holds. The time complexity of this algorithm is O(n) (ignoring the cost of computing p(x)).
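
In code, the naive scan might look something like this (a sketch; the name and bounds convention are mine):

// Returns the first x in [1..n] for which p(x) is false, or n+1 if none is.
template <typename Pred>
int linear_partition_point(int n, Pred p) {
    int x = 1;
    while (x <= n && p(x))   // up to n evaluations of p
        ++x;
    return x;
}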

A more efficient approach is to perform a binary search over [1..n]. This brings the time complexity down to O(lg n). Visually:

     [xxxxxxxxxxooo]
     /     /\      \
    /     /  \      \
   /     /    \      \
a:[xxxxxx]  b:[xxxxooo]

p(x) holds for a's last element.
This means that the partition point (k) is not in this range.
We are safe to discard range `a`, halving our search space.

The algorithm works and it’s efficient, but implementing binary search by hand can be quite a hassle. Playing around with indices is a pain, and it’s very easy to forget when to +1 something and when not to. I can rarely get it right on the first try.
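
For reference, a hand-rolled version over the integer range might look something like this (again a sketch; the helper name and the half-open bounds convention are mine):

// Returns the smallest x in [lo, hi) with !p(x), or hi if p holds everywhere.
// Assumes p is true on some prefix of [lo, hi) and false on the rest.
template <typename Pred>
long long int_partition_point(long long lo, long long hi, Pred p) {
    while (lo < hi) {
        long long mid = lo + (hi - lo) / 2;   // avoids overflowing lo + hi
        if (p(mid))
            lo = mid + 1;   // the partition point is strictly to the right of mid
        else
            hi = mid;       // mid itself may be the partition point
    }
    return lo;
}

Exactly the kind of +1 bookkeeping described above.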

When implementing something ourselves is too costly, the first place to look is a library. Thankfully, binary search is a common enough algorithm that it’s no surprise to find an implementation in the standard library of our language of choice.

For example, C++’s algorithm library provides the lower_bound operation with the following signature:

template< class ForwardIt, class T >
ForwardIt lower_bound( ForwardIt first, ForwardIt last, const T& value );

// -- or --

template< class ForwardIt, class T, class Compare >
ForwardIt lower_bound( ForwardIt first, ForwardIt last, const T& value, Compare comp );

However, that’s not quite the droid we’re looking for. The function assumes that we know the element we’re looking for and only tries to find its position (roughly) in the range. Although we could jump through some hoops to make lower_bound work, C++11 has sneaked in a function that better suits our needs: partition_point. Take a look at its signature and description:

template< class ForwardIt, class UnaryPredicate >
ForwardIt partition_point( ForwardIt first, ForwardIt last, UnaryPredicate p );

Examines the partitioned (as if by std::partition) range [first, last) and locates the end of the first partition, that is, the first element that does not satisfy p or last if all elements satisfy p.

That’s exactly what we need… or is it?

If we decide to reify the [1..n] range into some collection (e.g. a std::vector<int>), then yes, it is. But that’s an obvious waste of space.
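
For illustration, the reified approach might look like this (a sketch; pred stands for whichever predicate the problem defines):

#include <algorithm>
#include <numeric>
#include <vector>

template <typename Pred>
int reified_partition_point(int n, Pred pred) {
    std::vector<int> xs(n);
    std::iota(xs.begin(), xs.end(), 1);    // materialize [1..n]: O(n) extra space
    auto it = std::partition_point(xs.begin(), xs.end(), pred);
    return it == xs.end() ? n + 1 : *it;   // first x with !pred(x)
}

It works, but it spends linear memory just to hand iterators to the algorithm.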

Can we do away with reifying the range? Yes we can, but it’s not pretty. We need to implement RandomAccessIterator by essentially lifting int into its interface:

#include <iterator>   // for std::iterator_traits and random_access_iterator_tag

struct number_iter {
    int val;
    bool operator ==(number_iter rhs) {
        return val == rhs.val;
    }
    bool operator !=(number_iter rhs) {
        return val != rhs.val;
    }
    number_iter& operator++() {
        ++val;
        return *this;
    }
    int operator*() {
        return val;
    }
    int operator-(number_iter rhs) {
        return val - rhs.val;
    }
    number_iter& operator+=(int x) {
        val += x;
        return *this;
    }
};

// This is the tricky part:

namespace std {
    template <>
    struct iterator_traits<number_iter> {
        using difference_type   = int;
        // the remaining member types are expected of a full specialization
        using value_type        = int;
        using pointer           = int*;
        using reference         = int;
        using iterator_category = random_access_iterator_tag;
    };
}

Now we can solve First Bad Version like so:

// search x in [0, n); pred(x) means "version x+1 is still good"
auto pred = [](int x) { return !isBadVersion(x+1); };
// partition_point finds the first x where pred fails, i.e. (first bad version) - 1
auto it = std::partition_point(number_iter{0}, number_iter{n}, pred);
return *it + 1;

And similarly with Arranging Coins:

if (n == 1) return 1;
// pred(x): x complete rows need x*(x+1)/2 coins, which must still fit in n
auto pred = [n](long long x) { return x*(x+1)/2 <= n; };
// the answer is one less than the first row count that no longer fits
auto it = std::partition_point(number_iter{0}, number_iter{n}, pred);
return *it - 1;

Okay, okay. I agree that this approach is suboptimal in terms of implementability. You could argue that it’s harder to implement than binary search. But, hey, at least we learned a couple of things about the STL.

Don’t forget to carefully scan through the C++ algorithm library.

Edit (2017-07-16):

Yet another problem: Single Element in a Sorted Array.

int singleNonDuplicate(vector<int>& nums) {
    int n = nums.size();
    // before the single element, pairs line up as (2i, 2i+1); after it, they don't
    auto pred = [&nums] (int i) { return nums[2*i] == nums[2*i+1]; };
    auto it = std::partition_point(number_iter{0}, number_iter{n/2}, pred);
    // the single element sits at the first misaligned pair
    return nums[2 * (*it)];
}