I also realised that we can compact the TID array in step (b) above because we only need to store 17 bits for block numbers (we already know which 1GB segment they belong to). Given that usable offsets are also just 13 bits, TID array needs only 4 bytes per TID instead of 6.
Actually this seems like a clear savings of at least 30% for all use cases, at the cost of allocating in smaller chunks and doing some transformations. But given the problem we are trying to solve, seems like a small price to pay.