# Mastering binary and bitwise in PHP

I recently caught myself working on different projects that required me to rely heavily on bitwise operations in PHP. From reading binary files to emulating processors, this is a very useful knowledge to have and a very cool one too.

PHP has many tools to support you with manipulating binary data, but I must warn you from the beginning: if you're seeking ultra low-level efficiency, this isn't the language for you.

Bare with me, though! **In this post I'll show you very valuable things
about bitwise operations, binary and hexadecimal handling that will be
useful for you in ANY language.**

This article grew quit a bit, so I'll leave here a quick summary so you can easily navigate to the sections you'd like.

- Why PHP might not be the best candidate
- Quick introduction to binary and hexadecimal data representations
- Carry operations
- Data representation in computer memory
- Arithmetic Overflows
- Binary numbers and strings in PHP
- Binary: Integers or Strings, which to use in PHP?
- Debugging binary values in PHP
- Visualizing binary strings
- Bitwise Operations
- What is a bitmask
- Normalizing integers
- Conclusion and examples

## Why PHP might not be the best candidate

Look. I love PHP, ok? Don't get me wrong. And I'm sure it will handle gracefully many more cases than you can imagine. But in cases where you need to be very efficient while handling binary data, PHP simply won't do the job.

Just to be clear: I'm not talking about how an application might consume 5 or 10mb more, I'm talking about allocating the exact amount of memory necessary to hold a certain data type.

According to the official documentation on integers , PHP represents decimals as well as hexadecimals, octals and binaries with the type integer. So it doesn't really matter what data you put in there, it will always be an integer.

You probably heard of *ZVAL* before, it is this C struct that represents
every PHP variable. It has
a field to represent all integers called zend_long.
As you can see, zend_long is of type *lval* which has a platform-dependent
size: On 64-bit platforms
it will be represented as a 64bit integer,
while 32-bit platforms represent it as a 32bit integer.

```
# zval stores every integer as a lval
typedef union _zend_value {
zend_long lval;
// ...
} zend_value;
# lval is a 32 or 64-bit integer
#ifdef ZEND_ENABLE_ZVAL_LONG64
typedef int64_t zend_long;
// ...
#else
typedef int32_t zend_long;
// ...
#endif
```

Bottomline is: doesn't matter if you need to store `0xff`

, `0xffff`

,
`0xffffff`

or whatever. They will all be stored as long (*lval*) with
32 or 64 bits in PHP.

I recently played around, for example, with microcontrollers emulation. And while handling memory and operations properly is a must, I didn't really need so much memory efficiency there because my host machine compensates it in orders of magnitude.

Of course everything changes when you talk about C Extensions or FFI, but that's not my point. I'm talking about pure PHP.

So keep this in mind: it works and it can achieve all behaviour you'd like it to achieve, but types won't fit efficiently in most cases.

## Quick introduction to binary and hexadecimal data representations

Look, before we talk about how PHP handles binary data we must detour a little and talk about binary stuff first. If you think you already know everything you need about this, just jump to the Binary numbers and strings in PHP section.

There's this thing in math called "base". It defines how we may represent
quantities in different formats. Us, humans, normally use the decimal
base (base 10) which allows any number to be represented with the digits
`0`

, `1`

, `2`

, `3`

, `4`

, `5`

, `6`

, `7`

, `8`

and `9`

.

To make our next examples clearer I will call the number "20" as "decimal 20".

Binary numbers (base 2) can represent any number, but using only two
distinct digits: `0`

and `1`

.

The decimal 20 when represented in binary form, can be seen as 0b000**10100**.
Do not worry about converting it, let the machines do this job 😉

Hexadecimal numbers (base 16) can represent any number and, to do so, it uses
not only the ten digits `0`

, `1`

, `2`

, `3`

, `4`

, `5`

, `6`

, `7`

, `8`

and `9`

but
also additional six characters borrowed from the latin alphabet: `a`

, `b`

, `c`

,
`d`

, `e`

and `f`

.

The same decimal 20 is represented in hexadecimal as the number `0x14`

. Again,
don't try to convert this to decimal in your head, our computers are experts in it!

**What is important for you to understand is that numbers can be represented in
different bases:** binary (base 2), octal (base 8), decimal (base 10, our common
base) and hexadecimal (base 16).

In PHP and many other languages, **binary numbers** are written as any other numbers
but with a **prefix 0b**, like the decimal 20 represented as **0b**00010100. **Hexadecimal**
numbers receive a **prefix 0x**, like the decimal 20 represented as **0x**14.

As you might have heard already, computers don't store literal data. They represent everything as binary numbers instead: 0s and 1s. Characters, numbers, symbols, instructions... everything is represented using base 2. Characters are just a convention of number sequences: the character 'a', for example, is the number 97 in the ASCII table.

Even though everything is stored as binary, the most convenient way for programmers to read this data is using hexadecimals. They just look good. I mean, look at this!

```
# string "abc"
'abc'
# binary form (bleh)
0b01100001 0b01100010 0b01100011
# hexadecimal form (such wow)
0x61 0x62 0x63
```

While binary takes up lots of visual space, hexadecimals are very neat to represent binary data. That's why we normally stick with them when doing low-level programming.

## Carry operations

You're already familiar with the concept of Carry, but I need you to pay attention to it so we can use it with different bases.

With the decimal set we have ten distinct digits to represent numbers from zero (0) to nine (9). But whenever we try to represent numbers bigger than 9 we run out of digits! Thus a Carry operation happens: we prefix our number with the digit one (1) and reset the right digit to zero (0).

```
# decimal (base 10)
1 + 1 = 2
2 + 2 = 4
9 + 1 = 10 // <- Carry
```

The binary base will have similar behaviour, but is limited to digits 0 and 1.

```
# binary (base 2)
0 + 0 = 0
0 + 1 = 1
1 + 1 = 10 // <- Carry
1 + 10 = 11
```

The same happens with hexadecimal base, but with a much wider range.

```
# hexadecimal (base 16)
1 + 9 = a // no carry, a is in range
1 + a = b
1 + f = 10 // <- Carry
1 + 10 = 11
```

As you realized, carry operations demand more digits to represent a certain number. This allows you to understand how certain data types are limited and, as they're stored in computers, their limitation is represented in binary form.

## Data representation in computer memory

As I mentioned before, computers store everything using binary format. So only 0s and 1s are effectively stored.

The easiest way to visualize how they are stored, is by imagining a big table with a single row and many columns (as many as storage capacity), where each column is a binary digit (bit).

Representing our decimal 20 in such table using only 8 bits, looks like the following:

Position (Address) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|---|

Bit | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |

An unsigned 8-bit Integer is a number that can only be represented with
at most 8 binary digits. So **0b11111111** (decimal 255) is the biggest number an
unsigned 8-bit integer can store. Adding 1 to it would require a Carry operation,
which cannot be represented with the same amount of digits.

With this in mind we can easily understand why there are so many memory representations for numbers and what they effectively are: uint8 is an unsigned 8-bit Integer (decimal 0 to 255), uint16 is an unsigned 16-bit Integer (decimal 0 to 65,535). There are also uint32, uint64 and theoretically higher ones.

Signed integers, which can represent negative values too, normally use the very last bit to determine whether a number is positive (last bit = 0) or negative (last bit = 1). As you can imagine, they then are capable of storing smaller values with the same amount of memory. A signed 8-bit integer will range from decimal -128 until decimal 127.

Here's a decimal -20 represented as a signed 8-bit integer. Notice its first bit (address 0) is set (equals to 1), this marks the number as negative.

Position (Address) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|---|

Bit | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |

I hope everything is making sense so far. This introduction is very important for you to understand how computers work internally. Only then you'll feel comfortable with what PHP is actually doing under the hood, we'll have to always keep it in mind.

## Arithmetic Overflows

The way numbers are chosen to be represented (8-bits, 16-bits...) will determine their minimum and maximum value range. And that's basically because of how they are stored in memory: adding 1 to a binary digit 1 should result in a Carry operation, meaning another bit is necessary to prefix the actual number.

Since integer formats are very well defined it is not possible to rely on Carry operations that go above that limit. (IT IS actually possible, but a little insane)

Position (Address) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|---|

Bit | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |

Here we are very close to the 8-bit limit (decimal 255). If we add one to it, we'll end up with the decimal 255 and the following binary representation:

Position (Address) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|---|

Bit | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |

All bits are set! Adding 1 to this would require a Carry operation, which cannot happen
because we don't have enough bits: all 8 bits are set! This results in a thing called
**overflow**, which happens when you try to go above a certain limit. The binary operation
255 + 2 should result in 1 when you read its 8-bit result.

Position (Address) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|---|

Bit | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |

This behaviour is not random, there's a calculation involved there to determine what's the new value which is not relevant here.

## Binary numbers and strings in PHP

Ok, back to PHP! Sorry about the big detour, but I think it was necessary.

I hope by now dots are starting to connect in your head: binary numbers, how they are stored, what an overflow is, how php represents numbers...

The decimal 20 represented in a PHP integer may have two different representations, depending on your platform. The x86 platform represents it with 32 bits while the x64 does it with 64 bits, both of them are signed (allowing negatives). We all know that decimal 20 can fit in a 8-bit space, but PHP treats every decimal number as a 32 or 64 bits number.

PHP also has binary strings which can be converted back and forth by using the pack() and unpack() functions.

The main difference between binary strings and numbers in PHP is that binary strings are just holding the data, like a buffer. While PHP integers (binary or not) let us perform arithmetic operations on them such as sum and subtraction, and also binary (bitwise) operations such as AND, OR, XOR and NOT.

## Binary: Integers or Strings, which to use in PHP?

To transport data we normally use binary strings. So reading a binary file or network communication will require us to pack and unpack our binary strings.

Actual operations such as OR and XOR cannot reliably happen on strings, so we must use them with integers.

## Debugging binary values in PHP

Now comes the fun! Let's get our hands dirty and play a bit with some PHP code!

The first thing I will show you is to visualize the data. We need to understand what we're dealing with afterall.

Debugging integers is actually very very simple, we can just use the sprintf() function. Its formatting is very powerful and will help us to quickly realize what those values are.

Below I will represent the decimal 20 in a 8-bit binary format and 1-byte hexadecimal format.

```
<?php
// Decimal 20
$n = 20;
echo sprintf('%08b', $n) . "\n";
echo sprintf('%02X', $n) . "\n";
// Output:
00010100
14
```

The format `%08b`

makes the variable `$n`

to be printed as a binary representation (b)
with 8 digits (08).

The format `%02X`

represents the variable `$n`

in hexadecimal (X) and 2 digits (02).

### Visualizing binary strings

While PHP integers are always 32 or 64 bits long, strings are as long as their content. To decode their binary values and visualize what's going on we need to inspect and convert each byte.

Luckily PHP strings are dereferencable just as arrays are, and each position points to a char with 1 byte size. Here's a quick example of how chars can be accessed:

```
<?php
$str = 'thephp.website';
echo $str[3];
echo $str[4];
echo $str[5];
// Outputs:
php
```

Trusting that each char is 1 byte, we can easily call the ord() function to cast it to a 1-byte integer. Like this:

```
<?php
$str = 'thephp.website';
$f = ord($str[3]);
$s = ord($str[4]);
$t = ord($str[5]);
echo sprintf(
'%02X %02X %02X',
$f,
$s,
$t,
);
// Outputs:
70 68 70
```

We can see we're in a good path by double checking with the command line application hexdump:

```
$ echo 'php' | hexdump
// Outputs
0000000 70 68 70 ...
```

Where the first column is the address only, from the second column on we see hexadecimal
values representing the chars `p`

, `h`

and `p`

.

Additionally we may use the pack() and unpack() functions when handling binary strings and I have a great example for you right here!!

Let's say we want to read a JPEG file to fetch some of its data (like EXIF, for example). We may open the file handle using the read binary mode. Let's do this and immediately read the first 2 bytes:

```
<?php
$h = fopen('file.jpeg', 'rb');
// Read 2 bytes
$soi = fread($h, 2);
```

In order to fetch these values into an integer array we can simply unpack them like this:

```
$ints = unpack('C*', $soi);
var_dump($ints);
// Outputs
array(2) {
[1] => int(-1)
[2] => int(-40)
}
echo sprintf('%02X', $ints[1]);
echo sprintf('%02X', $ints[2]);
// Outputs
FFD8
```

Note that the format `C`

in the unpack() function will decode a char in the string `$soi`

as unsigned 8-bit numbers. The star modified `*`

makes it unpack the entire string.

## Bitwise Operations

PHP implements all bitwise operations one might need. They are built as expressions and their results are described below:

PHP Code | Name | Description |
---|---|---|

$x | $y | Inclusive Or | A value with all bits set in both $x and $y |

$x ^ $y | Exclusive Or | A value with bits set in $x or $y but never both |

$x & $y | And | A value with bits set in $x and $y at the same time only |

~$x | Not | Flips all bits in $x |

$x << $y | Left Shift | Shifts the bits of $x to the left $y times |

$x >> $y | Right Shift | Shifts the bits of $x to the right $y times |

I'll explain one by one how they work, do not worry!

Let's assume that `$x = 0x20`

and `$y = 0x30`

. The examples below will present them using binary
notation to make things clearer.

### How Inclusive Or (`$x | $y`

) works

The inclusive Or operation will produce a result taking all bits set from both inputs. So the
operation `$x | $y`

must return `0x30`

. See what's going on below:

```
// 1 | 1 = 1
// 1 | 0 = 1
// 0 | 0 = 0
0b00100000 // $x = 0x20
0b00110000 // $y = 0x30
OR ------- // $x | $y
0b00110000 // 0x30
```

**Notice:** from right to left, the 6th bit of $x was set (equals to 1) while the 5th and 6th
bits of $y were also set. The result merges both and generates a value with bits 5 and 6
set: `0x30`

.

### How Exclusive Or (`$x ^ $y`

) works

The exclusive Or (also known as Xor) will only capture bits that exist in a single side.
So the result of `$x ^ $y`

is `0x10`

. See the example below:

```
// 1 ^ 1 = 0
// 1 ^ 0 = 1
// 0 ^ 0 = 0
0b00100000 // $x = 0x20
0b00110000 // $y = 0x30
XOR ------ // $x ^ $y
0b00010000 // 0x10
```

### how And (`$x & $y`

) works

The AND operator is much simpler to understand. It performs the AND operation on each bit so only values that match on both sides at the same time will be retrieved.

The result of `$x & $y`

is `0x20`

, I show you why:

```
// 1 & 1 = 1
// 1 & 0 = 0
// 0 & 0 = 0
0b00100000 // $x = 0x20
0b00110000 // $y = 0x30
AND ------ // $x & $y
0b00100000 // 0x20
```

### How Not (`~$x`

) works

The NOT operation requires a single parameter and it simply flips all bits passed. It transforms all bits with value 0 into 1, and all bits with value 1 into 0. See below:

```
// ~1 = 0
// ~0 = 1
0b00100000 // $x = 0x20
NOT ------ // ~$x
0b11011111 // 0xDF
```

If you ran this operation in PHP and decided to debug it using sprintf() you probably noticed a much wider number, right? I'll explain to you what's going on and how to fix it below in the Normalizing integers section.

### How Left and Right shifts (`$x << $n`

and `$x >> $n`

) work

Shifting bits are the same as multiplying or dividing numbers by multiples of two. What
it does is to make all bits travel `$n`

steps to the left or right.

I'll take a smaller binary number to represent this one, so things get easier to
comprehend. Take `$x = 0b0010`

as an example. If we shift `$x`

to the left once, that
bit 1 should move one step to the left:

```
$x = 0b0010;
$x = $x << 1;
// 0b0100
```

The same happens with the right shift. Now that `$x = 0b0100`

let's shift it to the
right twice:

```
$x = 0b0100;
$x = $x >> 2;
// 0b0001
```

Effectively, shifting a number `$n`

times to the left is the same as multiplying it by
two `$n`

times and shifting a number `$n`

times to the right is the same as dividing it
by two `$n`

times.

## What is a bitmask

There are many cool things we can do with these operations and other techniques. One great technique to always remember is the bitmask.

A bitmask is just an arbitrary binary of your choice, crafted to extract a very specific information.

For example, let's take the idea that an 8-bit signed integer is positive when the 8th
bit is not set (equals 0) and is negative when it is set. I then ask the question,
is `0x20`

positive or negative? And what about `0x81`

?

For this we can craft a very convenient byte with only the negative bit set (`0b10000000`

,
equivalent to `0x80`

) and use the `AND`

operation against `0x20`

. If the result is equal to
`0x80`

(`0b10000000`

, our mask) then it is a negative number, otherwise it is a positive number:

```
// 0x80 === 0b10000000 (bitmask)
// 0x20 === 0b00100000
// 0x81 === 0b10000001
0x20 & 0x80 === 0x80 // false
0x81 & 0x80 === 0x80 // true
```

This is often necessary when you're dealing with flags. You can even find usage examples in PHP itself: the error reporting flags.

It is possible to choose what kind of errors will be reported like this:

```
error_reporting(E_WARNING | E_NOTICE);
```

What's going on there? Well, just check the value you provided:

```
0b00000010 (0x02) E_WARNING
0b00001000 (0x08) E_NOTICE
OR -------
0b00001010 (0x0A)
```

So whenever PHP sees a Notice that could be reported it will check something like this:

```
// error reporting we set before
$e_level = 0x0A;
// Needs to throw a notice
if ($e_level & E_NOTICE === E_NOTICE)
// Flag is set: throws notice
```

And you will see this everywhere! Binary files, processors, all sorts of low level stuff!

## Normalizing integers

There's this very specific thing about PHP when handling binary numbers: our integers are 32 or 64-bit wide. This means that often we will have to normalize them to be able to trust our calculations.

For example, running the following operation in a 64-bit machine will get us an odd (but expected) result:

```
echo sprintf(
'0b%08b',
~0x20
);
// Expected
0b11011111
// Actual
0b1111111111111111111111111111111111111111111111111111111111011111
```

What happened there?! Well, a `NOT`

in that 8-bit integer (`0x20`

) flipped all zero bits and
transformed them into 1s. Guess what used to be zero? Exactly, all other `56`

bits to the left
that we ignored before!

Again, this is because PHP's integers are 32 or 64-bit long no matter which value you put inside!

This still works as you would expect, though. For example the operation
`~0x20 & 0b11011111 === 0b11011111`

results in `bool(true)`

. But always keep in mind that these
bits to the left are constantly there or you might end up having weird behaviours in your code.

To solve this issue, you can normalize your integers by applying a bitmask that clears all those
zeros. For example, to normalize `~0x20`

into an 8-bit integer we must `AND`

it with `0xFF`

(`0b11111111`

) so all previous `56`

bits will be set to zero.

```
~0x20 & 0xFF
-> 0b11011111
```

**Heads up!** Never forget what you're carrying in your variables otherwise you may end up with
an unexpected behavior. For example, let's see what happens when we right shift the above
value with and without 8-bit masking.

```
~0x20 & 0xFF
-> 0b11011111
0b11011111 >> 2
-> 0b00110111 // expected
(~0x20 & 0xFF) >> 2
-> 0b00110111 // expected
(~0x20 >> 2) & 0xFF
-> 0b11110111 // expected?
```

Just to make it clear: from the PHP stand point this IS expected, because you're clearly handling a 64-bit integer there. You must make it clear what YOUR program expects.

**Pro tip:** avoid silly mistakes like these by coding with TDD.

## Conclusion: binary is cool and so is PHP

I hope you enjoyed your read as much as I enjoyed writing this blog post. Most importantly: I hope this knowledge will enable you to take an adventure in this amazing world of binary data.

With these tools in hand, everything else is just a matter of finding the proper documentation on how binary files/protocols behave. Everything is a binary sequence after all.

I highly recommend you to have a look at the PDF spec, or the EXIF for image metadata. You may even want to play with your own implementation of the MessagePack serialization format or maybe Avro, Protobuf... Endless possibilities!

As you might have noticed, this article took me quite a bit (see what I did?) to write. If you'd like to reward the effort, please be so nice to share it and bookmark if you need to use it as reference.

Maybe soon I'll come back with some practical binary stuff :)

Cheers!