Strings and Bytes

There are two ways to work with text.

  1. Let ctypes implicitly convert your str or bytes to a NULL terminated char or wide char array before passing it to a C function which takes char * or wchar_t * arguments. (A null terminated string has an additional zero character on the end to signify the end of the string.) This is slower (due to the conversion, although this can be cached) but more straightforward.

  2. Treat your text as an array, passing raw pointers (the memory address of an object, stored in another variable) to C. This is harder but much more efficient.

This page focuses on option (1). For option (2) see Buffers and Arrays.
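As a rough illustration of the conversion behind the first option, here is a minimal sketch using only the standard ctypes module, with no compiled C involved. It assumes that the implicit conversion behaves like constructing a ctypes.c_wchar_p from a str (for wchar_t * arguments) or a ctypes.c_char_p from a bytes (for char * arguments).

```python
import ctypes

# A str maps onto wchar_t *, a bytes maps onto char *.
wide = ctypes.c_wchar_p("hello")
narrow = ctypes.c_char_p(b"hello")

print(wide.value)    # the str read back through the wchar_t pointer
print(narrow.value)  # the bytes read back through the char pointer

# The two are not interchangeable: a str is rejected where bytes are expected.
mixed_types_error = None
try:
    ctypes.c_char_p("hello")
except TypeError as error:
    mixed_types_error = error
print(type(mixed_types_error).__name__)
```

In practice this means choosing between str with wchar_t and bytes with char up front, and encoding or decoding in Python if you need to switch.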

Reading strings

Passing strings from Python is easy. We demonstrate it with an equivalent to Python's str.count(), with the simplification that the sub-string we are counting is only one character. If you want to work with bytes instead, simply replace wchar_t with char.

strings-demo.c
// Required to use ``wchar_t``.
#include <stddef.h>

size_t count(wchar_t * text, wchar_t character) {
  /* Count how many times ``character`` appears in ``text``. */
  size_t out = 0;
  for (size_t i = 0; text[i] != 0; i++) {
    if (text[i] == character) {
      out++;
    }
  }
  return out;
}

Or a more C-savvy person may prefer the equivalent but punchier code:

size_t count_(wchar_t * text, wchar_t character) {
  /* Same as ``count()`` but written more compactly. */
  size_t out = 0;
  for (; *text; text++)
    out += (character == *text);
  return out;
}

The Python end is very no-nonsense. Let's compile the above:

from cslug import CSlug
slug = CSlug("strings-demo.c")

And run it:

>>> slug.dll.count("hello", "l")
2

Yay, it works! Now that we've got it going, let's talk about the code.

Notice the types of the inputs to count(): wchar_t * and wchar_t. wchar_t * accepts a str of arbitrary length but wchar_t accepts only a single character str. We could have used pointers for both arguments, but using just wchar_t adds an implicit check that our single character argument is indeed singular:

>>> slug.dll.count("This will break", "will")
ctypes.ArgumentError: argument 2: <class 'TypeError'>: wrong type
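The same check can be seen in isolation, without any compiled C, by constructing a ctypes.c_wchar directly. This is only a sketch of the underlying behaviour: the exception raised here is a plain TypeError, rather than the ctypes.ArgumentError that wraps it when the check happens during a function call.

```python
import ctypes

# A single character converts to a wchar_t without complaint.
single = ctypes.c_wchar("l")
print(single.value)

# Anything longer is refused before any C code would run.
rejected = None
try:
    ctypes.c_wchar("will")
except TypeError as error:
    rejected = error
print(type(rejected).__name__)
```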

You may also notice that we've avoided having to specify the string's length anywhere. Instead we just use text[i] != 0; to tell us when to stop the for loop. Here we are taking advantage of the fact that Python strings are NULL terminated, so to find the end of a string we simply need to find the NULL (integer 0) at the end. There is a catch to doing this though. If our string contained nulls in it then this function would exit prematurely. By default, Python won't allow us to make this mistake:

>>> slug.dll.count("This sentence \x00 contains \x00 Nulls.", "a")
ctypes.ArgumentError: argument 1: <class 'ValueError'>: embedded null character

However if we force our way through…

>>> import ctypes
>>> slug.dll.count(ctypes.create_unicode_buffer("One z \x00 lots of zzzzzzzz"), "z")
1

If your string is likely to contain NULLs then pass the string length as a separate parameter and use that to define your for loops.
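The difference between the two stopping rules can be previewed from Python alone. In this sketch (standard ctypes only, no compiled C), reading .value stops at the first null in the same way as a loop testing text[i] != 0, whereas slicing with an explicit length reads every character, much as a C loop bounded by a length parameter would.

```python
import ctypes

text = "One z \x00 lots of zzzzzzzz"
buffer = ctypes.create_unicode_buffer(text)

# .value stops at the first null, like a loop testing text[i] != 0.
until_null = buffer.value.count("z")

# Slicing with an explicit length reads every character, embedded nulls included.
with_length = buffer[:len(text)].count("z")

print(until_null, with_length)
```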

Caching the conversion overhead

When you pass a str or bytes to C you implicitly call ctypes.create_unicode_buffer() or ctypes.create_string_buffer(), performing a conversion or copy, before passing the result to C. If you pass the same string to C multiple times then this conversion is repeated redundantly. To avoid this, do the conversion yourself. For example, this performs the conversion twice:

a = "Imagine that this string is a lot longer than it actually is."
slug.dll.count(a, "x")
slug.dll.count(a, "y")

Whereas this performs only one conversion:

a = "Imagine that this string is a lot longer than it actually is."
a_buffer = ctypes.create_unicode_buffer(a)
slug.dll.count(a_buffer, "x")
slug.dll.count(a_buffer, "y")
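If the same strings turn up in many places, the explicit conversion can be wrapped in a small cache. The following is only a sketch: as_buffer() is a hypothetical helper, not part of cslug or ctypes, and it uses functools.lru_cache from the standard library to hand back the same buffer for equal strings.

```python
import ctypes
from functools import lru_cache


@lru_cache(maxsize=128)
def as_buffer(text):
    """Hypothetical helper: convert ``text`` once and reuse the result."""
    return ctypes.create_unicode_buffer(text)


a = "Imagine that this string is a lot longer than it actually is."
first = as_buffer(a)
second = as_buffer(a)  # Served from the cache, so no second conversion.
print(first is second)
```

Only use something like this for buffers that C reads but never writes. A cached buffer that C has modified would be handed back in its modified state the next time the same string is requested.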

Writing to strings

Writing to strings in place or to new strings is possible but not so streamlined.

  1. In order to avoid the cacophony of memory issues that is creating and sharing buffers in C, strings should only be created in Python. To write a string in C, create an empty one of the right length then give it to C to populate. This unfortunately means that you must know how long your string will be before you write it.

  2. As we've seen above, strings are converted to ctypes character arrays when passed to a C function. Writing to the converted array does not update the original, and the converted array is discarded immediately after the function is complete, losing any changes the function made. To avoid this we must do the conversion explicitly.

We'll show these in our next example: a C function which outputs the reverse of a str:

reverse.c
#include <stddef.h>

void reverse(wchar_t * text, wchar_t * out, int length) {
  // Copies `text` into `out` in reverse order.
  for (int i = 0; i < length; i++)
    out[length - i - 1] = text[i];
}

Notice that the output string is an argument rather than a return value. This is in accordance with complication (1) above. Let's compile the C code:

import ctypes
from cslug import CSlug

slug = CSlug("reverse.c")

And give ourselves something to reverse:

in_ = "Reverse this string."

Before using our C function, we need to make it an output to populate. Because of complication (2), this must be a ctypes.Array instead of a generic Python str. (Try giving it a Python str anyway to see what happens.)

out = ctypes.create_unicode_buffer(len(in_))
slug.dll.reverse(in_, out, len(in_))
>>> out.value
'.gnirts siht esreveR'
>>> out.value == in_[::-1]
True

Whenever you write a C function which requires weird handling in Python you should write a wrapper function to keep the weirdness out of the way.

def reverse(in_):
    """
    Returns a :class:`str` in reverse order.
    """
    out = ctypes.create_unicode_buffer(len(in_))
    slug.dll.reverse(in_, out, len(in_))
    return out.value
>>> reverse(".esu ot reisae hcum si noitcnuf sihT")
'This function is much easier to use.'
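The copy behaviour behind the second complication can be checked without compiling anything. This sketch, which assumes only the standard ctypes module, modifies a converted array from Python in the way a C function might, then shows that the str it was created from is untouched. That is why reverse() has to be given an explicit buffer and read its result back from out.value.

```python
import ctypes

original = "Reverse this string."
converted = ctypes.create_unicode_buffer(original)

# Overwrite the first character of the ctypes array, as a C function might.
converted[0] = "r"

print(converted.value)  # the array has changed
print(original)         # the str it was created from has not
```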

Null terminated or not null terminated?

In C, strings are automatically null terminated if you define them with:

char string[] = "literal";

or for unicode strings:

wchar_t string[] = L"literal";

If you specify the length of the string then any spare characters are nulls:

char string[4] = "hello";  // Array too short to fit "hello", truncated to "hell" with a build warning.
char string[5] = "hello";  // Not null terminated.
char string[6] = "hello";  // Null terminated.
char string[7] = "hello";  // Double null terminated.

Similarly in ctypes, both create_string_buffer() and create_unicode_buffer() append a null if the length is unspecified:

>>> ctypes.create_unicode_buffer("hello")[:]
'hello\x00'
>>> ctypes.create_unicode_buffer("hello\x00")[:]
'hello\x00\x00'

And set any spare characters to '\x00' if the length is specified:

>>> ctypes.create_unicode_buffer("hello", 4)[:]
ValueError: string too long
>>> ctypes.create_unicode_buffer("hello", 5)[:]
'hello'
>>> ctypes.create_unicode_buffer("hello", 6)[:]
'hello\x00'
>>> ctypes.create_unicode_buffer("hello", 7)[:]
'hello\x00\x00'

In any other case you should assume that a string is not null terminated unless the documentation for the particular function you are using says it writes null-terminated strings.
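The examples above use create_unicode_buffer(). For completeness, here is a sketch of the same rules applied to bytes with create_string_buffer(), using only the standard ctypes module. The .raw attribute returns every byte in the array, including any trailing nulls.

```python
import ctypes

# Length unspecified: one null is appended after the contents.
implicit = ctypes.create_string_buffer(b"hello")
print(implicit.raw)

# Length specified: spare bytes are nulls, and an exact fit leaves no terminator.
exact = ctypes.create_string_buffer(b"hello", 5)
padded = ctypes.create_string_buffer(b"hello", 7)
print(exact.raw, padded.raw)

# A length that is too short is rejected rather than truncated.
too_short = None
try:
    ctypes.create_string_buffer(b"hello", 4)
except ValueError as error:
    too_short = error
print(type(too_short).__name__)
```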

Note

The '\x00' character is an escape sequence (just like '\n'), not a literal backslash followed by an x and two zeros.

>>> len('\x00')
1
>>> print('invisible \x00 character')
invisible  character

In the unlikely event that you want to type it literally, use a double backslash or raw string:

>>> print('\\x00')
\x00
>>> print(r'\x00')
\x00
>>> len('\\x00')
4