Strings and Bytes
There are two ways to work with text:
1. Let ctypes implicitly convert your str or bytes to a null-terminated char or wide char array before passing it to a C function which takes char * or wchar_t * arguments. (A null-terminated string has an additional zero character at the end to mark the end of the string.) This is slower, due to the conversion (although this can be cached), but more straightforward.
2. Treat your text as an array, passing raw pointers to C. (A pointer is the memory address of an object, stored in another variable.) This is harder but much more efficient.
This page focuses on option 1. For option 2, see Buffers and Arrays.
Reading strings
Passing strings from Python is easy. We demonstrate it with an equivalent to Python's str.count(), with the simplification that the sub-string we are counting is only one character. If you want to work with bytes instead, simply replace wchar_t with char.
// Required to use ``wchar_t``.
#include <stddef.h>

size_t count(wchar_t * text, wchar_t character) {
  /* Count how many times ``character`` appears in ``text``. */
  size_t out = 0;
  for (size_t i = 0; text[i] != 0; i++) {
    if (text[i] == character) {
      out++;
    }
  }
  return out;
}
A more C-savvy reader may prefer this equivalent but more compact version:
size_t count_(wchar_t * text, wchar_t character) {
  /* Same as ``count()`` but written more compactly. */
  size_t out = 0;
  for (; *text; text++)
    out += (character == *text);
  return out;
}
The Python end is very no-nonsense. Let's compile the above:
from cslug import CSlug
slug = CSlug("strings-demo.c")
And run it:
>>> slug.dll.count("hello", "l")
2
Yay, it works! Now that we've got it going, let's talk about the code.
Notice the types of the inputs to count(): wchar_t * and wchar_t. wchar_t * accepts a str of arbitrary length, but wchar_t accepts only a single-character str. We could have used pointers for both arguments, but using just wchar_t adds an implicit check that our single-character argument is indeed singular:
>>> slug.dll.count("This will break", "will")
ctypes.ArgumentError: argument 2: <class 'TypeError'>: wrong type
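The same check can be reproduced without compiling anything, on the assumption that a wchar_t argument is converted through ctypes.c_wchar (the ctypes counterpart of wchar_t). This is only a sketch, and the helper name is_single_character is made up for illustration:

```python
import ctypes


def is_single_character(text):
    """Return True if ctypes accepts ``text`` as a single ``wchar_t``.

    Hypothetical helper, for illustration only.
    """
    try:
        ctypes.c_wchar(text)
    except TypeError:
        # ctypes rejects any str that is not exactly one character long.
        return False
    return True


print(is_single_character("l"))     # True
print(is_single_character("will"))  # False
```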
You may also notice that we've avoided having to specify the string's length anywhere. Instead we just use text[i] != 0 to tell us when to stop the for loop. Here we are taking advantage of the fact that the arrays ctypes creates from Python strings are null terminated, so to find the end of a string we simply need to find the null (integer 0) at the end. There is a catch, though: if our string contained nulls, this function would exit prematurely. By default, Python won't allow us to make this mistake:
>>> slug.dll.count("This sentence \x00 contains \x00 Nulls.", "a")
ctypes.ArgumentError: argument 1: <class 'ValueError'>: embedded null character
However if we force our way through…
>>> import ctypes
>>> slug.dll.count(ctypes.create_unicode_buffer("One z \x00 lots of zzzzzzzz"), "z")
1
If your string is likely to contain nulls, pass the string length as a separate parameter and use that to bound your for loops.
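To see why a separate length is needed, it can help to inspect such a buffer from Python. The sketch below uses only ctypes and the same example string; the variable names are illustrative:

```python
import ctypes

text = "One z \x00 lots of zzzzzzzz"
buffer = ctypes.create_unicode_buffer(text)

# ``.value`` stops at the first null, which is all that a C loop testing
# ``text[i] != 0`` would ever see.
seen_by_null_check = buffer.value

# Slicing returns every element of the array, embedded and trailing nulls
# included.
whole_array = buffer[:]

# The number of characters to pass to C as a separate length parameter.
length = len(text)

print(repr(seen_by_null_check))  # 'One z '
```

The array itself holds one element more than length because of the null that ctypes appends.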
Caching the conversion overhead
When you pass a str or bytes to C, you implicitly call ctypes.create_unicode_buffer() or ctypes.create_string_buffer(), performing a conversion or copy, before the result is passed to C. If you pass the same string to C multiple times, this conversion is repeated redundantly. To avoid this, do the conversion yourself. For example, this performs the conversion twice:
a = "Imagine that this string is a lot longer than it actually is."
slug.dll.count(a, "x")
slug.dll.count(a, "y")
Whereas this performs only one conversion:
a = "Imagine that this string is a lot longer than it actually is."
a_buffer = ctypes.create_unicode_buffer(a)
slug.dll.count(a_buffer, "x")
slug.dll.count(a_buffer, "y")
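If the same strings recur in many places, the explicit conversion could be wrapped in a small cache. The following is one possible sketch rather than anything cslug provides; as_buffer is a hypothetical helper built on functools.lru_cache:

```python
import ctypes
from functools import lru_cache


@lru_cache(maxsize=32)
def as_buffer(text):
    """Convert ``text`` to a ctypes array, reusing earlier conversions.

    Hypothetical helper: only safe for C functions that read the array
    without modifying it, since every caller shares the same buffer.
    """
    return ctypes.create_unicode_buffer(text)


a = "Imagine that this string is a lot longer than it actually is."
first = as_buffer(a)
second = as_buffer(a)  # Served from the cache, no second conversion.
print(first is second)  # True
```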
Writing to strings
Writing to strings in place or to new strings is possible but not as streamlined. There are two complications:
1. In order to avoid the cacophony of memory issues that is creating and sharing buffers in C, strings should only be created in Python. To write a string in C, create an empty one of the right length, then give it to C to populate. This unfortunately means that you must know how long your string will be before you write it.
2. As we've seen above, strings are converted to ctypes character arrays when passed to a C function. Writing to the converted array does not update the original, and the converted array is discarded as soon as the function returns, losing any changes the function made. To avoid this, we must do the conversion explicitly.
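The fact that the converted array is a copy can be checked from Python alone. In this sketch an item assignment stands in for a C function writing into the array it was given:

```python
import ctypes

original = "hello"
converted = ctypes.create_unicode_buffer(original)

# Stand-in for a C function writing into the array it was given.
converted[0] = "j"

print(converted.value)  # jello
print(original)         # hello
```

Only the explicitly created array sees the change, which is why the caller needs to hold on to it.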
We'll show both in our next example: a C function which outputs the reverse of a str:
#include <stddef.h>

void reverse(wchar_t * text, wchar_t * out, int length) {
  // Copies `text` into `out` in reverse order.
  for (int i = 0; i < length; i++)
    out[length - i - 1] = text[i];
}
Notice that the output string is an argument rather than a return value. This is in accordance with complication (1) above.
Let's compile the C code:
import ctypes
from cslug import CSlug
slug = CSlug("reverse.c")
And give ourselves something to reverse:
in_ = "Reverse this string."
Before using our C function, we need to make an output for it to populate. Because of complication (2), this must be a ctypes.Array instead of a generic Python str. (Try giving it a Python str anyway to see what happens.)
out = ctypes.create_unicode_buffer(len(in_))
slug.dll.reverse(in_, out, len(in_))
>>> out.value
'.gnirts siht esreveR'
>>> out.value == in_[::-1]
True
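To make the roles of the two buffers clearer, here is a pure-Python stand-in for the C function. It is illustrative only (reverse_into is not part of cslug or ctypes), but it writes into the ctypes array in the same way the C loop does:

```python
import ctypes


def reverse_into(text, out, length):
    """Pure-Python stand-in for the C ``reverse()`` function."""
    for i in range(length):
        out[length - i - 1] = text[i]


in_ = "Reverse this string."
# Passing an integer creates an array of that many null characters.
out = ctypes.create_unicode_buffer(len(in_))
blank = out[:]

reverse_into(in_, out, len(in_))
print(out.value)  # .gnirts siht esreveR
```

Because out has exactly len(in_) elements, there is no room for a terminating null, which is why the length is passed explicitly.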
Whenever you write a C function which requires unusual handling in Python, you should write a wrapper function to keep the weirdness out of the way.
def reverse(in_):
    """
    Returns a :class:`str` in reverse order.
    """
    out = ctypes.create_unicode_buffer(len(in_))
    slug.dll.reverse(in_, out, len(in_))
    return out.value
>>> reverse(".esu ot reisae hcum si noitcnuf sihT")
'This function is much easier to use.'
Null terminated or not null terminated?
In C, strings are automatically null terminated if you define them with:
char string[] = "literal";
or for unicode strings:
wchar_t string[] = L"literal";
If you specify the length of the string then any spare characters are nulls:
char string[4] = "hello"; // Array too short to fit "hello", truncated to "hell" with a build warning.
char string[5] = "hello"; // Not null terminated.
char string[6] = "hello"; // Null terminated.
char string[7] = "hello"; // Double null terminated.
Similarly in ctypes, both create_string_buffer() and create_unicode_buffer() append a null if the length is unspecified:
>>> ctypes.create_unicode_buffer("hello")[:]
'hello\x00'
>>> ctypes.create_unicode_buffer("hello\x00")[:]
'hello\x00\x00'
And set any spare characters to '\x00' if the length is specified:
>>> ctypes.create_unicode_buffer("hello", 4)[:]
ValueError: string too long
>>> ctypes.create_unicode_buffer("hello", 5)[:]
'hello'
>>> ctypes.create_unicode_buffer("hello", 6)[:]
'hello\x00'
>>> ctypes.create_unicode_buffer("hello", 7)[:]
'hello\x00\x00'
In any other case, assume that a string is not null terminated unless the documentation for the particular function you are using says that it writes null-terminated strings.
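The examples above use create_unicode_buffer(). As a quick check that create_string_buffer() follows the same rules for bytes, here is a short sketch using only ctypes; the variable names are illustrative:

```python
import ctypes

# Length unspecified: a terminating null is appended.
auto = ctypes.create_string_buffer(b"hello")

# Length specified: exactly that many bytes, padded with nulls if needed.
exact = ctypes.create_string_buffer(b"hello", 5)
padded = ctypes.create_string_buffer(b"hello", 7)

print(auto.raw)    # b'hello\x00'
print(exact.raw)   # b'hello'
print(padded.raw)  # b'hello\x00\x00'

# A length too short for the initial value raises a ValueError.
try:
    ctypes.create_string_buffer(b"hello", 4)
    too_long_rejected = False
except ValueError:
    too_long_rejected = True
```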
Note
The '\x00' character is an escape sequence (just like '\n'), not a literal backslash followed by an x and two zeros.
>>> len('\x00')
1
>>> print('invisible \x00 character')
invisible character
In the unlikely event that you want to type it literally, use a double backslash or raw string:
>>> print('\\x00')
\x00
>>> print(r'\x00')
\x00
>>> len('\\x00')
4