From ddca27a6cf1390a74f13caae818989422e614222 Mon Sep 17 00:00:00 2001
From: Stefan Behnel
Date: Mon, 15 Apr 2013 19:15:23 +0200
Subject: [PATCH] document c_string_type and c_string_encoding directives in
 string tutorial

---
 docs/src/tutorial/strings.rst | 71 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 71 insertions(+)

diff --git a/docs/src/tutorial/strings.rst b/docs/src/tutorial/strings.rst
index a745a82..cea7d76 100644
--- a/docs/src/tutorial/strings.rst
+++ b/docs/src/tutorial/strings.rst
@@ -305,6 +305,77 @@ For C++ strings, decoding slices will always take the
 proper length of the string into account and apply Python slicing
 semantics (e.g. return empty strings for out-of-bounds indices).
 
+
+Auto encoding and decoding
+--------------------------
+
+Cython 0.19 comes with two new directives: ``c_string_type`` and
+``c_string_encoding``. They can be used to change the Python string
+types that C/C++ strings coerce from and to. By default, they only
+coerce from and to the bytes type, and encoding or decoding must
+be done explicitly, as described above.
+
+There are two use cases where this is inconvenient. First, if all
+C strings that are being processed (or the large majority) contain
+text, automatic encoding and decoding from and to Python unicode
+objects can reduce the code overhead a little. In this case, you
+can set the ``c_string_type`` directive in your module to ``unicode``
+and the ``c_string_encoding`` to the encoding that your C code uses,
+for example::
+
+    # cython: c_string_type=unicode, c_string_encoding=utf8
+
+    cdef char* c_string = 'abcdefg'
+
+    # implicit decoding:
+    cdef object py_unicode_object = c_string
+
+    # explicit conversion to Python bytes:
+    py_bytes_object = <bytes>c_string
+
+The second use case is when all C strings that are being processed
+only contain ASCII encodable characters (e.g. numbers) and you want
+your code to use the native legacy string type in Python 2 for them,
+instead of always using Unicode. In this case, you can set the
+string type to ``str``::
+
+    # cython: c_string_type=str, c_string_encoding=ascii
+
+    cdef char* c_string = 'abcdefg'
+
+    # implicit decoding in Py3, bytes conversion in Py2:
+    cdef object py_str_object = c_string
+
+    # explicit conversion to Python bytes:
+    py_bytes_object = <bytes>c_string
+
+    # explicit conversion to Python unicode:
+    py_unicode_object = <unicode>c_string
+
+The other direction, i.e. automatic encoding to C strings, is only
+supported for the ASCII codec (and the "default encoding", which is
+runtime specific and may or may not be ASCII). This is because
+CPython handles the memory management in this case by keeping an
+encoded copy of the string alive together with the original unicode
+string. Otherwise, there would be no way to limit the lifetime of
+the encoded string in any sensible way, thus rendering any attempt to
+extract a C string pointer from it a dangerous endeavour. As long
+as you stick to the ASCII encoding for the ``c_string_encoding``
+directive, though, the following will work::
+
+    # cython: c_string_type=unicode, c_string_encoding=ascii
+
+    def func():
+        ustring = u'abc'
+        cdef char* s = ustring
+        return s[0]    # returns u'a'
+
+(This example uses a function context in order to safely control the
+lifetime of the Unicode string. Global Python variables can be
+modified from the outside, which makes it dangerous to rely on the
+lifetime of their values.)
+
+
 Source code encoding
 --------------------
--
2.7.4
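For readers without a Cython toolchain at hand, the implicit coercions documented above correspond to explicit ``decode``/``encode`` calls in plain Python. The following is a sketch of that correspondence only, not the patch's own code; the variable names mirror the examples above but are otherwise illustrative:

```python
# Plain-Python sketch of the coercions described in the patch.
# A C ``char*`` buffer is modelled here as a bytes object.

c_string = b'abcdefg'

# c_string_type=unicode, c_string_encoding=utf8 -- implicit decoding:
py_unicode_object = c_string.decode('utf8')
assert py_unicode_object == u'abcdefg'

# The reverse direction is restricted to ASCII; CPython keeps the
# encoded copy alive alongside the original unicode string, which is
# what makes handing out a C pointer to it safe:
ustring = u'abc'
encoded = ustring.encode('ascii')
assert encoded == b'abc'

# Non-ASCII input makes the automatic encoding fail loudly rather
# than silently corrupt data:
try:
    u'abc\u00e9'.encode('ascii')
except UnicodeEncodeError:
    print('non-ASCII text cannot be auto-encoded to a C string')
```

The ASCII restriction on the encoding direction is visible here: only text that round-trips through the ASCII codec can be handed to C without an explicit, user-managed encode step.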