Null-terminated multibyte strings
A null-terminated multibyte string (NTMBS), or "multibyte string", is a sequence of nonzero bytes followed by a byte with value zero (the terminating null character).
Each character stored in the string may occupy more than one byte. The encoding used to represent characters in a multibyte character string is locale-specific: it may be UTF-8, GB18030, EUC-JP, Shift-JIS, etc. For example, the char array { '\xe4','\xbd','\xa0','\xe5','\xa5','\xbd','\0' } is an NTMBS holding the string "你好" in UTF-8 multibyte encoding: the first three bytes encode the character 你, the next three bytes encode the character 好. The same string encoded in GB18030 is the char array { '\xc4', '\xe3', '\xba', '\xc3', '\0' }
In some multibyte encodings, any given multibyte character sequence may represent different characters depending on the previous byte sequences, known as "shift sequences". Such encodings are known as state-dependent: knowledge of the current shift state is required to interpret each character. An NTMBS is only valid if it begins and ends in the initial shift state: if a shift sequence was used, the corresponding unshift sequence has to be present before the terminating null character. Examples of such encodings are BOCU-1 and SCSU
A multibyte character string is layout-compatible with null-terminated byte string
Functions
Multibyte/wide character conversions | |
Defined in header
<stdlib.h> | |
returns the number of bytes in the next multibyte character (function) |
|
converts the next multibyte character to wide character (function) |
|
(C11)
|
converts a wide character to its multibyte representation (function) |
(C11)
|
converts a narrow multibyte character string to wide string (function) |
(C11)
|
converts a wide string to narrow multibyte character string (function) |
Defined in header
<wchar.h> | |
(C95)
|
checks if the mbstate_t object represents initial shift state (function) |
(C95)
|
widens a single-byte narrow character to wide character, if possible (function) |
(C95)
|
narrows a wide character to a single-byte narrow character, if possible (function) |
(C95)
|
returns the number of bytes in the next multibyte character, given state (function) |
(C95)
|
converts the next multibyte character to wide character, given state (function) |
(C95)(C11)
|
converts a wide character to its multibyte representation, given state (function) |
(C95)(C11)
|
converts a narrow multibyte character string to wide string, given state (function) |
(C95)(C11)
|
converts a wide string to narrow multibyte character string, given state (function) |
Defined in header
<uchar.h> | |
(C23)
|
converts a narrow multibyte character to UTF-8 encoding (function) |
(C23)
|
converts UTF-8 string to narrow multibyte encoding (function) |
(C11)
|
converts a narrow multibyte character to UTF-16 encoding (function) |
(C11)
|
converts a UTF-16 character to narrow multibyte encoding (function) |
(C11)
|
converts a narrow multibyte character to UTF-32 encoding (function) |
(C11)
|
converts a UTF-32 character to narrow multibyte encoding (function) |
Types
Defined in header
<wchar.h> | |
(C95)
|
conversion state information necessary to iterate multibyte character strings (class) |
Defined in header
<uchar.h> | |
(C23)
|
8-bit character type (typedef) |
(C11)
|
16-bit character type (typedef) |
(C11)
|
32-bit character type (typedef) |
Macros
Defined in header
<limits.h> | |
MB_LEN_MAX |
maximum number of bytes in a multibyte character, for any supported locale (macro constant) |
Defined in header
<stdlib.h> | |
MB_CUR_MAX |
maximum number of bytes in a multibyte character, in the current locale (macro variable) |
References
- C23 standard (ISO/IEC 9899:2024):
- 7.10 Sizes of integer types <limits.h> (p: TBD)
- 7.22 General utilities <stdlib.h> (p: TBD)
- 7.28 Unicode utilities <uchar.h> (p: TBD)
- 7.29 Extended multibyte and wide character utilities <wchar.h> (p: TBD)
- 7.31.12 General utilities <stdlib.h> (p: TBD)
- 7.31.16 Extended multibyte and wide character utilities <wchar.h> (p: TBD)
- K.3.6 General utilities <stdlib.h> (p: TBD)
- K.3.9 Extended multibyte and wide character utilities <wchar.h> (p: TBD)
- C17 standard (ISO/IEC 9899:2018):
- 7.10 Sizes of integer types <limits.h> (p: TBD)
- 7.22 General utilities <stdlib.h> (p: TBD)
- 7.28 Unicode utilities <uchar.h> (p: TBD)
- 7.29 Extended multibyte and wide character utilities <wchar.h> (p: TBD)
- 7.31.12 General utilities <stdlib.h> (p: TBD)
- 7.31.16 Extended multibyte and wide character utilities <wchar.h> (p: TBD)
- K.3.6 General utilities <stdlib.h> (p: TBD)
- K.3.9 Extended multibyte and wide character utilities <wchar.h> (p: TBD)
- C11 standard (ISO/IEC 9899:2011):
- 7.10 Sizes of integer types <limits.h> (p: 222)
- 7.22 General utilities <stdlib.h> (p: 340-360)
- 7.28 Unicode utilities <uchar.h> (p: 398-401)
- 7.29 Extended multibyte and wide character utilities <wchar.h> (p: 402-446)
- 7.31.12 General utilities <stdlib.h> (p: 456)
- 7.31.16 Extended multibyte and wide character utilities <wchar.h> (p: 456)
- K.3.6 General utilities <stdlib.h> (p: 604-614)
- K.3.9 Extended multibyte and wide character utilities <wchar.h> (p: 627-651)
- C99 standard (ISO/IEC 9899:1999):
- 7.10 Sizes of integer types <limits.h> (p: 203)
- 7.20 General utilities <stdlib.h> (p: 306-324)
- 7.24 Extended multibyte and wide character utilities <wchar.h> (p: 348-392)
- 7.26.10 General utilities <stdlib.h> (p: 402)
- 7.26.12 Extended multibyte and wide character utilities <wchar.h> (p: 402)
- C89/C90 standard (ISO/IEC 9899:1990):
- 4.1.4 Limits <float.h> and <limits.h>
- 4.10 GENERAL UTILITIES <stdlib.h>
- 4.13.7 General utilities <stdlib.h>
See also
C++ documentation for Null-terminated multibyte strings
|