Demystify Strings


Recently I was reading somebody's code and noticed a few misuses of data structures like BSTR and CString. This reminded me of the days when I first started writing programs on Windows and ran into an overwhelming number of string data structures. That inspired me to write this article and demystify some of the basic concepts related to strings.
What Is a String?
In simple words, a string is a contiguous sequence of characters whose start and end are well defined. Typically, the start of the string is identified by the address of its first character. The end of the string is determined by a convention chosen by the programming language (for details, see "End of String" later in this article). All the characters in memory between the start and the end together define the string.
Character-Encoding Schemes
Since the character is the building block of the string, let us first understand it in detail. Typically, a character is 8 bits long (char), so it can store up to 256 values. Of these 256 values, ASCII standardizes the first 128, which include digits, common punctuation, the English alphabet in lower and upper case, and so on. The other 128 values are not standardized, which means they can represent different things depending on something called the codepage. In the standard English codepage, they hold some Greek letters, more punctuation such as the ellipsis, character-mode graphics symbols, and other symbols. Non-English codepages store their native alphabets in these 128 values. This encoding scheme, in which a character is 8 bits wide, is commonly referred to as a Single-byte Character-set or SBCS.
As you can observe, a single byte per character is not enough for languages (such as East Asian languages) that have more than 256 characters. One solution to this problem is a Multi-byte Character-set (MBCS), also called a Variable-width Character-set, in which a character takes either one or two bytes (that is, one or two char values). Since the maximum size of a character is two bytes, this encoding scheme is also referred to as a Double-byte Character-set or DBCS. Because not all characters in MBCS have the same size, you need to use special functions to traverse the characters in the string.
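To see why plain pointer arithmetic is not enough for MBCS text, here is a minimal, portable sketch of a character-counting loop. It assumes the Shift-JIS lead-byte ranges (codepage 932) and hard-codes them in a hypothetical helper, IsLeadByteSjis(); real Windows code would instead ask the system through facilities such as IsDBCSLeadByte() or _mbsinc().

```cpp
#include <cassert>
#include <cstddef>

// Assumption: Shift-JIS (codepage 932) lead-byte ranges, hard-coded for
// illustration only. This helper is hypothetical, not a Windows API.
bool IsLeadByteSjis(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Counts characters (not bytes) in a null-terminated DBCS string.
std::size_t CountDbcsChars(const char* s)
{
    assert(s != nullptr);
    std::size_t count = 0;
    while (*s != '\0') {
        const unsigned char b = static_cast<unsigned char>(*s);
        // A lead byte followed by a non-null byte forms one two-byte character.
        s += (IsLeadByteSjis(b) && s[1] != '\0') ? 2 : 1;
        ++count;
    }
    return count;
}
```

With this sketch, a four-byte string made of 'A', one two-byte character and 'B' counts as three characters, whereas strlen() would report four.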
Besides the size limitation of SBCS and the inconvenience of MBCS, both are localized character sets, which means they are specific to a given language or region. Unicode, by contrast, assigns a unique number to every character, so a single string can mix text from several languages. On Windows, a Unicode character is stored in a 16-bit wchar_t, which can hold 65,536 distinct values. That covers the alphabets in common use; characters outside this range are encoded as a pair of 16-bit units (a surrogate pair). In this way, the Unicode standard provides a unique representation for every character irrespective of platform, programming language and regional language.
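The difference between 16-bit units and characters can be made concrete with a small sketch. It uses the standard char16_t type (which is 16 bits on every platform, unlike wchar_t) and a hypothetical helper, CountCodePoints(), that assumes well-formed UTF-16 input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts Unicode code points in a UTF-16 string.
// Assumes well-formed input: every high surrogate is followed by a low one.
std::size_t CountCodePoints(const std::u16string& s)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t u = s[i];
        // At this point we are never on the second half of a pair.
        assert(!(u >= 0xDC00 && u <= 0xDFFF) && "unpaired low surrogate");
        // A high surrogate (0xD800-0xDBFF) starts a two-unit pair: skip the low half.
        if (u >= 0xD800 && u <= 0xDBFF && i + 1 < s.size()) {
            ++i;
        }
        ++count;
    }
    return count;
}
```

For text made only of characters below 0x10000, the result equals size(); for a character such as U+1F600, size() is 2 while the function returns 1.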
Common Data-Types
The char data type represents a single byte and is used for SBCS/MBCS strings. The wchar_t data type represents a wide character and is used for Unicode strings. To distinguish a wide-character string literal from a single-byte one, the literal must be prefixed with "L" (see the usage of "L" in the example implementation given below).
Sometimes an application has to run both on operating systems that support only single-byte strings and on those that support Unicode. For example, Win9x does not support Unicode, whereas Windows NT and later support it completely. In such a scenario, you cannot use char or wchar_t directly in the code. The solution is to use generic text mappings, which replace the standard char or LPSTR types with the generic TCHAR or LPTSTR types. In addition, the _T() or _TEXT() macro prefixes "L" to a literal when a wide-character string is required. These types and macros map to different types and functions depending on whether you compile with _UNICODE, _MBCS, or neither defined.


// Simplified view of the generic text mappings.
#ifdef _UNICODE
typedef wchar_t TCHAR;
#else
typedef char TCHAR;
#endif
#ifdef _UNICODE
#define _T(x) L ## x
#else
#define _T(x) x
#endif
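The same mechanism can be tried on any compiler with a small stand-alone sketch. The names MY_UNICODE, my_tchar, MY_T() and my_tcslen() are hypothetical stand-ins for _UNICODE, TCHAR, _T() and _tcslen(); they are not part of any Windows header.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical stand-ins for the generic text mappings.
// Compile with -DMY_UNICODE to get the wide-character build.
#ifdef MY_UNICODE
typedef wchar_t my_tchar;
#define MY_T(x) L##x
inline std::size_t my_tcslen(const my_tchar* s)
{
    assert(s != nullptr);
    return std::wcslen(s);
}
#else
typedef char my_tchar;
#define MY_T(x) x
inline std::size_t my_tcslen(const my_tchar* s)
{
    assert(s != nullptr);
    return std::strlen(s);
}
#endif

// The same source line compiles to a narrow or a wide string.
inline std::size_t GreetingLength()
{
    const my_tchar* greeting = MY_T("hello");
    return my_tcslen(greeting);
}
```

GreetingLength() returns 5 in both builds; only the size of each character, and therefore of the literal in memory, changes.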

The following table lists the basic data types used to represent null-terminated strings on the Windows platform.

Type      Meaning
char      Single-byte character
wchar_t   Wide (Unicode) character
TCHAR     Generic character based on preprocessor settings: char or wchar_t
WCHAR     Unicode character, i.e. wchar_t
LPSTR     Null-terminated string of char, i.e. char*
LPCSTR    Constant null-terminated string of char, i.e. const char*
LPWSTR    Null-terminated Unicode string, i.e. wchar_t*
LPCWSTR   Constant null-terminated Unicode string, i.e. const wchar_t*
LPTSTR    Null-terminated string of TCHAR, i.e. TCHAR*
LPCTSTR   Constant null-terminated string of TCHAR, i.e. const TCHAR*
_T(x)     Prefixes "L" to the literal in a Unicode build

The following implementation shows a few examples of these strings. It is worth noting that LPSTR, LPCSTR, LPWSTR, LPCWSTR, LPTSTR and LPCTSTR each expand to a pointer to char or a pointer to wchar_t, constant or non-constant (see the table above for the respective meanings). Since they are raw pointers, memory allocation and de-allocation must be handled by the developer. For example, if you allocate the string on the heap, you have to delete it yourself (in the following implementation, the strings are not allocated on the heap).


#include <windows.h>
#include <tchar.h>
#include <atlstr.h>   // CString
#include <cstring>
#include <cwchar>
#include <iostream>
#include <string>

void ExampleCStyleStrings()
{
    //
    // C-style strings (null-terminated).
    //

    // Single-byte character set.
    // Note: these are raw pointers; handle allocation and
    //       deallocation appropriately.
    // A string literal is constant, so a writable char* needs a buffer.
    char buffer[] = "This is a test string";
    LPSTR str1 = buffer;                     // char*
    LPCSTR str2 = "This is a test string";   // const char*
    std::cout << "length=" << strlen(str2) << std::endl;

    // Wide character set.
    // Note: these are raw pointers; handle allocation and
    //       deallocation appropriately.
    // Note the usage of "L" to indicate a wide string.
    wchar_t wbuffer[] = L"This is a test string";
    LPWSTR wstr1 = wbuffer;                    // wchar_t*
    LPCWSTR wstr2 = L"This is a test string";  // const wchar_t*
    std::cout << "length=" << wcslen(wstr2) << std::endl;

    // Single-byte or wide, based on preprocessor settings.
    // Note: these are raw pointers; handle allocation and
    //       deallocation appropriately.
    // Note the usage of the _T() macro to prefix "L" in Unicode builds.
    TCHAR tbuffer[] = _T("This is a test string");
    LPTSTR tstr1 = tbuffer;                       // char* or wchar_t*
    LPCTSTR tstr2 = _T("This is a test string");  // const char* or
                                                  // const wchar_t*
    std::cout << "length=" << _tcslen(tstr2) << std::endl;

    // Standard library.
    // Note: classes, so no worry about allocation/deallocation
    //       (no raw pointers).
    // The following are constructor calls.
    std::string str = "This is a test string.";    // string of char
    std::wstring wstr = L"This is a test string."; // string of wchar_t
    std::cout << "length=" << wstr.size() << std::endl;

    // MFC/ATL (platform specific, mostly Windows).
    // Note: class, so no worry about allocation/deallocation
    //       (no raw pointers).
    // The following are constructor calls. CString stores TCHAR, so a
    // literal of the other width is converted during construction.
    CString cstr1 = "This is a test string.";     // from a single-byte literal
    CString cstr2 = L"This is a test string.";    // from a wide literal
    CString cstr3 = _T("This is a test string."); // from a TCHAR literal
    std::cout << "length=" << cstr3.GetLength() << std::endl;
}


In addition, the above implementation shows the std::string and std::wstring data structures provided by the Standard Library. std::string is a string of char, whereas std::wstring is a string of wchar_t. The advantage of using string/wstring is that you no longer deal with raw pointers but with class objects, so allocation and de-allocation are no longer your concern: they are handled internally by the string/wstring classes. Another advantage is that the Standard Library is available on all platforms.
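The ownership point can be illustrated with a short, portable sketch. The function names MakeGreeting(), MakeWideGreeting() and LengthViaCApi() are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Returns a new string by value. The std::string owns its buffer, so the
// caller has nothing to free.
std::string MakeGreeting(const std::string& name)
{
    std::string result = "Hello, ";
    result += name;   // the internal buffer grows as needed
    return result;
}

// Wide counterpart built on std::wstring (wchar_t elements).
std::wstring MakeWideGreeting(const std::wstring& name)
{
    return L"Hello, " + name;
}

// c_str() exposes the null-terminated view that C-style APIs expect.
std::size_t LengthViaCApi(const std::string& s)
{
    const std::size_t n = std::strlen(s.c_str());
    assert(n <= s.size());   // strlen stops at the first '\0'
    return n;
}
```

Compare this with a function returning char*: there the caller would have to know whether and how to release the memory.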
The CString data structure shown in the above example is provided by MFC and ATL (and so it is mostly available on the Windows platform only). A CString can be created from either an MBCS or a Unicode string. Because CString is based on a class template, its code is completely available in header files, which allows using CString even in non-MFC applications (by including atlstr.h).
Conversion Macros
ATL (atlconv.h) provides a set of very useful macros for converting between different character formats. The basic form of these macros is X2Y(), where X is the source format and Y is the destination format. The possible conversion formats are shown in the table below.

Abbreviation   Meaning
A              ANSI or MBCS character string, LPSTR
W              Wide (Unicode) character string, LPWSTR
T              Generic character string based on preprocessor settings, LPTSTR
OLE            OLE character string (typically wide), LPOLESTR
C              Conversion to a constant string

For example, A2W converts an LPSTR to an LPWSTR, and so on. Some of the macros are given in the table below.

Macro   From     To
A2W     LPSTR    LPWSTR
W2A     LPWSTR   LPSTR
A2CW    LPSTR    LPCWSTR
W2CA    LPWSTR   LPCSTR
T2A     LPTSTR   LPSTR
A2T     LPSTR    LPTSTR
T2W     LPTSTR   LPWSTR
W2T     LPWSTR   LPTSTR
T2CA    LPTSTR   LPCSTR
A2CT    LPSTR    LPCTSTR
T2CW    LPTSTR   LPCWSTR
W2CT    LPWSTR   LPCTSTR

When using the above string conversion macros, you need to place the USES_CONVERSION macro at the beginning of your function. It declares a few local variables that the macros use during the conversion. The converted string is allocated on the stack (via _alloca) and is released only when the function returns, so when you use these macros in a tight loop you need to be aware that the stack keeps growing, which can hurt efficiency or even overflow the stack. For the same reason, you should not use these macros in the return statement of your method: you would be returning a pointer to memory that is local to the function. Instead, create a copy of the string and return the copy to the caller.
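The "return a copy" advice can be sketched without ATL. The function below is a hypothetical, ASCII-only stand-in for A2W (it is not how ATL converts text); real code would perform the conversion with the ATL macros or MultiByteToWideChar() and then copy the result into an owning object such as std::wstring or CString before returning.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for A2W that widens 7-bit ASCII only.
// Returning std::wstring by value hands the caller an owned copy,
// instead of a pointer into this function's stack frame.
std::wstring AsciiToWide(const char* narrow)
{
    assert(narrow != nullptr);
    std::wstring wide;
    for (const char* p = narrow; *p != '\0'; ++p) {
        const unsigned char c = static_cast<unsigned char>(*p);
        assert(c < 0x80 && "this sketch handles ASCII only");
        wide.push_back(static_cast<wchar_t>(c));
    }
    return wide;
}
```

Because the result is an object rather than a pointer to a temporary buffer, it remains valid for as long as the caller keeps it.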
End of String
In C and C++, the end of a string is marked by a character whose value is zero. Such strings are commonly referred to as null-terminated strings. When the string is based on SBCS, the terminator is 8 zero bits ('\0' or 0x00). In Unicode strings, the terminator is 16 zero bits (0x0000). All the string types we have discussed so far are null-terminated, or C-style, strings.
Programming languages like Pascal or Basic follow a different convention from C and C++ for determining the end of a string. In Pascal or Basic, the length of the string is stored as a prefix in front of the actual characters. This way, finding the length of a string takes constant time in these languages, whereas it takes linear time in C or C++. However, a length prefix typically requires more memory than a single null terminator at the end.
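The two conventions can be compared side by side in a small sketch. It assumes the classic short-string layout with a single length byte (so at most 255 characters); the PascalString type and the helper names are made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Pascal-style string: one length byte followed by the characters.
// Assumption: the classic short-string layout, at most 255 characters.
struct PascalString {
    unsigned char length;
    char data[255];
};

PascalString MakePascalString(const char* s)
{
    assert(s != nullptr);
    const std::size_t n = std::strlen(s);
    assert(n <= 255);
    PascalString p = {};
    p.length = static_cast<unsigned char>(n);
    std::memcpy(p.data, s, n);   // no terminator is stored
    return p;
}

// Constant time: the length is read from the prefix.
std::size_t PascalLength(const PascalString& p)
{
    return p.length;
}

// Linear time: the terminator has to be searched for.
std::size_t CLength(const char* s)
{
    assert(s != nullptr);
    std::size_t n = 0;
    while (s[n] != '\0') {
        ++n;
    }
    return n;
}
```

Both functions report the same length for the same text; they differ only in how much work is needed to find it.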
Automation Data-Types
The following table lists the basic data types used by automation methods on the Windows platform.

Type        Meaning
OLECHAR     Unicode character (wchar_t), or char if OLE2ANSI is defined
LPOLESTR    String of OLECHAR, i.e. OLECHAR*
LPCOLESTR   Constant string of OLECHAR, i.e. const OLECHAR*
OLESTR(x)   Prefixes "L" to the literal to make it an LPCOLESTR
BSTR        Length-prefixed string based on OLECHAR, i.e. OLECHAR*

BSTR is a hybrid between a Pascal-style and a C-style string: it is both null-terminated and length-prefixed. It is used by automation data manipulation methods and COM interfaces, where a string must be transferable across languages (say, from Visual C++ to Visual Basic). On the Windows platform, it is a Unicode string and is declared as a pointer to OLECHAR.
typedef OLECHAR FAR* BSTR;
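The memory layout behind that pointer can be pictured with a portable sketch. The helpers SketchAllocString(), SketchStringLen() and SketchFreeString() are hypothetical imitations written for this article; they use char16_t and malloc() and are not the real SysAllocString() family. The sketch assumes a 4-byte prefix that stores the length in bytes, excluding the terminator.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Hypothetical helpers that mimic the BSTR memory layout with char16_t:
//   [4-byte length in bytes][characters...][terminating 0]
//                            ^ the returned pointer points here
char16_t* SketchAllocString(const char16_t* src)
{
    assert(src != nullptr);
    std::uint32_t chars = 0;
    while (src[chars] != 0) {
        ++chars;
    }
    const std::uint32_t bytes =
        static_cast<std::uint32_t>(chars * sizeof(char16_t));

    unsigned char* block = static_cast<unsigned char*>(
        std::malloc(sizeof(std::uint32_t) + bytes + sizeof(char16_t)));
    if (block == nullptr) {
        return nullptr;
    }
    std::memcpy(block, &bytes, sizeof bytes);   // length prefix
    // Copy the characters together with the terminating 0.
    std::memcpy(block + sizeof bytes, src, bytes + sizeof(char16_t));
    return reinterpret_cast<char16_t*>(block + sizeof bytes);
}

// Reads the character count from the prefix in constant time.
std::uint32_t SketchStringLen(const char16_t* s)
{
    if (s == nullptr) {
        return 0;
    }
    std::uint32_t bytes = 0;
    std::memcpy(&bytes,
                reinterpret_cast<const unsigned char*>(s) - sizeof bytes,
                sizeof bytes);
    return bytes / sizeof(char16_t);
}

void SketchFreeString(char16_t* s)
{
    if (s != nullptr) {
        std::free(reinterpret_cast<unsigned char*>(s) - sizeof(std::uint32_t));
    }
}
```

The sketch also shows why a plain wide literal is not a valid BSTR: nothing precedes the literal in memory, so a length lookup would read unrelated bytes.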
The following sample code shows the usage of BSTR. A BSTR is created with SysAllocString() and destroyed with SysFreeString(). SysAllocString() creates the BSTR from a wide-character string (on the Windows platform) and stores the length in front of the characters.



#include <windows.h>
#include <tchar.h>
#include <comutil.h>   // _com_util::ConvertStringToBSTR
#include <cstdio>

void ExampleBSTR()
{
    //
    // BSTR:
    //       Unicode character set on Windows.
    //       Null-terminated as well as length-prefixed string.
    //       (Though only the length prefix is used.)
    //

    BSTR bstr = NULL; // OLECHAR*

    // The following way of creating a BSTR is incorrect,
    // because the literal is not preceded by a length prefix.
    // bstr = L"This is a test string.";

    // Create the string using SysAllocString.
    // Remember: the characters are preceded by the length of the string.
    bstr = SysAllocString(L"This is a test string.");
    // Destroy the string using SysFreeString.
    SysFreeString(bstr);

    // SysAllocString cannot create a BSTR directly from a
    // single-byte string.
    //bstr = SysAllocString("This is a test string.");
    bstr = _com_util::ConvertStringToBSTR("This is a test string.");

    // Extracting the characters.
    // Direct cast from BSTR to LPCTSTR
    // (ONLY WORKS IN UNICODE MODE, where TCHAR is wchar_t).
    TCHAR tstr[256];
    _stprintf_s(tstr, _T("%s"), (LPCTSTR)bstr);
    // std::cout would print a wide string as a pointer value,
    // so use the generic-text printf instead.
    _tprintf(_T("TCHAR[] = %s\n"), tstr);

    // Remember to destroy the BSTR: it is just a pointer.
    SysFreeString(bstr);
}



As BSTR is again a raw pointer, wrappers like _bstr_t and CComBSTR have been introduced around it. _bstr_t is a reference-counted wrapper around BSTR. It can be created from both single-byte and wide strings, and its conversion operators give access to the underlying C-style string. For the BSTR itself, it provides GetBSTR(), which returns the wrapped BSTR without copying it; copy(), which returns a deep copy that the caller must free; and GetAddress(), which frees the current string and returns the address of the BSTR so that it can be filled in as an output parameter. Because the BSTR may be shared between several _bstr_t objects, this access has to be used with care, which makes _bstr_t less handy when dealing with COM interfaces. _bstr_t is part of the comsupp.lib library and is declared in the comutil.h file. The following code demonstrates a typical usage of the _bstr_t wrapper.



#include <windows.h>
#include <tchar.h>
#include <comutil.h>   // _bstr_t
#include <cstdio>
#include <iostream>

void Example_bstr_t()
{
    //
    // _bstr_t: Wrapper on BSTR.
    //

    // Create from a single-byte or a wide string.
    _bstr_t bstrWrapper = "This is a test string.";
    bstrWrapper = L"This is a test string.";

    // Create from a BSTR: the assignment operator
    // will do the deep copy, so tmp must still be freed.
    BSTR tmp = SysAllocString(L"This is a test string.");
    bstrWrapper = tmp;
    SysFreeString(tmp);

    // Pick the single-byte view explicitly for std::cout.
    std::cout << "_bstr_t = "
              << static_cast<const char*>(bstrWrapper) << std::endl;

    // Extracting the characters.
    // Direct conversion from _bstr_t to LPCTSTR (WORKS IN BOTH
    // ANSI & UNICODE MODE).
    TCHAR tstr[256];
    _stprintf_s(tstr, _T("%s"), (LPCTSTR)bstrWrapper);
    _tprintf(_T("TCHAR[] = %s\n"), tstr);

    // Extracting the BSTR.
    // Get a deep copy of the string; the caller frees it.
    BSTR bstr = bstrWrapper.copy();
    SysFreeString(bstr);
    // Take control of the underlying BSTR in bstrWrapper.
    bstr = bstrWrapper.Detach();
    // By now bstrWrapper is empty.
    SysFreeString(bstr);
}
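What "reference-counted" means for such a wrapper can be shown with a portable sketch. SharedText is a hypothetical class built on std::shared_ptr; it only illustrates the idea of several objects sharing one buffer and is not how _bstr_t is implemented.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical wrapper: copies share one immutable buffer, in the spirit
// of a reference-counted string. std::shared_ptr does the counting.
class SharedText {
public:
    explicit SharedText(const std::u16string& text)
        : data_(std::make_shared<const std::u16string>(text)) {}

    // Number of SharedText objects currently sharing the buffer.
    long Owners() const { return data_.use_count(); }

    // Deep copy: the caller gets an independent string.
    std::u16string Copy() const
    {
        assert(data_ != nullptr);   // a moved-from object has no buffer
        return *data_;
    }

    bool SharesBufferWith(const SharedText& other) const
    {
        return data_ == other.data_;
    }

private:
    std::shared_ptr<const std::u16string> data_;
};
```

Copying a SharedText is cheap because only the count changes; the buffer is released when the last owner goes away.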


Another wrapper around BSTR, CComBSTR, is provided by ATL. Like _bstr_t, CComBSTR can be created from both single-byte and wide strings. Unlike _bstr_t, it is not reference counted: each CComBSTR owns its BSTR exclusively and converts implicitly to BSTR. This makes it more natural to use as a BSTR and handier with COM methods. CComBSTR is implemented in the ATL headers, so its whole code is available by including the atlbase.h header file.



#include <windows.h>
#include <tchar.h>
#include <atlbase.h>   // CComBSTR
#include <comutil.h>   // _bstr_t
#include <iostream>

void ExampleCComBSTR()
{
    //
    // CComBSTR: Wrapper on BSTR.
    //

    // Create from a single-byte or a wide string.
    CComBSTR comBstr("This is a test string.");
    comBstr = L"This is a test string.";
    comBstr = _T("This is a test string.");
    comBstr.Empty();
    comBstr.Append(_T("This is a test string."));

    // Create from a BSTR: the assignment operator will
    // do the deep copy, so tmp must still be freed.
    BSTR tmp = SysAllocString(L"This is a test string.");
    comBstr = tmp;
    SysFreeString(tmp);

    // Append from a _bstr_t.
    _bstr_t bstrWrapper = _T("This is a test string.");
    comBstr.AppendBSTR(bstrWrapper);

    // The wrapped BSTR is wide, so print it with std::wcout.
    std::wcout << L"CComBSTR = " << comBstr.m_str << std::endl;

    // Extracting the characters: no direct way, go through _bstr_t.
    _bstr_t tmp2(comBstr.m_str);
    TCHAR tstr[256];
    _stprintf_s(tstr, _T("%s"), (LPCTSTR)tmp2);

    // Extracting the BSTR.
    // Get the underlying BSTR from the CComBSTR: no deep copy.
    BSTR bstr = NULL;
    bstr = comBstr;
    //SysFreeString(bstr); // No need to free, as comBstr is
                           // still referring to the same BSTR.

    // Deep copy: create a new BSTR.
    bstr = comBstr.Copy();
    SysFreeString(bstr); // Free it
}
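The three ways of getting at a wrapped string (borrowing it, copying it, and taking it over) differ only in who must free the memory afterwards. The sketch below contrasts them with a hypothetical OwnedCString class built on malloc(); the names Get(), Copy() and Detach() echo the wrapper methods used in this article, but the class is only an illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical owning wrapper around a malloc'ed C string, to contrast
// borrowing (Get), deep copy (Copy) and ownership transfer (Detach).
class OwnedCString {
public:
    explicit OwnedCString(const char* s) : p_(Duplicate(s)) {}
    ~OwnedCString() { std::free(p_); }
    OwnedCString(const OwnedCString&) = delete;
    OwnedCString& operator=(const OwnedCString&) = delete;

    // Borrow: still owned by the wrapper, the caller must not free it.
    const char* Get() const { return p_; }

    // Deep copy: the caller owns the result and must std::free it.
    char* Copy() const { return Duplicate(p_); }

    // Transfer: the wrapper becomes empty, the caller must std::free
    // the result.
    char* Detach()
    {
        char* out = p_;
        p_ = nullptr;
        return out;
    }

private:
    static char* Duplicate(const char* s)
    {
        if (s == nullptr) {
            return nullptr;
        }
        const std::size_t n = std::strlen(s) + 1;
        char* copy = static_cast<char*>(std::malloc(n));
        assert(copy != nullptr);
        std::memcpy(copy, s, n);
        return copy;
    }

    char* p_;
};
```

The same rule of thumb applies to the BSTR wrappers: a borrowed pointer is never freed by the caller, while a copied or detached one always is.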



Reviewed by Sourabh Soni on Monday, September 28, 2009
