This post originated from an RSS feed registered with Ruby Buzz
by Daniel Berger.
Original Post: The strcpy function is dead! Long live memcpy!
Feed Title: Testing 1,2,3...
Feed URL: http://djberg96.livejournal.com/data/rss
Feed Description: A blog on Ruby and other stuff.
I got a very interesting bug report from Brian Marick (yes, that one) for the win32-clipboard package. He reported that Unicode characters with null bytes in them (e.g. the Tibetan character 0x0F00) were causing the string to be terminated prematurely.
It turns out the problem was with strcpy() and strlen(). Here are the original two lines that caused the problems:
At first I thought that Microsoft's _tcslen() and _tcscpy() functions would Do The Right Thing™. But, no, they didn't work.
Given that Unicode characters can contain null bytes (a fact, btw, which I was unaware of until now), why would I ever use strcpy() again, given its inability to handle Unicode properly?
That's not a bug in Windows, it's a bug in the code. strcpy() is only for single byte character sets. For multibyte character sets, use _mbscpy. strlen() works correctly for both single byte and multibyte character strings, but for Unicode, you need to use wcscpy() and wcslen() (wcs = wide character string). Wide characters are 16 bits on Windows (32 bits on OS X and other Unixes), so yeah, some characters may have null bytes in them.
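To make the truncation concrete, here is a minimal sketch in portable C (it is not the win32-clipboard code). It compares what strlen() sees when it is handed wide-character data with what wcslen() reports for the same string. The helper names bytes_before_first_nul and wide_chars are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: what strlen() effectively reports when it is
 * handed wide-character data, i.e. the bytes before the first zero byte. */
size_t bytes_before_first_nul(const wchar_t *ws)
{
    assert(ws != NULL);
    return strlen((const char *)ws);
}

/* What wcslen() reports: wide characters before the terminating L'\0'. */
size_t wide_chars(const wchar_t *ws)
{
    assert(ws != NULL);
    return wcslen(ws);
}
```

On a little-endian machine, a string that starts with 0x0F00 makes the first helper return 0, because the low byte of 0x0F00 is zero and is stored first. The second helper still returns the real character count.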
_tcslen() and _tcscpy() are not functions, but macro aliases for the versions of these functions that match your program's default character set, as determined by whether or not _UNICODE and/or _MBCS are #defined. Likewise, TCHAR is an alias for either CHAR or WCHAR.
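The aliasing can be imitated in a few lines. This is only a simplified sketch of the idea, not the contents of the real header: MY_UNICODE, my_tchar, MY_T and my_tcslen are hypothetical stand-ins for _UNICODE, TCHAR, _T and _tcslen.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical stand-ins; the real definitions live in <tchar.h>. */
#ifdef MY_UNICODE
typedef wchar_t my_tchar;
#define MY_T(s) L##s
#define my_tcslen wcslen
#else
typedef char my_tchar;
#define MY_T(s) s
#define my_tcslen strlen
#endif

/* Length in my_tchar units; which function runs is fixed at compile time. */
size_t text_length(const my_tchar *s)
{
    assert(s != NULL);
    return my_tcslen(s);
}
```

Because the choice is made by the preprocessor, the same call site compiles to either strlen() or wcslen() depending on which symbol is defined when the file is built.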
This should do what was intended:

WCHAR* buffer = (WCHAR*) GlobalAlloc(GPTR, (wcslen(data) + 1) * sizeof(WCHAR));
wcscpy(buffer, data);
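For anyone who wants to try the same sizing logic outside Windows, here is a rough portable equivalent. It is a sketch under assumptions: malloc() stands in for GlobalAlloc(), wchar_t for WCHAR, and copy_wide is a made-up helper name. It uses memcpy() with an explicit byte count, in the spirit of the post's title.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: duplicate a wide string, sizing the buffer in
 * bytes the same way the GlobalAlloc() call above does.
 * Returns NULL on allocation failure; the caller frees the result. */
wchar_t *copy_wide(const wchar_t *data)
{
    assert(data != NULL);
    size_t bytes = (wcslen(data) + 1) * sizeof(wchar_t); /* +1 for L'\0' */
    wchar_t *buffer = malloc(bytes);
    if (buffer != NULL) {
        memcpy(buffer, data, bytes); /* copies embedded zero bytes too */
    }
    return buffer;
}
```

The length comes from wcslen(), so a zero byte inside a character such as 0x0F00 does not cut the copy short.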
First... how the heck did my post end up in a forum?! Anyhoo...
> That's not a bug in Windows, it's a bug in the code.
> strcpy() is only for single byte character sets.
I understand that. See below.
> For multibyte character sets, use _mbscpy. strlen() works
> correctly for both single byte and multibyte character
> strings, but for Unicode, you need to use wcscpy() and
> wcslen() (wcs = wide character string). Wide characters
> are 16 bits on Windows (32 bits on OS X and other Unixes),
> so yeah, some characters may have null bytes in
> them.
>
> _tcslen() and _tcscpy() are not functions, but macro
> aliases for the versions of these functions that match
> your program's default character set, as determined
> by whether or not _UNICODE and/or _MBCS are #defined.
> Likewise, TCHAR is an alias for either CHAR or WCHAR.
I understand that they're macros. The UNICODE constant is defined (if you look at the source) and MBCS is not, which means wcslen is being used behind the scenes according to the MSDN docs. Maybe I should just default to defining MBCS for all my C extensions. Is there a downside to that?
> This should do what was intended:
>
> WCHAR* buffer = (WCHAR*) GlobalAlloc(GPTR, (wcslen(data) + 1) * sizeof(WCHAR));
> wcscpy(buffer, data);
Unless there's a drawback to the code I'm using now, I'll leave it alone. :)