Create a UTF8 C-runtime
This suggestion has been migrated to Developer Community. Please use the link below to view the current status.
In the past few years I've worked on cross-platform apps, mostly between Windows and Linux. The biggest stumbling block is path and file names. In our latest project it got so clunky that we decided to drop all the UNICODE code and compile the Windows side in MBCS. One thing that would help would be a C runtime that works with UTF-8 (one that drops the underscores would be another huge help).
Thank you for your votes so far on this suggestion. We are currently reviewing this request. As we have more information, we will share it with you here. In the meantime, please make sure you comment here to capture the requirements you have for a UTF-8 capable CRT.
— Visual C++ Team
Guys, look at the latest UCRT source code that comes with Win10 1809 SDK. Major UTF-8 work done there!
Christopher Bennet commented
Today I discovered that VS2017 15.5.7 is saving .cpp files as UTF-16, but apparently earlier versions (15.4?) were saving them as UTF-8 at some point. Some files in the same project still have the old UTF-8 format. This seems like a breaking change. Could we have a switch to control the file format?
I know I can save it as UTF-8 using File->SaveAs but I shouldn't have to convert files after the fact.
See: "Visual Studio 2017 creates UTF-16 source code files"
I have noticed that your team has added some UTF-8 capabilities in the Win10 Insider RS4, but I think they are far from enough.
1. Add something like _CrtSetFileApisToUTF8 to the CRT (!!!MOST URGENT!!!), because checking the current Beta option would break most legacy applications.
2. _read and _write should both FIX their double-translation code: _read doesn't have any, and the one in _write currently only works for OEM-ANSI.
3. C/C++ locales, especially the C++ ones, should NEVER throw when C/C++ uses UTF-8 locales.
4. Add something like Linux's C.UTF-8 as an INVARIANT UTF-8 LOCALE, because using variant locales widely in programs might compromise stability and even security.
I'd be happy if they just implemented this: https://connect.microsoft.com/VisualStudio/feedback/details/3140796/add-setfileapistoutf8-function
GB Clark commented
I put 2 votes in. Is there a uservoice for the UTF-8 CRT anywhere? I'll go vote on that also.
Can we please get C++ header(s) for this too? Overloading would be much better than the annoying A/W macros that break all sorts of 3rd party code. Broken-down headers could possibly help reduce compile times too.
Just now, and once again, I ran across a related issue: std::locale names differ between the GNU C++ standard library and the one provided by Microsoft. The standard leaves the names implementation-defined, which causes cross-platform portability problems (e.g. "en_US" and "en_US.utf8" throw exceptions in msvc).
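One way to live with the differing names is to probe several spellings and fall back to the classic "C" locale instead of letting the exception escape. This is only a minimal sketch: the helper name `make_locale_or_classic` is made up for illustration, and which candidate names succeed depends entirely on the platform and the locales installed on it.

```cpp
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not part of any library): try each candidate name in
// order and return the first locale the standard library accepts.
// std::locale's constructor throws std::runtime_error for a name it does not
// recognise, so the exception is caught here and the next name is tried.
std::locale make_locale_or_classic(const std::vector<std::string>& candidates) {
    for (const std::string& name : candidates) {
        try {
            return std::locale(name.c_str());
        } catch (const std::runtime_error&) {
            // This implementation does not know the name; try the next one.
        }
    }
    // The "C" locale exists on every implementation.
    return std::locale::classic();
}
```

A caller might pass something like `{"en_US.UTF-8", "en_US.utf8", "en-US"}` and accept whatever comes back. The spellings in that list are examples, not a statement about what any particular runtime accepts.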
Данил Ишков commented
I'm having a lot of problems with parameter passing in C++. When int main receives const char* argv, the strings should be UTF-8 if the user passes non-Latin characters.
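Until main delivers UTF-8 directly, a common workaround on Windows is to take the arguments as UTF-16 (for example through wmain) and convert each one to UTF-8 before handing it to portable code. The conversion step can be sketched as below. The function name `utf16_to_utf8` is hypothetical, and the code is written by hand so that it compiles without `<windows.h>`; it is an illustration of the encoding rules, not a replacement for the system conversion routines.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: convert a UTF-16 string to UTF-8.
// Throws std::invalid_argument on an unpaired surrogate.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: it must be followed by a low surrogate.
            if (i + 1 >= in.size())
                throw std::invalid_argument("truncated surrogate pair");
            char32_t lo = in[i + 1];
            if (lo < 0xDC00 || lo > 0xDFFF)
                throw std::invalid_argument("unpaired high surrogate");
            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
            ++i;  // the low surrogate has been consumed
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            throw std::invalid_argument("unpaired low surrogate");
        }

        // Emit 1 to 4 bytes depending on the size of the code point.
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```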
I'm a bit puzzled by the "Under Review" comment. The requirements are really simple; create a CRT and C++ standard library that treats "char" as representing a member of the UTF-8 character set. It's no more complicated than that.
Please make it work with both the C and C++ standard libraries. It would be great if std::regex worked properly with UTF-8. The stream classes should also be able to take UTF-8 file paths.
Stephen Doiel commented
Many API functions have two variants (e.g.: MessageBoxW, MessageBoxA).
In my experience most applications (there are rare exceptions) would work without change if MessageBoxA interpreted the string as UTF-8.
In my experience, when updating a legacy application for internationalization it is easiest to keep all strings based on 8-bit chars and just change their interpretation to UTF-8. For the API calls to work correctly I've had to convert the UTF-8 to UTF-16 before making Windows calls. If Windows interpreted those strings directly as UTF-8 it would make life easier.
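The conversion step described above looks roughly like the sketch below. On Windows this is normally done with MultiByteToWideChar and CP_UTF8; the version here is hand-written so that it compiles anywhere, and the function name `utf8_to_utf16` is made up for illustration. Since wchar_t is 16 bits wide on Windows, the resulting code units are what the W variants of the API expect.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode UTF-8 and re-encode it as UTF-16.
// Throws std::invalid_argument on malformed input (bad lead or continuation
// bytes, truncated sequences, overlong forms, surrogates, values > U+10FFFF).
std::u16string utf8_to_utf16(const std::string& in) {
    // Smallest code point that legitimately needs a sequence of each length.
    static const char32_t min_for_len[5] = {0, 0, 0x80, 0x800, 0x10000};

    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b0 = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t len;
        if (b0 < 0x80)                { cp = b0;        len = 1; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Fold in the 6 payload bits of every continuation byte.
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (b & 0x3F);
        }

        if (cp < min_for_len[len] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point in UTF-8 input");

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
        i += len;
    }
    return out;
}
```

Every call site that passes a string to a W function needs this extra step, which is the overhead the comment above would like to see disappear.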
I recommend reading utf8everywhere.org
I'm actually working on a proposal for C2x to create a utf8 type, that's typedef'd as `unsigned char *`
So yeah, this isn't really the place for Microsoft to try adding (yet another) competing implementation of UTF-8 strings.
But I do agree with your general point: UTF-16, UTF-32, and especially UCS-2 NEED to go.
As your team said, you have fixed tmpfile() to work properly. But since %temp% may contain Unicode characters on ReFS or NTFS without short file names (or OEM characters while the file APIs are set to ANSI), tmpfile() may still not work.
I've read the source code of Universal CRT. Strings are marshalled as ANSI or OEM depending on what AreFileApisANSI() returns. Why not just add a _setcrtcp() and replace the existing code with _getcrtcp()?
I once read the documentation for WideCharToMultiByte in MSDN Library 2003. It said that code page 65001 had been supported since Windows 98 and Windows NT 4.0 (since SP4, as far as I tested), that is, since 1998, and that ANSI (MBCS) or OEM code pages should be for temporary use only. It is ironic that even now many OSS libraries, including OpenCV, support only ANSI (MBCS) code pages, simply because refactoring them to call wide-character routines like _wfopen would make the programs too complex and unmaintainable. VS2013 once deprecated and removed the MBCS MFC library but was criticized for it. C/C++ CAN DO WITHOUT WIDE CHARS, BUT NOT WITHOUT NARROW CHARS.
Although Windows 10 IoT Core is friendlier, I chose Raspbian because open-source C/C++ libraries support only MBCS on Win10. I will not choose Win10 unless Visual C++ supports UTF-8 directly through fopen, not the sucky _wfopen.
UTF-8 is useful for server applications. Nowadays Linux is replacing Windows more and more as a server OS, just because Windows C++ does not support UTF-8!
How awesome it is to just set LC_ALL to en_US.UTF-8 or zh_CN.UTF-8, and how awful it is to refactor whole programs to use the non-portable _wfopen and other _w's!
Some 3rd-party libraries, such as OpenCV, do not support WCHAR/TCHAR/_TCHAR/wchar_t, and it is hard to refactor them. Afaik, RedHat Linux has supported UTF-8 since 2000, and Ubuntu since 2004. So it would be very late, but still helpful, if Visual C++ also supported UTF-8.
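The "just set LC_ALL" workflow can be sketched as follows. The helper name `set_first_available_locale` is hypothetical, and the locale names a given C runtime accepts are an assumption that has to be checked on each target, which is why the function tries a list rather than a single name.

```cpp
#include <clocale>
#include <string>
#include <vector>

// Hypothetical helper: try each candidate in order and return the name that
// std::setlocale accepted, or an empty string if every candidate was rejected
// (in which case the program's locale is left as it was).
std::string set_first_available_locale(const std::vector<std::string>& candidates) {
    for (const std::string& name : candidates) {
        // std::setlocale returns a null pointer when the request cannot be honoured.
        if (std::setlocale(LC_ALL, name.c_str()) != nullptr) {
            return name;
        }
    }
    return std::string();
}
```

A program would call it once at startup, for instance with `{"en_US.UTF-8", "C.UTF-8", "C"}`, and afterwards keep using plain fopen and friends with char strings.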