How to work with UTF-8 or wchar_t characters?

I am trying to display messages with non-English characters, but for some reason they do not work as expected.

Found this code here:

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main()
{
    setlocale(LC_ALL, "");
    // setlocale(LC_ALL, "C.UTF-8"); // this also works

    wchar_t hello_eng[] = L"Hello World!";
    wchar_t hello_china[] = L"世界, 你好!";
    wchar_t hello_japan[] = L"こんにちは日本!";
    printf("%ls\n", hello_eng);
    printf("%ls\n", hello_china);
    printf("%ls\n", hello_japan);

    return 0;
}

The code above works like a charm on https://www.onlinegdb.com/ (it does not run on most other online compilers).

It also does not work as expected with Mbed OS 6.12 and either the GNU compiler (8-2018-q4-major) or ARMC 6.15, not even if I enable the standard printf() via mbed_app.json.
With ARMC 6.15 I also face the problem that printf does not output anything on my boards (MAX32630FTHR and Artemis Thing Plus). That is why I modified the code to use BufferedSerial instead, but even then only English characters print correctly, while non-English characters display wrong.

Mbed adapted code:

#include "mbed.h"
#include <stdio.h>
#include <wchar.h>
#include <locale.h>

BufferedSerial pc(USBTX, USBRX, 500000);
FileHandle *mbed::mbed_override_console(int fd) {
    return &pc;
} 

int main()
{
    setlocale(LC_ALL, "");

    ///setlocale(LC_ALL, "C.UTF-8"); // this also works

    wchar_t hello_eng[] = L"Hello World!\n";
    wchar_t hello_china[] = L"世界, 你好!\n";
    wchar_t hello_japan[] = L"こんにちは日本!\n";

    pc.write(hello_eng, sizeof(hello_eng));
    pc.write(hello_china, sizeof(hello_china));
    pc.write(hello_japan, sizeof(hello_japan));

    return 0;
}
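
A note on the adapted code above: pc.write() transmits the raw bytes of whatever buffer it is given, so passing a wchar_t array sends the wide code units themselves (typically 4 bytes each with GCC_ARM, which is worth confirming with sizeof(wchar_t)) rather than UTF-8. ASCII text may still look right because the extra zero bytes are usually ignored by the terminal, which would match the symptom described. A minimal sketch of one workaround, assuming each wchar_t holds a complete Unicode code point; wide_to_utf8 is a hypothetical helper, not part of Mbed OS:

```c
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/* Hypothetical helper (not an Mbed OS API): encodes a NUL-terminated wide
 * string as UTF-8 into dst. Assumes every wchar_t is one full code point.
 * Returns the number of bytes written (excluding the terminating NUL), or 0
 * if dst is too small or a code point is not valid Unicode. */
size_t wide_to_utf8(const wchar_t *src, char *dst, size_t dst_size)
{
    size_t out = 0;

    if (dst_size == 0) {
        return 0;
    }
    for (; *src != L'\0'; ++src) {
        uint32_t cp = (uint32_t)*src;
        size_t need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;

        /* Reject values outside Unicode and UTF-16 surrogates. */
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            return 0;
        }
        /* Keep one byte free for the terminating NUL. */
        if (out + need + 1 > dst_size) {
            return 0;
        }
        switch (need) {
        case 1:
            dst[out++] = (char)cp;
            break;
        case 2:
            dst[out++] = (char)(0xC0 | (cp >> 6));
            dst[out++] = (char)(0x80 | (cp & 0x3F));
            break;
        case 3:
            dst[out++] = (char)(0xE0 | (cp >> 12));
            dst[out++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            dst[out++] = (char)(0x80 | (cp & 0x3F));
            break;
        default:
            dst[out++] = (char)(0xF0 | (cp >> 18));
            dst[out++] = (char)(0x80 | ((cp >> 12) & 0x3F));
            dst[out++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            dst[out++] = (char)(0x80 | (cp & 0x3F));
            break;
        }
    }
    dst[out] = '\0';
    return out;
}
```

On the board, the converted buffer could then be sent with pc.write(buf, n), where n is the value returned by the helper, instead of passing the wchar_t array and its sizeof directly.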

Does anyone have an idea how I could work with UTF-8 or wchar_t in Mbed?

Hello Peter,

I’m on Linux, and none of the serial terminals mentioned by Simon is available for testing his UTF-8 example.
I think the results also depend on the font selected in the serial terminal program.

Hello Zoltan,

Thanks for the tip. I am using Arduino’s serial monitor and also PuTTY. Both display UTF-8 characters sent from Arduino code, so the serial monitor side seems to be OK.

Forgot to mention I am using Mbed Studio on a Windows 10 computer.
If I run the locale -a command in PowerShell inside my project folder, a ton of locales get listed, including C, C.utf8, POSIX, etc. That gives me the feeling these are not being confused with Windows(?) locales, though I am not sure about that.

If I also add

if (setlocale(LC_ALL, "C.utf8") == NULL) {
    printf("setlocale failed!\n");
  }

to my code, then it does not find the C.utf8 locale. Only “”, “C” and “POSIX” do not return NULL.

While by adding

cout << "LC_ALL: " << setlocale(LC_ALL, "") << endl;
cout << "LC_CTYPE: " << setlocale(LC_CTYPE, "POSIX") << endl;

I get the output

LC_ALL: C
LC_CTYPE: C

So to sum it up, it seems that on a Windows 10 computer the “C” locale is used even if you set “POSIX” or “”, and no other locales are recognized by the code even though locale -a lists many of them.
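
One possible explanation, offered as an assumption rather than something verified on these boards: locale -a reports the locales installed on the host PC, whereas setlocale() in the firmware is resolved by the C library linked into the target image, which may only ship “C” and “POSIX”. A minimal sketch for probing which names that library accepts; locale_is_available is a hypothetical helper name:

```c
#include <locale.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the C library accepts the given locale
 * name for LC_ALL, 0 if setlocale() rejects it, and -1 if the current locale
 * could not be saved. The previous locale is restored before returning. */
int locale_is_available(const char *name)
{
    char saved[256];
    const char *current = setlocale(LC_ALL, NULL); /* query only, no change */
    int ok;

    if (current == NULL || strlen(current) >= sizeof saved) {
        return -1;
    }
    /* Copy the name: the returned string may be overwritten by later calls. */
    strcpy(saved, current);

    ok = setlocale(LC_ALL, name) != NULL;
    setlocale(LC_ALL, saved);
    return ok;
}
```

Calling it on the target with names such as "C", "POSIX" and "C.utf8" and printing the results would show what the embedded library supports, independently of what the host lists.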

Would my originally posted code work if I installed Linux in a virtual machine on top of Windows 10 and coded in the Linux version of Mbed Studio?

Hi Peter,

I tested your code in my environment (Win10 and Mbed Studio 1.4.1). The wchar_t version does not work, but a UTF-8 string literal simply worked. See below.

Thanks,
Toyo

Thanks for trying to help me Toyo & Zoltan!

Meanwhile I also tried to get my hands dirty with Linux, so I installed Ubuntu 20.04 in a virtual machine on my Win10 computer. My code does not work even with the Linux version of Studio. So either setlocale() is meant for something different, or this might be a bug in Mbed OS.

I also tried Simon’s really old UTF-8 example (Mbed 2) that Zoltan suggested. It contains
#pragma import(__use_utf8_ctype)
However, Mbed Studio comes with ARMC 6.15, and Studio does not even compile it because of this error:

‘#pragma import’ is an ARM Compiler 5 extension, and is not supported by ARM Compiler 6 [-Warmcc-pragma-import]

So I configured Mbed Studio to use GCC, in which case I get only a warning:

Unknown pragma ignored clang(-Wunknown-pragmas)

However, that way the crucial line has no effect and there is no magic: we get the wrong characters.

Toyo’s solution is a good fit for outputting hard-coded text; however, I was envisioning user input in any language.
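
For user input, one option is to keep the received bytes as UTF-8 in a plain char buffer and decode them only when individual characters are needed (for example, to pick glyphs for a display). A minimal sketch of such a decoder, assuming the input really is UTF-8; utf8_decode is a hypothetical helper, not an Mbed OS or standard library function:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decodes one code point from the UTF-8 bytes at s
 * (at most len bytes available). Stores the code point in *cp and returns
 * the number of bytes consumed, or 0 if the sequence is truncated,
 * malformed, overlong, or not a valid Unicode scalar value. */
size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    size_t n, i;
    uint32_t value, min;

    if (len == 0) {
        return 0;
    }
    if (s[0] < 0x80) {                 /* single-byte (ASCII) */
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) { /* 2-byte sequence */
        n = 2; value = s[0] & 0x1F; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) { /* 3-byte sequence */
        n = 3; value = s[0] & 0x0F; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) { /* 4-byte sequence */
        n = 4; value = s[0] & 0x07; min = 0x10000;
    } else {
        return 0;                      /* stray continuation or invalid lead byte */
    }
    if (len < n) {
        return 0;                      /* sequence is cut off */
    }
    for (i = 1; i < n; ++i) {
        if ((s[i] & 0xC0) != 0x80) {
            return 0;                  /* not a continuation byte */
        }
        value = (value << 6) | (s[i] & 0x3F);
    }
    if (value < min || value > 0x10FFFF ||
        (value >= 0xD800 && value <= 0xDFFF)) {
        return 0;                      /* overlong form or invalid scalar */
    }
    *cp = value;
    return n;
}
```

Because a return value of 0 also covers truncated input, a receive loop could keep appending bytes from the serial port and retry until a complete sequence decodes.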

Hello again!

Since my original post I have managed to display Unicode characters and even emojis on the screen of my device, and those also display correctly in the serial monitor, which is great compared to where I started.

In the past few days I read further into the issue and possible C/C++ solutions. Based on my new knowledge I extended the Mbed code I posted in the first comment. The updated Mbed code is:

#include "mbed.h"
#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main() {
    setlocale(LC_ALL, "");
    
    wchar_t hello_eng[] = L"Hello World!";
    wchar_t hello_china[] = L"世界, 你好!";
    wchar_t hello_japan[] = L"こんにちは日本!";
    char plainString_asText[] = "ሄሎ ዓለም!"; //Local encoding, whatever that may be.
    char plainString_asHex[] = "\xe1\x88\x84\xe1\x88\x8e\x20\xe1\x8b\x93\xe1\x88\x88\xe1\x88\x9d\x21";
    wchar_t wideString[] = L"Përshëndetje botë!"; //Wide characters, usually UTF-16 or UTF-32.
    
    //char utf8String[] = u8"Всем привет!"; //UTF-8 encoding.   error: use of undeclared identifier 'u8'
    //char16_t utf16String[] = u"Përshëndetje botë!"; //UTF-16 encoding.   error: use of undeclared identifier 'u'
    //char32_t utf32String[] = U"Сәлемет пе әлем!"; //UTF-32 encoding.   error: use of undeclared identifier 'U'
    
    
    while (1) {
        
        printf("%ls\n", hello_eng);
        printf("%ls\n", hello_china);
        printf("%ls\n", hello_japan);
        printf("%s\n", plainString_asText);
        printf("%s\n", plainString_asHex);
        printf("%ls\n", wideString);
        
        //printf("%s\n", utf8String);
        //printf("%ls\n", (wchar_t*)utf16String); // corrupt output
        //printf("%ls\n", (wchar_t*)utf32String);

        wait_ms(2000);
    }
}

The above code works on Mbed Simulator (except for the commented lines, which do work on onlinegdb.com), but the same code just outputs garbage when compiled locally with the GCC_ARM compiler + Mbed Studio V1.4.1 + Mbed OS 6.12 on my Artemis board.
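
Regarding the commented-out lines: the u8, u and U literal prefixes were introduced with C11 and C++11, so an “undeclared identifier” error for them suggests the compiler was running in an older language mode (an inference from the error text, not something confirmed for the simulator). Where u8 literals are available, they should yield the same bytes as the hand-written hex string. A minimal sketch in C11 that checks this for the greeting used above; greetings_match is a hypothetical name:

```c
#include <stddef.h>
#include <string.h>

/* The same greeting spelled two ways: explicit UTF-8 bytes, and a C11 u8
 * literal built from universal character names. */
static const char greeting_hex[] =
    "\xe1\x88\x84\xe1\x88\x8e\x20\xe1\x8b\x93\xe1\x88\x88\xe1\x88\x9d\x21";
static const char greeting_u8[] = u8"\u1204\u120E \u12D3\u1208\u121D!";

/* Returns 1 when both spellings produce identical bytes, 0 otherwise. */
int greetings_match(void)
{
    return sizeof greeting_hex == sizeof greeting_u8 &&
           memcmp(greeting_hex, greeting_u8, sizeof greeting_hex) == 0;
}
```

Either array can be passed to printf("%s\n", ...) or written to the serial port as is, since both are ordinary char strings; no locale is involved in that path.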

So there are some inconsistencies, which do not help much.