Vanilla Posted January 2, 2016 (edited)

Hello metin2dev,

I wanted to ask for help regarding encodings. There was a lot of trouble when I migrated the client to Python 3.5, and one big issue is the encoding. I managed to make all the UI files etc. work with UTF-8. The results are clear: you can see everything, even characters like ö, ä, ü. In German they're very important.

So what's the big deal? The problem is the input. Every time I tried to input such characters, the text box stopped working. The reason is an encoding error from Python: it tells me that UTF-8 can't decode those bytes. I then checked the IME part of the client source. After some research I changed the whole encoding in the source to UTF-8, except for the WebBrowser project, which still sticks to CP_ACP, but that makes no difference here. The text box started working and there are no more encoding errors. Instead, as you can see in the picture, a weird result shows up.

As you can see, I'm debugging the client and added a line that prints the GetText result from the IME before handing it over to Python. That symbol appears whenever I try to type an ö character. I guess some encoding is being mixed up somewhere, and that's why the weird symbol appears. Maybe you have an idea what could have caused the error. As already stated, the whole client should use UTF-8 now.

Thank you for your time!

Best regards,
Vanilla

We are the tortured. We're not your friends. As long as we're not visible. We are unfixable.
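A minimal sketch of the reported failure, in plain Python (the byte value is an assumption: 0xF6 is 'ö' in windows-1252, the code page a German Windows IME typically delivers). A single such byte is not valid UTF-8, which produces exactly the "utf-8 can't decode these bytes" error described above:

```python
raw = b'\xf6'  # 'ö' as a single windows-1252 byte (assumed IME input)

# Decoding it as UTF-8 fails, because 0xF6 starts an incomplete
# multi-byte sequence in UTF-8:
try:
    raw.decode('utf-8')
except UnicodeDecodeError as e:
    print('utf-8 cannot decode:', e.reason)

# Decoding with the code page the bytes are actually in works fine:
print(raw.decode('cp1252'))  # ö
```

So the question is not whether the client *wants* UTF-8, but which encoding the bytes coming out of the IME really are.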
ds_aim Posted January 2, 2016

Locale.cpp:

int MULTI_LOCALE_CODE = 949;
Vanilla (Author) Posted January 2, 2016

I already checked that. It's changed:

locale.cpp

int MULTI_LOCALE_CODE = CP_UTF8;
Ken Posted January 2, 2016

As far as I know, Python started using Unicode strings in the 3.x versions. Did you try to convert it?

It's nice to hear this. Welcome back, Vanilla.

Kind Regards ~ Ken

Do not be sorry, be better.
BeHappy4Ever Posted January 2, 2016

Lol, Vanilla is back? <3 We missed you
Vanilla (Author) Posted January 2, 2016

I did. Every string gets encoded. These are the functions I use for string encoding (PyBytes_AS_STRING returns a borrowed pointer, so I always strdup the result before releasing the temporary bytes object):

char* PyString_AsString(PyObject* v)
{
	char* my_result = NULL;

	if (PyUnicode_Check(v))
	{
		// Owned reference
		PyObject* temp_bytes = PyUnicode_AsEncodedString(v, "utf-8", "ignore");
		if (temp_bytes != NULL)
		{
			my_result = PyBytes_AS_STRING(temp_bytes); // Borrowed pointer
			my_result = strdup(my_result);
			Py_DECREF(temp_bytes);
		}
	}
	else if (PyBytes_Check(v))
	{
		my_result = PyBytes_AS_STRING(v); // Borrowed pointer
		my_result = strdup(my_result);
	}
	else
	{
		PyObject* str_exc_type = PyObject_Repr(v);
		PyObject* pystr = PyUnicode_AsEncodedString(str_exc_type, "utf-8", "ignore");
		if (pystr != NULL)
			my_result = strdup(PyBytes_AS_STRING(pystr)); // copy before releasing pystr
		Py_XDECREF(str_exc_type);
		Py_XDECREF(pystr);
		Tracef("Error with encoding!!!\r\n");
	}

	return my_result;
}

PyString_AsStringUtils is identical except for the fallback branch, which only calls Tracef.

Maybe I should change the ignore flag to strict and see if errors occur :/
ds_aim Posted January 2, 2016

10 minutes ago, BeHappy4Ever said: Lol vanilla is back? <3 we missed you

Please, don't be a kid.
Ken Posted January 2, 2016

Hmm, I'm thinking about a solution for you.
ds_aim Posted January 2, 2016

You can use chardet to detect the encoding of a string, which would be one way to convert a list of them to Unicode. I'll check the py35 includes.
Vanilla (Author) Posted January 2, 2016

In this case I guess chardet won't help me. I wanted to know what happened to the string itself, and when I check the Python object I always get the same result: every string in Python 3 is a Unicode string. Checking them isn't necessary there; it's only necessary for files etc. So my Python code can only ever produce Unicode characters. The problem must be in the binary part :/

If I change the encoding in CIME::GetText to Latin (which can represent ö, ä, ü), the output for these characters is always ?, and no error is raised. The characters ß and ´ are gone too, though + and ` work as intended. When I change back to CP_UTF8, those characters also produce weird results.
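The two symptoms described above can be sketched in Python (byte values are assumptions for illustration). A '?' usually means a character was unmappable during an encode with a replacement policy, while a dropped or odd character usually means foreign bytes were decoded as UTF-8 with 'ignore' or 'replace':

```python
# 1) Encoding to a single-byte code page turns unmappable characters
#    into '?' (this is also what WideCharToMultiByte does with its
#    default replacement character):
text = '\u20acö'  # '€' has no latin-1 code point; 'ö' does
print(text.encode('latin-1', errors='replace'))  # b'?\xf6'

# 2) Decoding windows-1252 bytes as UTF-8 with 'replace' yields the
#    Unicode replacement character U+FFFD instead of an error:
bad = 'ö'.encode('cp1252').decode('utf-8', errors='replace')
print(bad, hex(ord(bad)))  # '\ufffd' 0xfffd
```

Neither path raises, which would explain seeing garbage instead of exceptions.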
Ken Posted January 2, 2016

char* PyString_AsString2(PyObject* pObj)
{
#if defined(__cplusplus) && __cplusplus > 199711L
	if (!pObj)
		return nullptr;

	char* pszResult2 = nullptr;
#else
	if (!pObj)
		return NULL;

	char* pszResult2 = NULL;
#endif

	PyObject* pNewObj = PyUnicode_AsUTF8String(pObj);
	char* pszResult = PyBytes_AsString(pNewObj);
	pszResult2 = strdup(pszResult);
	Py_DecRef(pNewObj);
	return pszResult2;
}

I didn't test it. I just encode the object and retrieve the bytes with PyBytes_AsString. If you want to be totally sure, check the return value of PyBytes_AsString: it returns NULL on failure (nullptr in the latest versions).

Kind Regards ~ Ken
Vanilla (Author) Posted January 2, 2016

I guess the problem is located in the IME part. I mean, every ö, ä, ü etc. works fine when I read it from documents (.py for example), but they won't work when I use them inside input elements, and the IME is the part that controls those. Have a look at the function PyObject* imeGetText(PyObject* poSelf, PyObject* poArgs), which only calls an IME function to retrieve the string. Right after that call the text is already mangled; I can't read it properly as either Latin or UTF-8. The whole IME now uses only UTF-8; I changed the encoding there completely. The text is broken even before I hand it to Python: the console output shown here is produced before Py_BuildValue is even called, so it should print the text fine, but it doesn't.

@Ken, your function does work, but it has no effect, so I changed it back to the original one. The problem isn't located there, I guess. This is also supported by the fact that Py_BuildValue performs Python's own internal conversion, not the one provided by PyString_AsString, so that function isn't even called in this case. Elsewhere it works as intended.

The fact that the text is already mangled before it even reaches Python makes me believe it's not a Python problem anymore. I still need to find out where I can make sure the IME gets the right encoding and that my input actually uses UTF-8 and nothing else. The whole client has UTF-8 specified as its encoding, so normally there should be no problems.
miguelmig Posted January 3, 2016

Maybe it's just the console that can't display those characters; perhaps try compiling in UNICODE mode.
ds_aim Posted January 3, 2016

11 hours ago, Vanilla said: I guess the problem is located in the IME part. [...]

Have a look at this code-page special case:

if (GetDefaultCodePage() == 1256)
{
	char* p = strchr(buf, ':');
	if (p && p[1] == ' ')
		p[1] = 0x08;
}

Did you change the encoding in the user interface (resource files) too?

#pragma code_page(xxxx)
ds_aim Posted January 3, 2016

Also check IME.cpp. Look at this case for code page 1268:

switch (outCodePage)
{
case 1268:
	dataCodePage = CP_UTF8;
	break;
default:
	dataCodePage = outCodePage;
}

You should also check the function UINT CIME::GetCodePageFromLang(LANGID langid) in IME.cpp.
miguelmig Posted January 3, 2016

You shouldn't use strdup; it's deprecated in MSVC, use _strdup instead. http://stackoverflow.com/questions/8740500/heap-corruption-with-strdup
Vanilla (Author) Posted January 4, 2016

#pragma code_page() is changed now too! Thanks. The case is already replaced; it always uses UTF-8 now, and GetCodePageFromLang was replaced to use UTF-8 as well.

I've done some research, and it seems I was partly right about what I said. The input the binary receives is not UTF-8 like I wanted it to be; instead, it's WINDOWS-1252. The conversion procedure was wrong too. In the first step, when the client receives input, it converts it from multibyte to wide characters (wchar_t). In this step it's important to pass not the encoding I want to end up with, but the encoding the input is actually written in, because this step already converts it to UTF-16, the standard encoding for wide chars. I had it set to UTF-8, which meant every ö, ä, ü I received was mangled into U+FFFD, the replacement character for invalid sequences.

I fixed that and tried Latin first, which seemed to work, and tracked what happened to the variable. The conversion back down to multibyte happens in IME.cpp. There it mangled my text again; there was a length issue, and maybe the decoding provided by WideCharToMultiByte is just worse. I replaced it, returned a wide char string to the PythonIME function, and used boost::locale to convert it manually into UTF-8. That's when I found out the input was WINDOWS-1252 and not LATIN-1, because converting from Latin to UTF-8 didn't work either; only with the Windows code page did it work without flaws.

And yes, it now works even in the console: I can see the ö. In the client there's still a weird sign, but it's very different from what I got before. It's a single character and looks like the typical case of "ö converted from latin/windows to UTF-8". Now I only need to make the client display it the way the console does. Python doesn't raise any errors anymore.
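The corrected pipeline described above can be sketched in Python terms (the actual client uses MultiByteToWideChar/WideCharToMultiByte and boost::locale; the byte values are assumptions). Decode the IME bytes with the code page they are really in, then re-encode as UTF-8. The remaining on-screen artifact matches classic mojibake: correct UTF-8 bytes rendered as if they were windows-1252:

```python
ime_bytes = b'\xf6'  # 'ö' as delivered by the IME on a German Windows (assumed)

wide = ime_bytes.decode('cp1252')  # multibyte -> wide text, using the real CP
utf8 = wide.encode('utf-8')        # wide -> UTF-8 for the client
print(utf8)                        # b'\xc3\xb6'

# If the renderer then treats those UTF-8 bytes as windows-1252,
# the single ö shows up as two characters:
print(utf8.decode('cp1252'))       # 'Ã¶'
```

That the client still shows one "wrong-looking" character while the console shows ö suggests the data is now correct and only the display-side decoding is left to fix.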