Vanilla Posted January 2, 2016 (edited)

Hello metin2dev,

I wanted to ask for help regarding encodings. There was a lot of trouble when I migrated the client to Python 3.5, and one big issue is the encoding. I managed to make all the UI files etc. work with UTF-8. The results are clear: you can see everything, even characters like ö, ä, ü. In German they're very important.

So what's the big deal? The problem is the input. Every time I tried to input such characters, the text box stopped working. The reason is an encoding error from Python: it tells me that UTF-8 can't decode those bytes. I then checked the IME part of the client source. After some research I changed the whole encoding in the source to UTF-8, except for the WebBrowser project, which still sticks to CP_ACP, but that makes no difference here. The text box started working and there are no more encoding errors. Instead, as you can see in the picture, a weird result shows up.

As you can see, I'm debugging the client and added a line that prints the GetText result from the IME before handing it over to Python. That symbol appears whenever I try to type an ö character. I guess some encoding is being mixed up somewhere, and that's why the weird symbol appears. Maybe you have an idea what could have caused the error. As already stated, the whole client should use UTF-8 now.

Thank you for your time!

Best regards,
Vanilla

We are the tortured. We're not your friends. As long as we're not visible. We are unfixable.
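A minimal sketch of the reported failure, in plain Python (the byte value is an assumption: 0xF6 is 'ö' in windows-1252, the code page a German Windows IME typically delivers). A single such byte is not valid UTF-8, which produces exactly the "utf-8 can't decode these bytes" error described above:

```python
raw = b'\xf6'  # 'ö' as a single windows-1252 byte (assumed IME input)

# Decoding it as UTF-8 fails, because 0xF6 starts an incomplete
# multi-byte sequence in UTF-8:
try:
    raw.decode('utf-8')
except UnicodeDecodeError as e:
    print('utf-8 cannot decode:', e.reason)

# Decoding with the code page the bytes are actually in works fine:
print(raw.decode('cp1252'))  # ö
```

So the question is not whether the client *wants* UTF-8, but which encoding the bytes coming out of the IME really are.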
ds_aim Posted January 2, 2016

Locale.cpp:

int MULTI_LOCALE_CODE = 949;
Vanilla (Author) Posted January 2, 2016

I already checked that. It's changed:

locale.cpp

int MULTI_LOCALE_CODE = CP_UTF8;
Ken Posted January 2, 2016

As far as I know, Python started using Unicode strings in the 3.x versions. Did you try to convert it?

It's nice to hear this. Welcome back, Vanilla.

Kind Regards ~ Ken

Do not be sorry, be better.
BeHappy4Ever Posted January 2, 2016

Lol, Vanilla is back? <3 We missed you
Vanilla (Author) Posted January 2, 2016

I did. Every string gets encoded. These are the functions I use for string encoding (PyBytes_AS_STRING returns a borrowed pointer, so I always strdup the result before releasing the temporary bytes object):

char* PyString_AsString(PyObject* v)
{
	char* my_result = NULL;

	if (PyUnicode_Check(v))
	{
		// Owned reference
		PyObject* temp_bytes = PyUnicode_AsEncodedString(v, "utf-8", "ignore");
		if (temp_bytes != NULL)
		{
			my_result = PyBytes_AS_STRING(temp_bytes); // Borrowed pointer
			my_result = strdup(my_result);
			Py_DECREF(temp_bytes);
		}
	}
	else if (PyBytes_Check(v))
	{
		my_result = PyBytes_AS_STRING(v); // Borrowed pointer
		my_result = strdup(my_result);
	}
	else
	{
		PyObject* str_exc_type = PyObject_Repr(v);
		PyObject* pystr = PyUnicode_AsEncodedString(str_exc_type, "utf-8", "ignore");
		if (pystr != NULL)
			my_result = strdup(PyBytes_AS_STRING(pystr)); // copy before releasing pystr
		Py_XDECREF(str_exc_type);
		Py_XDECREF(pystr);
		Tracef("Error with encoding!!!\r\n");
	}

	return my_result;
}

PyString_AsStringUtils is identical except for the fallback branch, which only calls Tracef.

Maybe I should change the ignore flag to strict and see if errors occur :/
ds_aim Posted January 2, 2016

10 minutes ago, BeHappy4Ever said: Lol vanilla is back? <3 we missed you

Please, don't be a kid.
Ken Posted January 2, 2016

Hmm, I'm thinking about a solution for you.
ds_aim Posted January 2, 2016

You can use chardet to detect the encoding of a string, which would be one way to convert a list of them to Unicode. I'll check the py35 includes.
Vanilla (Author) Posted January 2, 2016

In this case I guess chardet won't help me. I wanted to know what happened to the string itself, and when I check the Python object I always get the same result: every string in Python 3 is a Unicode string. Checking them isn't necessary there; it's only necessary for files etc. So my Python code can only ever produce Unicode characters. The problem must be in the binary part :/

If I change the encoding in CIME::GetText to Latin (which can represent ö, ä, ü), the output for these characters is always ?, and no error is raised. The characters ß and ´ are gone too, though + and ` work as intended. When I change back to CP_UTF8, those characters also produce weird results.
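The two symptoms described above can be sketched in Python (byte values are assumptions for illustration). A '?' usually means a character was unmappable during an encode with a replacement policy, while a dropped or odd character usually means foreign bytes were decoded as UTF-8 with 'ignore' or 'replace':

```python
# 1) Encoding to a single-byte code page turns unmappable characters
#    into '?' (this is also what WideCharToMultiByte does with its
#    default replacement character):
text = '\u20acö'  # '€' has no latin-1 code point; 'ö' does
print(text.encode('latin-1', errors='replace'))  # b'?\xf6'

# 2) Decoding windows-1252 bytes as UTF-8 with 'replace' yields the
#    Unicode replacement character U+FFFD instead of an error:
bad = 'ö'.encode('cp1252').decode('utf-8', errors='replace')
print(bad, hex(ord(bad)))  # '\ufffd' 0xfffd
```

Neither path raises, which would explain seeing garbage instead of exceptions.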
Ken Posted January 2, 2016

char* PyString_AsString2(PyObject* pObj)
{
#if defined(__cplusplus) && __cplusplus > 199711L
	if (!pObj)
		return nullptr;

	char* pszResult2 = nullptr;
#else
	if (!pObj)
		return NULL;

	char* pszResult2 = NULL;
#endif

	PyObject* pNewObj = PyUnicode_AsUTF8String(pObj);
	char* pszResult = PyBytes_AsString(pNewObj);
	pszResult2 = strdup(pszResult);
	Py_DecRef(pNewObj);
	return pszResult2;
}

I didn't test it. I just encode the object and retrieve the bytes with PyBytes_AsString. If you want to be totally sure, check the return value of PyBytes_AsString: it returns NULL on failure (nullptr in the latest versions).

Kind Regards ~ Ken
Vanilla (Author) Posted January 2, 2016

I guess the problem is located in the IME part. I mean, every ö, ä, ü etc. works fine when I read it from documents (.py for example), but they won't work when I use them inside input elements, and the IME is the part that controls those. Have a look at the function PyObject* imeGetText(PyObject* poSelf, PyObject* poArgs), which only calls an IME function to retrieve the string. Right after that call the text is already mangled; I can't read it properly as either Latin or UTF-8. The whole IME now uses only UTF-8; I changed the encoding there completely. The text is broken even before I hand it to Python: the console output shown here is produced before Py_BuildValue is even called, so it should print the text fine, but it doesn't.

@Ken, your function does work, but it has no effect, so I changed it back to the original one. The problem isn't located there, I guess. This is also supported by the fact that Py_BuildValue performs Python's own internal conversion, not the one provided by PyString_AsString, so that function isn't even called in this case. Elsewhere it works as intended.

The fact that the text is already mangled before it even reaches Python makes me believe it's not a Python problem anymore. I still need to find out where I can make sure the IME gets the right encoding and that my input actually uses UTF-8 and nothing else. The whole client has UTF-8 specified as its encoding, so normally there should be no problems.
miguelmig Posted January 3, 2016

Maybe it's just the console that can't display those characters; perhaps try compiling in UNICODE mode.
ds_aim Posted January 3, 2016

11 hours ago, Vanilla said: I guess the problem is located in the IME part. [...]

Have a look at this code-page special case:

if (GetDefaultCodePage() == 1256)
{
	char* p = strchr(buf, ':');
	if (p && p[1] == ' ')
		p[1] = 0x08;
}

Did you change the encoding in the user interface (resource files) too?

#pragma code_page(xxxx)
ds_aim Posted January 3, 2016

Also check IME.cpp. Look at this case for code page 1268:

switch (outCodePage)
{
case 1268:
	dataCodePage = CP_UTF8;
	break;
default:
	dataCodePage = outCodePage;
}

You should also check the function UINT CIME::GetCodePageFromLang(LANGID langid) in IME.cpp.
miguelmig Posted January 3, 2016

You shouldn't use strdup; it's deprecated in MSVC, use _strdup instead. http://stackoverflow.com/questions/8740500/heap-corruption-with-strdup
Vanilla (Author) Posted January 4, 2016

#pragma code_page() is changed now too! Thanks. The case is already replaced; it always uses UTF-8 now, and GetCodePageFromLang was replaced to use UTF-8 as well.

I've done some research, and it seems I was partly right about what I said. The input the binary receives is not UTF-8 like I wanted it to be; instead, it's WINDOWS-1252. The conversion procedure was wrong too. In the first step, when the client receives input, it converts it from multibyte to wide characters (wchar_t). In this step it's important to pass not the encoding I want to end up with, but the encoding the input is actually written in, because this step already converts it to UTF-16, the standard encoding for wide chars. I had it set to UTF-8, which meant every ö, ä, ü I received was mangled into U+FFFD, the replacement character for invalid sequences.

I fixed that and tried Latin first, which seemed to work, and tracked what happened to the variable. The conversion back down to multibyte happens in IME.cpp. There it mangled my text again; there was a length issue, and maybe the decoding provided by WideCharToMultiByte is just worse. I replaced it, returned a wide char string to the PythonIME function, and used boost::locale to convert it manually into UTF-8. That's when I found out the input was WINDOWS-1252 and not LATIN-1, because converting from Latin to UTF-8 didn't work either; only with the Windows code page did it work without flaws.

And yes, it now works even in the console: I can see the ö. In the client there's still a weird sign, but it's very different from what I got before. It's a single character and looks like the typical case of "ö converted from latin/windows to UTF-8". Now I only need to make the client display it the way the console does. Python doesn't raise any errors anymore.
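The corrected pipeline described above can be sketched in Python terms (the actual client uses MultiByteToWideChar/WideCharToMultiByte and boost::locale; the byte values are assumptions). Decode the IME bytes with the code page they are really in, then re-encode as UTF-8. The remaining on-screen artifact matches classic mojibake: correct UTF-8 bytes rendered as if they were windows-1252:

```python
ime_bytes = b'\xf6'  # 'ö' as delivered by the IME on a German Windows (assumed)

wide = ime_bytes.decode('cp1252')  # multibyte -> wide text, using the real CP
utf8 = wide.encode('utf-8')        # wide -> UTF-8 for the client
print(utf8)                        # b'\xc3\xb6'

# If the renderer then treats those UTF-8 bytes as windows-1252,
# the single ö shows up as two characters:
print(utf8.decode('cp1252'))       # 'Ã¶'
```

That the client still shows one "wrong-looking" character while the console shows ö suggests the data is now correct and only the display-side decoding is left to fix.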