[Python 3.5]Encoding error


Hello metin2dev,

I wanted to ask for help regarding encodings. Migrating the client to Python 3.5 caused a lot of trouble, and the biggest issue is the encoding. I managed to make all the UI files etc. work with UTF-8.

The results are clear: everything displays correctly, even characters like ö, ä, ü, which are very important in German. So what's the big deal? The problem is the input. Every time I tried to type such characters, the text box stopped working. The reason is an encoding error raised by Python: it tells me that utf-8 can't decode these bytes. After that I checked the IME part of the client source.
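For context on that decode error: a lone byte such as 0xF6 (ö in Windows-1252 and Latin-1) is not a valid UTF-8 sequence, so a strict UTF-8 decoder rejects it. Below is a minimal sketch of such a check. The helper `looks_like_utf8` is hypothetical, not part of the client source, and it is deliberately simplified: it only inspects lead and continuation bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Simplified UTF-8 well-formedness check (hypothetical helper).
// It validates lead bytes and continuation bytes only; a full validator
// would also reject some overlong forms and encoded surrogates.
bool looks_like_utf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t extra;                            // continuation bytes expected
        if (b < 0x80)                    extra = 0;   // ASCII
        else if (b >= 0xC2 && b <= 0xDF) extra = 1;   // 2-byte sequence
        else if (b >= 0xE0 && b <= 0xEF) extra = 2;   // 3-byte sequence
        else if (b >= 0xF0 && b <= 0xF4) extra = 3;   // 4-byte sequence
        else return false;                            // stray continuation or invalid lead byte
        if (extra > s.size() - i - 1) return false;   // sequence is truncated
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;     // not of the form 10xxxxxx
        }
        i += extra + 1;
    }
    return true;
}

// Usage sketch: the same letter in two encodings.
void check_utf8_examples() {
    assert(!looks_like_utf8("\xF6"));      // ö as a single Windows-1252 byte
    assert(looks_like_utf8("\xC3\xB6"));   // ö encoded as UTF-8
}
```

A check like this, placed where the IME text leaves the C++ side, is one way to see which encoding the bytes are really in before Python tries to decode them.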

I did some research and soon changed the encoding throughout the source to UTF-8, except for the WebBrowser project, which still uses CP_ACP. That doesn't make a difference here anyway. The text box started working and there are no more encoding errors. Instead, as you can see in the picture, the following weird result shows up.

 

err.png

As you can see, I'm debugging the client and added a line that prints the GetText result from the IME before handing it over to Python. This symbol appears whenever I try to type an ö. I guess the encodings get mixed up somewhere, and that's why this weird symbol appears. Maybe you have an idea what could have caused the error. As already stated, the whole client should use UTF-8 now.

Thank you for your time!

 

Best regards,

Vanilla

Core X - External 2 Internal

We are the tortured.
We're not your friends.
As long as we're not visible.
We are unfixable.


I did. Every string gets encoded.

These are the functions I use for string encoding:


// Returns a malloc'd UTF-8 copy of the object's text; the caller must free() it.
char* PyString_AsString(PyObject* v)
{
    char* my_result = NULL;
    if (PyUnicode_Check(v)) {
        PyObject* temp_bytes = PyUnicode_AsEncodedString(v, "utf-8", "ignore"); // Owned reference
        if (temp_bytes != NULL) {
            // PyBytes_AS_STRING returns a borrowed pointer, so copy it before the DECREF.
            my_result = strdup(PyBytes_AS_STRING(temp_bytes));
            Py_DECREF(temp_bytes);
        }
    }
    else if (PyBytes_Check(v)) {
        my_result = strdup(PyBytes_AS_STRING(v)); // Copy the borrowed pointer
    }
    else {
        // Neither str nor bytes: fall back to repr(v).
        Tracef("Error with encoding!!!\r\n");
        PyObject* str_exc_type = PyObject_Repr(v); // Owned reference, may be NULL
        if (str_exc_type != NULL) {
            PyObject* pystr = PyUnicode_AsEncodedString(str_exc_type, "utf-8", "ignore");
            if (pystr != NULL) {
                // Copy before releasing pystr, otherwise the returned pointer dangles.
                my_result = strdup(PyBytes_AS_STRING(pystr));
                Py_DECREF(pystr);
            }
            Py_DECREF(str_exc_type);
        }
    }
    return my_result;
}

 

char* PyString_AsStringUtils(PyObject* v)
{
    char* my_result = NULL;
    if (PyUnicode_Check(v)) {
        PyObject * temp_bytes = PyUnicode_AsEncodedString(v, "utf-8", "ignore"); // Owned reference
        if (temp_bytes != NULL) {
            my_result = PyBytes_AS_STRING(temp_bytes); // Borrowed pointer
            my_result = strdup(my_result);
            Py_DECREF(temp_bytes);
        }
    }
    else if (PyBytes_Check(v)) {
        my_result = PyBytes_AS_STRING(v); // Borrowed pointer
        my_result = strdup(my_result);
    }
    else {
        Tracef("Error with encoding!!!");
    }
    return my_result;
}

 

 

Maybe I should change the ignore flag to strict and see if errors occur :/


In this case I guess chardet won't help me. I wanted to know what happened to the string itself. When I check the Python object I always get the same result: every string in Python is a unicode string. Checking them isn't necessary; that's only needed for files etc. So my Python code can only produce unicode strings. The problem must be in the binary part :/

 

If I change the encoding in CIME::GetText to CP_LATIN (which can represent ö, ä, ü), the output is always ? for these characters. It won't raise an error. The characters ß and ´ are gone as well, though + and ` work as intended. When I change back to CP_UTF8, those characters produce weird results too.


char * PyString_AsString2(PyObject * pObj)
{
#if defined(__cplusplus) && __cplusplus > 199711L
	char * pszResult2 = nullptr;
#else
	char * pszResult2 = NULL;
#endif

	if (!pObj)
		return pszResult2;

	// New reference; NULL if pObj is not a str or cannot be encoded.
	PyObject * pNewObj = PyUnicode_AsUTF8String(pObj);
	if (!pNewObj)
		return pszResult2;

	// Borrowed pointer into pNewObj, so copy it before releasing the object.
	char * pszResult = PyBytes_AsString(pNewObj);
	if (pszResult)
		pszResult2 = strdup(pszResult);
	Py_DecRef(pNewObj);

	return pszResult2; // Caller must free() the copy
}

I haven't tested it. It just encodes the object to UTF-8 and reads the result back with PyBytes_AsString. If you want to be completely sure, check the return value of PyBytes_AsString: it returns NULL on failure (nullptr in the latest version).

Kind Regards ~ Ken

Do not be sorry, be better.


I guess the problem is located in the IME part. I mean, every ö, ä, ü, etc. works fine when I read it from documents (.py for example). But they won't work when I type them into input elements. IME is the part that controls this.

And have a look at the function PyObject* imeGetText(PyObject* poSelf, PyObject* poArgs), which only calls an IME function to retrieve the string. Right after this function call the text is already mangled. I can't read it properly as either LATIN or UTF-8. The whole IME uses only UTF-8 now; I changed the encoding completely there. I didn't even hand it to Python and the text still doesn't work as intended. The console shows this. The output there is printed before Py_BuildValue is even called, so it should print the text fine, but it doesn't.

 

@Ken, your function does work, but it has no effect, so I changed back to the original one. I guess the problem isn't located there. This is also supported by the fact that Py_BuildValue uses Python's internal conversion, not the one provided by PyString_AsString, so this function isn't called in this case anyway. Elsewhere it works as intended.

The fact that the text is already mangled before it even reaches Python makes me believe that it's not a Python problem anymore. Still, I need to know where I can make sure that IME gets the right encoding and that the input I type actually arrives as UTF-8 and not anything else. The whole client has UTF-8 specified as its encoding, so normally there should be no problems.


11 hours ago, Vanilla said:

I guess the problem is located in the IME part. I mean, every ö, ä, ü, etc. works fine when I read it from documents (.py for example). But they won't work when I type them into input elements. IME is the part that controls this.

And have a look at the function PyObject* imeGetText(PyObject* poSelf, PyObject* poArgs), which only calls an IME function to retrieve the string. Right after this function call the text is already mangled. I can't read it properly as either LATIN or UTF-8. The whole IME uses only UTF-8 now; I changed the encoding completely there. I didn't even hand it to Python and the text still doesn't work as intended. The console shows this. The output there is printed before Py_BuildValue is even called, so it should print the text fine, but it doesn't.

 

@Ken, your function does work, but it has no effect, so I changed back to the original one. I guess the problem isn't located there. This is also supported by the fact that Py_BuildValue uses Python's internal conversion, not the one provided by PyString_AsString, so this function isn't called in this case anyway. Elsewhere it works as intended.

The fact that the text is already mangled before it even reaches Python makes me believe that it's not a Python problem anymore. Still, I need to know where I can make sure that IME gets the right encoding and that the input I type actually arrives as UTF-8 and not anything else. The whole client has UTF-8 specified as its encoding, so normally there should be no problems.

	if (GetDefaultCodePage() == 1256)
	{
		char * p = strchr(buf, ':'); 
		if (p && p[1] == ' ')
			p[1] = 0x08;
	}

 

 

 

Did you change the encoding in UserInterface too?

#pragma code_page(xxxx)


#pragma code_page() is changed now too! Thanks ;)

The case is already replaced; it will always use UTF-8. GetCodePageFromLang was also changed to use UTF-8.

I did some research, and it seems I was partly right and partly wrong about what I said. The input the binary receives is not UTF-8 like I wanted it to be. Instead, it's WINDOWS-1252. The encoding procedure was wrong too.

In the first step, when the client receives input, it converts it from multibyte to wide char (wchar_t). Here it's important to pass not the encoding I want to end up with, but the encoding the input is actually in! This step already converts it to UTF-16, which is the standard encoding for wide chars on Windows. I had set it to UTF-8, which meant that every ö, ä, ü I received was mangled to code 65535, the replacement character for invalid codes. I fixed it by changing it to LATIN first, and it seemed to work fine.

I tracked what happened to the variable. The conversion back down to multibyte happens in IME.cpp. It mangled my text again, because there was a length issue and maybe the conversion done by WideCharToMultiByte is just worse. I replaced it and returned a wide char string to the PythonIME function, then used boost::locale to convert it manually to UTF-8. This was the point where I found out the input was WINDOWS-1252 and not LATIN-1, because converting from LATIN encoding to UTF-8 didn't work either. Only when I used the Windows encoding did it work without flaws. And yes, it even works in the console now: I can see the ö. In the client there's still a weird sign, but it's very different from what I got before. It's only a single sign, and it looks like a typical case of "ö from latin/windows to utf-8". Now I only need to make the client display it like the console does. Python does not raise any errors anymore.
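The Windows-1252 to UTF-8 step described above can be sketched in portable C++ without boost::locale or the Win32 API. This is only an illustration under stated assumptions: `cp1252_to_utf8`, `append_utf8` and `kCp1252High` are hypothetical names, not code from the client, and the five bytes that Windows-1252 leaves undefined are mapped to U+FFFD here, which may differ from what the client's conversion does. Note that Windows-1252 and Latin-1 agree on ö itself (0xF6 in both); they only differ in the range 0x80 to 0x9F.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Code points for Windows-1252 bytes 0x80..0x9F, the only range where it
// differs from ISO-8859-1. 0xFFFD marks bytes Windows-1252 leaves undefined
// (an assumption of this sketch).
static const std::uint16_t kCp1252High[32] = {
    0x20AC, 0xFFFD, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0xFFFD, 0x017D, 0xFFFD,
    0xFFFD, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0xFFFD, 0x017E, 0x0178
};

// Append one code point below U+10000 as UTF-8 (1 to 3 bytes).
static void append_utf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Convert a Windows-1252 byte string to UTF-8.
std::string cp1252_to_utf8(const std::string& in) {
    std::string out;
    for (char ch : in) {
        unsigned char b = static_cast<unsigned char>(ch);
        // Outside 0x80..0x9F the byte value equals the Unicode code point.
        std::uint32_t cp = (b >= 0x80 && b <= 0x9F) ? kCp1252High[b - 0x80] : b;
        append_utf8(out, cp);
    }
    return out;
}

// Usage sketch with the characters mentioned in the thread.
void check_cp1252_examples() {
    assert(cp1252_to_utf8("\xF6") == "\xC3\xB6");      // ö
    assert(cp1252_to_utf8("\xDF") == "\xC3\x9F");      // ß
    assert(cp1252_to_utf8("\x80") == "\xE2\x82\xAC");  // €, differs from Latin-1
}
```

If the display side then interprets those UTF-8 bytes as a single-byte code page again, the ö will still look wrong, so the rendering path has to agree on UTF-8 as well.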
