Hi there,

 

I've been having trouble with memory corruption on my production server for a few months.

Sometimes, somewhere, somehow, the server crashes on multiple channels at the same time.

 

At first, I got backtraces from crashes occurring while iterating over map / unordered_map containers of entities.

After a lot of modifications and refactoring, it now goes wrong in quests when allocating memory, directly inside liblua.

 

A few weeks ago I found this old Stack Exchange post that describes my problem perfectly:

https://softwareengineering.stackexchange.com/questions/252696/debugging-memory-corruption

I know, it's a long story. If you don't want to read everything: the author explains that he got memory corruption after lots and lots of work on the sources.

He was unable to reproduce a crash on a test server, as it seemed to need real players to crash.

The crashes occurred randomly, but mostly in the same functions: first while iterating over entities in maps, then, after modifications, somewhere else when items expired...

 

Now I'm in the same position: the production server crashes once or twice a day on multiple channels/cores at the same time, sometimes 10 minutes after a restart, and sometimes it doesn't crash for 2 days...

It can crash on any map or any channel. But it's happening mostly on the channel with the most players.

 

I'm attaching a few backtraces so you can see.

The most recent one (from today) is the first, in luaM_realloc:

new.jpg

old.jpg

old2.jpg

 

Maybe one of you has already had a similar problem and has an idea of how to locate where the corruption happens.

Thanks for your time!

 

Link to comment
Share on other sites

  • Premium

This is the result of the C++ code base as a whole: once the game is under real use, it turns against you.

I'm not sure, but I think it's a general memory leak, or someone is damaging the game from the inside to make the memory overflow.

There was monsterchat function xD

 

Are people playing on this server? How many players? Do you have an offline shop?

Link to comment
Share on other sites

Hi, thanks for your answer.

It was a server with a few hundred players.

Now it's a test server with very few players on it, but it still crashes sometimes; not as often now, so it's getting difficult to fix.

 

When the server was open to players and I had run out of ideas, I tried progressively disabling every system I had made; it still crashed.

I coded everything on my server myself, starting from Marty's sources (thanks to him).

 

Of course I have my own versions of offline shops, searchshop, guild safebox, server-side switchbot, inventory expansion, and so on. I tried disabling each of them, one by one and one after another.

Everything that needs the database uses a cache method.

 

I'm using UTF-8 encoding everywhere (database, quests, client, mob/item names... everything). I tried disabling that as well and still got crashes.

Every LC_TEXT string in the game sources is in English.

 

Compiled on BSD with C++17; it also crashed when built with the 4.9 compiler.

 

I have put a lot of effort and time into these sources and I'm really out of ideas right now.

Link to comment
Share on other sites

Hi, thanks again for your answer.

Sorry, I only wrote my shop search system recently, so the crashes were already happening before that.

 

I don't think it can be caused by an execution time that is too long.

(As I said, most of the database usage is cached; my search shop system doesn't use SQL queries at all, as it accesses the offlineshop cache directly.)

 

At one point I thought it could be a memset or memcpy that overflows, but I have already checked several times and couldn't find anything like that.

Should I avoid those functions altogether? If so, what would be the best way to fill / copy memory?
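
On the fill / copy question: memset and memcpy are not wrong in themselves; the usual trouble is a size argument that does not match the real buffer. A minimal sketch of alternatives where the compiler tracks the size is given here. ItemRecord, copy_cstr, copy_ints and make_empty_record are hypothetical names made up for illustration, not taken from the server sources.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdio>

// Hypothetical fixed-size record, standing in for a packet or table struct.
struct ItemRecord {
    char name[8];
    std::array<int, 4> sockets;
};

// Copies a C string into a fixed buffer. N is deduced from the array type,
// so it cannot disagree with the real buffer size. The result is truncated
// if needed and always null-terminated.
template <std::size_t N>
void copy_cstr(char (&dst)[N], const char* src) {
    std::snprintf(dst, N, "%s", src);
}

// Copies at most N ints, never writing past the destination.
// Returns how many elements were actually copied.
template <std::size_t N>
std::size_t copy_ints(std::array<int, N>& dst, const int* src, std::size_t count) {
    const std::size_t n = std::min(count, N);
    std::copy_n(src, n, dst.begin());
    return n;
}

// Value-initialization zeroes every member; no memset and no sizeof needed.
inline ItemRecord make_empty_record() {
    return ItemRecord{};
}
```

With these, `copy_cstr(record.name, input)` cannot overflow `name` even when `input` is longer than the buffer, and an over-long source array is simply cut off by `copy_ints`.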

 

Link to comment
Share on other sites

  • Bronze

First of all, there might be more than one reason why your game crashes.

Your cores show 3 different outcomes:

1. A malfunction at the Lua level. If you want to diagnose this more deeply, consider changing the optimization flag:
https://docs.oracle.com/cd/E37670_01/E52461/html/ch04s03.html

Then gdb should give you more details.

2. Don't use TR1. As far as I know it's going to be deprecated soon, and since the C++11 standard was released it's pointless to use TR1 features (the Technical Report was a bridge between C++03 and C++11). Update your gcc and switch from TR1 to the STL. That will probably solve this error (it worked for me).

3. The third error is a bit tricky and might be tough to figure out. Your core shows that there was not enough memory left to allocate, hence that strange abort. I would not treat that spot as the faulty code: it was probably just the random place in the code where the system ran out of memory. You should look for a leak somewhere else.
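
For point 2, a small sketch of what the std version of an entity map can look like. It also shows one pattern that commonly produces crashes while iterating such maps in general: erasing inside the loop. Nothing in the thread confirms this is the cause here; Entity, EntityMap and purge_dead are hypothetical names used only for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Hypothetical entity type; the real server classes are not shown in the thread.
struct Entity {
    bool dead = false;
};

// std::unordered_map (C++11) replaces std::tr1::unordered_map.
using EntityMap = std::unordered_map<std::uint32_t, Entity>;

// Removes dead entities while iterating. erase() invalidates the erased
// iterator, so the loop continues from the iterator that erase() returns
// instead of incrementing the old one.
inline std::size_t purge_dead(EntityMap& entities) {
    std::size_t removed = 0;
    for (auto it = entities.begin(); it != entities.end(); ) {
        if (it->second.dead) {
            it = entities.erase(it);
            ++removed;
        } else {
            ++it;
        }
    }
    return removed;
}
```

The same care applies when a callback invoked from inside the loop ends up inserting into or erasing from the container being iterated.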

One additional question: did you disable checkpointing? If the answer is 'yes', switch it back on immediately.
If you aren't able to solve those crashes and are really eager to get it done, send me a PM. Keep in mind tho that if a lot of diagnosis turns out to be needed, I won't do it for free.
Good luck

Link to comment
Share on other sites

Hi Sherer, thank you very much for your detailed answer.

 

I'm on the O2 flag by default. I remember changing it before, but it didn't give much better results in the diagnosis, so I put it back to normal.

As it's a test server now, I'll try to keep it at the default O0 from now on.

Here are the compiler flags I use:

-m32 -g -Wall -O -pipe -fexceptions -std=gnu++17 -fno-strict-aliasing -pthread -D_THREAD_SAFE -DNDEBUG -fstack-protector-all

 

The second error, with the tr1 lists, is where the crashes first started.

I then changed all the tr1 lists in the sources to std lists, and did the same for every boost list.

After that I got errors with affects. I remember removing the boost affect_pool in affect.cpp, to make it use the default M2 allocator as if DEBUG_ALLOC were defined in that file.

 

 

For the third error, I remember having some trouble with the memory usage of the server back then.

I'm not sure if it was exactly at that time, but the memory usage of my machine was at nearly 100% even with the game not started.

The machine had not been restarted for years. Since the restart, the memory usage has been fine.

At the time, I thought the memory corruption could be caused by a memory problem on my server, so I rented a new machine, but I still got crashes on it.

 

But the error seems to be related to the first one, which I got yesterday after 35 days without a reboot or a crash (with very, very few players on the server, so this doesn't really show that it crashes less than before).

I did an update, so I restarted the game and let some 3-5 players try it, and got a crash after a few hours while a player was attempting to refine ores at a guild alchemist.

It worked 3 times, and then it crashed giving a backtrace to luaM_realloc because it couldn't allocate memory.

 

 

As for checkpointing, everything is at its defaults; I haven't touched anything there.

I'll only be able to check tomorrow to be sure.

 

Thanks for the offer, I'll keep that in mind.

Link to comment
Share on other sites

  • Premium

I would recommend trying to build your cores with ASAN (AddressSanitizer) enabled. I can't promise anything, but it can help in desperate times... I remember having the weirdest crashes when AE opened. There was a heap-use-after-free error related to our multilanguage and battlepass systems that we would never have been able to find without ASAN. (I don't think that's the case here, I just wanted to mention it as an example.) I think it's worth a try. There are some nice articles about it on the internet, so I'm sure you can manage it, but if you have questions about it I will try to answer them (even though I'm not an expert user).
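
To illustrate the class of bug that ASAN reports as heap-use-after-free: it typically comes from something (an event, a timer, a cache entry) keeping a raw pointer to an object that has already been deleted. One way to rule that pattern out is to hold a std::weak_ptr instead. This is only a sketch under that assumption; Character and DelayedEvent are hypothetical stand-ins, not classes from the game sources.

```cpp
#include <memory>
#include <string>

// Hypothetical stand-in for a character owned elsewhere via shared_ptr.
struct Character {
    std::string name;
};

// Hypothetical delayed event that refers to a character.
struct DelayedEvent {
    std::weak_ptr<Character> target;  // does not keep the character alive

    // Returns the target's name, or an empty string if the character
    // was already destroyed before the event fired.
    std::string fire() const {
        if (auto ch = target.lock()) {
            return ch->name;
        }
        return {};
    }
};
```

With a raw `Character*` in place of the weak_ptr, firing the event after the character was destroyed would read freed memory, which is exactly the kind of access ASAN stops on.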


Link to comment
Share on other sites

  • Bronze
I don't think pooling is used by default (DEBUG_ALLOC should be disabled in release mode). Keeping std instead of TR1 is good tho.
If there wasn't any crash throughout those 35 days with no players on your server, that probably means the error is linked to some player-dependent stuff.

@masodikbela has come up with the right idea. You can try to perform some memory leak diagnosis using ASAN or valgrind (it's up to you):

https://github.com/google/sanitizers/wiki/AddressSanitizer
http://www.valgrind.org/docs/manual/quick-start.html

On the other hand, you can port your source to Windows and use Visual Studio's built-in profiler:
 

https://docs.microsoft.com/en-us/visualstudio/profiling/memory-usage?view=vs-2019

 

Link to comment
Share on other sites

Hi, thanks again !

 

Alright, so I tried valgrind first because it's easier to use, and got this error when starting the game processes under valgrind:

valgrind: I failed to allocate space for the application's stack.
valgrind: This may be the result of a very large --main-stacksize=
valgrind: setting.  Cannot continue.  Sorry.

I then tried passing the --main-stacksize argument with different values, and it still gives me this error.

Maybe one of you has a solution?

# pkg info valgrind
valgrind-3.10.1.20160113_7,1
Name           : valgrind
Version        : 3.10.1.20160113_7,1
Installed on   : Tue Jun 11 08:41:47 2019 CEST
Origin         : devel/valgrind
Architecture   : FreeBSD:11:amd64
Prefix         : /usr/local
Categories     : devel
Licenses       : GPLv2
Maintainer     : [email protected]
WWW            : https://bitbucket.org/stass/valgrind-freebsd/overview
Comment        : Memory debugging and profiling tool
Options        :
        32BIT          : on
        DOCS           : on
        MANPAGES       : on
        MPI            : off
Annotations    :
        FreeBSD_version: 1102000
        repo_type      : binary
        repository     : FreeBSD

My system:

11.2-RELEASE-p9 FreeBSD 11.2-RELEASE-p9 #0: Tue Feb  5 15:30:36 UTC 2019     [email protected]:/usr/obj/usr/src/sys/GENERIC  amd64

 

Now I'm going to try with ASAN.

Link to comment
Share on other sites

  • Premium
20 hours ago, ElRenardo said:

Hi, thanks for your answer.

I'm not familiar with this kind of thing.

 

Is it like a library that you link to your project?

Any advice on which flags to use to get the best results?

 

I'll try that tomorrow as well, thanks again.

It's a built-in feature of clang; if you are using clang, you just have to pass a few more arguments to the build. (I think it works with gcc too, but I'm not sure, since I have always used clang and barely worked with gcc.) I'm using pretty much default settings for ASAN; the related arguments are:

CFLAGS += -fsanitize=address -fsanitize-recover=address -fno-omit-frame-pointer

If you have ASAN enabled, note that:

  • it will use much, much more memory (around 2-3 times as much as usual)
  • the program will always abort/crash on the first anomaly. This is necessary, since if it continued there would be a chance of it producing false results... This can be annoying when there are several bad parts in your code, because it can happen that you want to fix x, but you have to fix y, z, i, j, k first, even if they are not related to your main problem.
  • the program will not generate a core dump; instead it writes the result to stderr, so you might want to run the program without vrunner and redirect the output to a file (instead of writing it to the console)
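
Since an ASAN core behaves so differently (more memory, no core dump, report on stderr), it can help to have the binary say at startup whether it was really built with the sanitizer. A minimal sketch: `__has_feature(address_sanitizer)` is the clang check and `__SANITIZE_ADDRESS__` is the macro gcc defines under -fsanitize=address, while BUILT_WITH_ASAN, built_with_asan and build_label are hypothetical names chosen for this example.

```cpp
// Clang exposes __has_feature(address_sanitizer); GCC defines
// __SANITIZE_ADDRESS__ when -fsanitize=address is in effect.
#if defined(__has_feature)
#  if __has_feature(address_sanitizer)
#    define BUILT_WITH_ASAN 1
#  endif
#endif
#if !defined(BUILT_WITH_ASAN) && defined(__SANITIZE_ADDRESS__)
#  define BUILT_WITH_ASAN 1
#endif
#ifndef BUILT_WITH_ASAN
#  define BUILT_WITH_ASAN 0
#endif

// True when this translation unit was compiled with AddressSanitizer.
constexpr bool built_with_asan() {
    return BUILT_WITH_ASAN != 0;
}

// Short label suitable for a startup log line.
inline const char* build_label() {
    return built_with_asan() ? "asan" : "plain";
}
```

Logging `build_label()` once when the core boots makes it obvious from the syslog which kind of build produced a given crash report.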

I can't help with valgrind since this is the first time I've heard of it.

@Sherer That's not true for the affect pool. It's enabled by default, because it's #ifndef DEBUG_ALLOC, not #ifdef DEBUG_ALLOC.
sT3k6mx.png
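
The difference between the two directives can be sketched as follows. Only DEBUG_ALLOC comes from the thread; kAffectPoolEnabled is a hypothetical name, and the real affect.cpp is not reproduced here.

```cpp
// With #ifndef, the first branch is compiled when DEBUG_ALLOC is NOT defined,
// which is the normal case for a release build.
#ifndef DEBUG_ALLOC
constexpr bool kAffectPoolEnabled = true;   // DEBUG_ALLOC absent: pool is used
#else
constexpr bool kAffectPoolEnabled = false;  // DEBUG_ALLOC defined: plain allocator
#endif
```

So in a build that never defines DEBUG_ALLOC the pool path is the one that gets compiled, which matches what removing the pool "as if DEBUG_ALLOC was declared" was meant to change.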


Link to comment
Share on other sites

  • Bronze
How much RAM have you got on your VPS?

Link to comment
Share on other sites
