OOM #4943

Open
opened 2023-07-15 12:33:41 +00:00 by AliasAlreadyTaken · 18 comments

[711040.808986] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/cron.service,task=minetestserver,pid=1894647,uid=1001
[711040.809254] Out of memory: Killed process 1894647 (minetestserver) total-vm:54905192kB, anon-rss:38028856kB, file-rss:2304kB, shmem-rss:0kB, UID:1001 pgtables:106184kB oom_score_adj:0
[711044.436208] oom_reaper: reaped process 1894647 (minetestserver), now anon-rss:416kB, file-rss:792kB, shmem-rss:0kB

> pgtables:106184kB

Could turning on or tweaking (transparent) huge pages help here?

I assume this is because of the voice attack earlier?

Member

Before the server crashed I saw Ravise fighting with three Voicers, ran there and saw they were all frozen. I was able to run around there and out of the danger zone for at least twenty seconds, because I knew what would happen soon, before the server went down.

Member

most minetest mods are pretty memory conscious, because luajit used to have a 1GiB limit. it no longer has that limit, but it'd be useful to see how much memory lua was using before the crash, before blaming a mod for this issue. we log how much memory lua is using every 30 seconds or something.

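For reference, a minimal sketch of what such periodic logging could look like, using the standard `collectgarbage("count")` call from a globalstep. The 30-second interval and the log message format are assumptions for illustration, not the actual YL logging code:

```lua
-- hypothetical sketch: log the Lua heap size every 30 seconds
-- (interval and message format are assumptions, not the real YL logging code)
local LOG_INTERVAL = 30
local timer = 0

minetest.register_globalstep(function(dtime)
	timer = timer + dtime
	if timer < LOG_INTERVAL then
		return
	end
	timer = 0
	-- collectgarbage("count") returns the memory currently used by Lua, in KiB
	local kib = collectgarbage("count")
	minetest.log("action", ("lua is using %.1f MB"):format(kib / 1024))
end)
```
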
flux added the kind/bug and prio/critical labels 2023-07-17 03:18:46 +00:00
Author
Owner

![grafik](/attachments/71c56549-41ac-451e-9aaa-f050e992fd70)
Author
Owner

![grafik](/attachments/3246b4bf-28e8-4268-8407-d251ad3ec29a)
Author
Owner
1. Is there a way to see the memory allocation of the engine compared to that of the mods?

Most likely yes, that's what flux did with the logs that say "lua is using ... MB". But ...

2. Is there a way to see the memory allocation of a single mod?

3. What's lj_alloc_free and release_unused_segments? What's gc_sweep?

Member
> 1. Is there a way to see the memory allocation of the engine compared to that of the mods?

we can see what lua has allocated. i doubt we have a major lua memory leak. we log how much memory the lua environment is using already, see #4249. if you've got privs, you can also see how much memory lua is using w/ the `/memory` command. i don't remember ever seeing YL getting above 1GiB.

> 2. Is there a way to see the memory allocation of a single mod?

while implementing the spawnit mod, i've created a tool to *approximate* how much memory a particular data structure is using. i can't guarantee how accurate it is, and it's *very* expensive to use on complicated data structures.

see https://github.com/fluxionary/minetest-futil/blob/main/util/memory.lua

> 3. What's lj_alloc_free and release_unused_segments? What's gc_sweep?

i don't know what any of these terms mean exactly rn. the terms seem like things i should get to know better. gc_sweep makes me think of java and lisp, though *shrug*

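For readers who don't want to dig through futil, here is a rough, hypothetical sketch of the general idea behind approximating a table's memory footprint. The per-value cost constants are guesses, and this is not the actual futil code:

```lua
-- naive, hypothetical estimator: walk a table and sum guessed per-value costs
-- (the constants below are rough guesses, not measurements of luajit internals)
local function approx_size(value, seen)
	seen = seen or {}
	local t = type(value)
	if t == "boolean" or t == "number" or t == "nil" then
		return 16
	elseif t == "string" then
		return 24 + #value
	elseif t == "table" then
		if seen[value] then
			return 0 -- don't count shared tables twice
		end
		seen[value] = true
		local total = 40
		for k, v in pairs(value) do
			total = total + approx_size(k, seen) + approx_size(v, seen)
		end
		return total
	end
	return 0 -- functions, userdata etc. are ignored here
end
```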

gc_sweep would probably be part of the garbage collector in lua

Member
> 3. What's lj_alloc_free and release_unused_segments? What's gc_sweep?

Those 3 are luajit functions:

```c
/* Partial sweep of a GC list. */
static GCRef *gc_sweep(global_State *g, GCRef *p, uint32_t lim) { ... }
```

As the comment says. Seems to come from an "incremental mark and sweep garbage collection strategy".

```c
static LJ_NOINLINE void *lj_alloc_free(void *msp, void *ptr) { ... }
```

No comment, but my guess: marks an allocated memory chunk as unused and consolidates it with surrounding unused chunks.

```c
/* Unmap and unlink any mmapped segments that don't contain used chunks */
static size_t release_unused_segments(mstate m) { ... }
```

Purely just speculation:

If we are sure (are we?) that it's not lua memory usage, maybe memory is leaking on the C side of things, but caused by some lua code creating massive amounts of objects (that are allocated/freed lua-side?). Flux mentioned entities trying to spawn in inactive mapblocks and disappearing, can that be the cause? Maybe some mod spamming `add_entity()` on each globalstep if it fails? Or something similar?

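To illustrate the failure mode speculated about above: `minetest.add_entity()` returns nil when the spawn fails (for example because the target mapblock isn't loaded), and a mod that blindly retries every globalstep keeps doing work for nothing. A hedged sketch of a back-off guard, where the position, entity name and timings are made up for the example:

```lua
-- hypothetical example of guarding a retried spawn instead of spamming it
local retry_after = 0

minetest.register_globalstep(function(dtime)
	retry_after = retry_after - dtime
	if retry_after > 0 then
		return
	end
	local pos = vector.new(100, 20, 100) -- example position
	local obj = minetest.add_entity(pos, "mobs:sheep") -- example entity name
	if not obj then
		retry_after = 10 -- spawn failed, e.g. block not loaded: back off
	end
end)
```
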
Member

This would be pure coincidence, but I learned that `moremesecons_entity_detector` does not have any limit on range and people use it for mob farms and such. Since it just allocates a list of all entities it found, and people use it with ranges like `5000` and in the nether mob farms, could that be the culprit?

Or am I missing something?

Since we discussed it in chat, Chache reported it:
#4988

UPD: tried creating massive amounts of detectors on the test server. It lags the server, but I can't see any other effects by just looking at dtime. Will need Alias to run perf and top or something X)

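If the detector range does turn out to matter, a simple mitigation on the mod side would be to clamp the radius before querying. A minimal sketch, assuming the detector ultimately calls `minetest.get_objects_inside_radius()`; the cap value below is made up:

```lua
-- hypothetical clamp on the detector radius before querying for objects
local MAX_DETECTOR_RADIUS = 16 -- assumed cap, not an actual moremesecons setting

local function find_entities_capped(pos, radius)
	radius = math.min(radius, MAX_DETECTOR_RADIUS)
	return minetest.get_objects_inside_radius(pos, radius)
end
```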

Not sure about the specific implementation of the Lua GC, but garbage collectors often run in background threads or intermittently. So it could be just coincidence, although one of the symptoms of getting close to the memory consumption limit is that the GC runs often and sometimes takes relatively long, trying hard to free some memory.

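One way to tell collectible garbage apart from live data on the Lua side is to compare the heap size before and after forcing a full collection. A minimal sketch using the standard `collectgarbage()` API; where and how this gets triggered (chat command, console, etc.) is left open:

```lua
-- hypothetical check: how much of the Lua heap is actually reclaimable?
local before = collectgarbage("count") -- KiB currently used by Lua
collectgarbage("collect")              -- force a full GC cycle
local after = collectgarbage("count")
minetest.log("action",
	("lua heap: %.1f MB -> %.1f MB after full collect"):format(before / 1024, after / 1024))
```
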
Member

We spent a whole day running an array of almost 2000 entity detectors on the test server. It caused dtime to depend heavily on the number of active entities (as expected), but did not seem to cause any leaks - memory usage also just corresponded to the number of mobs.

Kinda obvious results, but at least that was checked.

Author
Owner

I'd like to add "the release of unused mapblocks" to the list of suspects.

The release of unused blocks is governed by this setting:

```
# How long the server will wait before unloading unused mapblocks, stated in seconds.
# Higher value is smoother, but will use more RAM.
# type: int min: 0 max: 4294967295
# server_unload_unused_data_timeout = 29
```

Our setting is `server_unload_unused_data_timeout = 900`, that's 15 minutes. So (after a restart) I teleported two accounts to random locations with this script:

```lua
yl_events.timer = 0
yl_events.stopme = false
yl_events.teleport = function(dtime)
	if yl_events.stopme == true then return end
	yl_events.timer = yl_events.timer + dtime
	if yl_events.timer < 5 then
		return
	end
	yl_events.timer = 0
	local objs = core.get_connected_players()
	for _, obj in ipairs(objs) do
		local x = math.random(-30000, 30000)
		local y = math.random(-30000, 30000)
		local z = math.random(-30000, 30000)
		local v = vector.new(x, y, z)
		obj:set_pos(v)
	end
end
minetest.register_globalstep(yl_events.teleport)
```

This resulted in a growing number of blocks loaded into memory (and mapgen running wild) and also a growing memory usage, both in a linear fashion. Maximum memory was 21.5GB virtual and 19.4GB reserved memory for Minetest_test and maximum number of loaded blocks around 1 Million.

![grafik](/attachments/5f858a5a-4822-453d-8fc4-8d02b2a6154a)

While the number of loaded mapblocks stayed pretty much constant after the initial "filling phase", the memory did not return to anything below those maximum values, even though I logged off both accounts.

perf top:

```
  19.51%  minetestserver                   [.] MapSector::getBlocks
   5.77%  minetestserver                   [.] Map::timerUpdate
   5.54%  minetestserver                   [.] lj_tab_get
   1.97%  minetestserver                   [.] LuaEntitySAO::step
   1.93%  minetestserver                   [.] ServerMap::save
   1.81%  minetestserver                   [.] std::_Function_handler<void (ServerActiveObject*), ServerEnvironment::step(float)::{lambda(ServerActiveObject*
   1.78%  minetestserver                   [.] lj_vm_next
   1.62%  minetestserver                   [.] MapNode::serializeBulk
   1.62%  minetestserver                   [.] server::ActiveObjectMgr::step
   1.58%  minetestserver                   [.] lj_str_new
```

It is possible that the Testserver is still busy with generating mapblocks and storing them in the database. So until MapSector::getBlocks goes down to a more reasonable value, I'll wait and see what the memory does at that point.

Edit: 3 hours later, with no one on the server, the virtual memory is still at 21.3GB and the reserved memory at 6.8GB. For comparison, the Main server requires 14.5GB virtual and 10.9GB reserved memory.

Member

> This would be pure coincidence, but I learned that `moremesecons_entity_detector` does not have any limit on range and people use it for mob farms and such. Since it just allocates a list of all entities it found, and people use it with ranges like `5000` and in the nether mob farms, could that be the culprit?

such large ranges shouldn't have a meaningful effect on long-term memory usage, but the performance considerations aren't great. see #3723 for relevant discussion.

Member

> server_unload_unused_data_timeout = 900

perhaps we should ask the engine folks to add a maximum memory usage limit for unused mapblock storage?


There is `client_mapblock_limit` with a default of 7500, regulating how many blocks a client can keep in memory. Implementing a similar measure on the server should be straightforward ... or maybe not. The client can afford to just discard a block, as the server will simply send it again. But the server must save 'dirty' blocks to the database properly, so block discarding may cause a lot of disk writes ...

Member

> There is `client_mapblock_limit` with a default of 7500, regulating how many blocks a client can keep in memory. Implementing a similar measure on the server should be straightforward ... or maybe not. The client can afford to just discard a block, as the server will simply send it again. But the server must save 'dirty' blocks to the database properly, so block discarding may cause a lot of disk writes ...

the # of mapblocks is not the same thing as the memory they use, which can vary wildly. w/ all the metadata that gets stored, single mapblocks can become performance bombs.
