OOM #4943

Open
opened 2023-07-15 12:33:41 +00:00 by AliasAlreadyTaken · 18 comments

[711040.808986] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/cron.service,task=minetestserver,pid=1894647,uid=1001
[711040.809254] Out of memory: Killed process 1894647 (minetestserver) total-vm:54905192kB, anon-rss:38028856kB, file-rss:2304kB, shmem-rss:0kB, UID:1001 pgtables:106184kB oom_score_adj:0
[711044.436208] oom_reaper: reaped process 1894647 (minetestserver), now anon-rss:416kB, file-rss:792kB, shmem-rss:0kB

> pgtables:106184kB

Could turning on or tweaking (transparent) huge pages help here?

I assume this is because of the voice attack earlier?

Member

Before the server crashed I saw Ravise fighting with three Voicers, ran there and saw they were all frozen. I was able to run around there and out of the danger zone for at least twenty seconds, because I knew what would happen soon, before the server went down.

Member

most minetest mods are pretty memory conscious, because luajit used to have a 1GiB limit. it no longer has that limit, but it'd be useful to see how much memory lua was using before the crash, before blaming a mod for this issue. we log how much memory lua is using every 30 seconds or something.

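For reference, a minimal sketch of what such periodic logging could look like, using the standard `collectgarbage("count")` call from a globalstep. The 30-second interval and the log message format are assumptions for illustration, not the actual YL logging code:

```lua
-- hypothetical sketch: log the Lua heap size every 30 seconds
-- (interval and message format are assumptions, not the real YL logging code)
local LOG_INTERVAL = 30
local timer = 0

minetest.register_globalstep(function(dtime)
	timer = timer + dtime
	if timer < LOG_INTERVAL then
		return
	end
	timer = 0
	-- collectgarbage("count") returns the memory currently used by Lua, in KiB
	local kib = collectgarbage("count")
	minetest.log("action", ("lua is using %.1f MB"):format(kib / 1024))
end)
```
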
flux added the kind/bug and prio/critical labels 2023-07-17 03:18:46 +00:00
Author
Owner

![grafik](/attachments/71c56549-41ac-451e-9aaa-f050e992fd70)
Author
Owner

![grafik](/attachments/3246b4bf-28e8-4268-8407-d251ad3ec29a)
Author
Owner
1. Is there a way to see the memory allocation of the engine compared to that of the mods?

Most likely yes, that's what flux did with the logs that say "lua is using ... MB". But ...

2. Is there a way to see the memory allocation of a single mod?

3. What's lj_alloc_free and release_unused_segments? What's gc_sweep?

Member
> 1. Is there a way to see the memory allocation of the engine compared to that of the mods?

we can see what lua has allocated. i doubt we have a major lua memory leak. we log how much memory the lua environment is using already, see #4249. if you've got privs, you can also see how much memory lua is using w/ the `/memory` command. i don't remember ever seeing YL getting above 1GiB.

> 2. Is there a way to see the memory allocation of a single mod?

while implementing the spawnit mod, i've created a tool to *approximate* how much memory a particular data structure is using. i can't guarantee how accurate it is, and it's *very* expensive to use on complicated data structures.

see https://github.com/fluxionary/minetest-futil/blob/main/util/memory.lua

> 3. What's lj_alloc_free and release_unused_segments? What's gc_sweep?

i don't know what any of these terms mean exactly rn. the terms seem like things i should get to know better. gc_sweep makes me think of java and lisp, though *shrug*

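For readers who don't want to dig through futil, here is a rough, hypothetical sketch of the general idea behind approximating a table's memory footprint. The per-value cost constants are guesses, and this is not the actual futil code:

```lua
-- naive, hypothetical estimator: walk a table and sum guessed per-value costs
-- (the constants below are rough guesses, not measurements of luajit internals)
local function approx_size(value, seen)
	seen = seen or {}
	local t = type(value)
	if t == "boolean" or t == "number" or t == "nil" then
		return 16
	elseif t == "string" then
		return 24 + #value
	elseif t == "table" then
		if seen[value] then
			return 0 -- don't count shared tables twice
		end
		seen[value] = true
		local total = 40
		for k, v in pairs(value) do
			total = total + approx_size(k, seen) + approx_size(v, seen)
		end
		return total
	end
	return 0 -- functions, userdata etc. are ignored here
end
```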

gc_sweep would probably be part of the garbage collector in lua

Member
> 3. What's lj_alloc_free and release_unused_segments? What's gc_sweep?

Those 3 are luajit functions:

```c
/* Partial sweep of a GC list. */
static GCRef *gc_sweep(global_State *g, GCRef *p, uint32_t lim) { ... }
```

As the comment says. Seems to come from an "incremental mark and sweep garbage collection strategy".

```c
static LJ_NOINLINE void *lj_alloc_free(void *msp, void *ptr) { ... }
```

No comment, but my guess: marks an allocated memory chunk as unused and consolidates it with surrounding unused chunks.

```c
/* Unmap and unlink any mmapped segments that don't contain used chunks */
static size_t release_unused_segments(mstate m) { ... }
```

Purely just speculation:

If we are sure (are we?) that it's not lua memory usage, maybe memory is leaking on the C side of things, but caused by some lua code creating massive amounts of objects (that are allocated/freed lua-side?). Flux mentioned entities trying to spawn in inactive mapblocks and disappearing, can that be the cause? Maybe some mod spamming `add_entity()` on each globalstep if it fails? Or something similar?

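To illustrate the failure mode speculated about above: `minetest.add_entity()` returns nil when the spawn fails (for example because the target mapblock isn't loaded), and a mod that blindly retries every globalstep keeps doing work for nothing. A hedged sketch of a back-off guard, where the position, entity name and timings are made up for the example:

```lua
-- hypothetical example of guarding a retried spawn instead of spamming it
local retry_after = 0

minetest.register_globalstep(function(dtime)
	retry_after = retry_after - dtime
	if retry_after > 0 then
		return
	end
	local pos = vector.new(100, 20, 100) -- example position
	local obj = minetest.add_entity(pos, "mobs:sheep") -- example entity name
	if not obj then
		retry_after = 10 -- spawn failed, e.g. block not loaded: back off
	end
end)
```
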
Member

This would be pure coincidence, but I learned that `moremesecons_entity_detector` does not have any limit on range and people use it for mob farms and such. Since it just allocates a list of all entities it found, and people use it with ranges like `5000` and in the nether mob farms, could that be the culprit?

Or am I missing something?

Since we discussed it in chat, Chache reported it:
#4988

UPD: tried creating massive amounts of detectors on the test server. It lags the server, but I can't see any other effects by just looking at dtime. Will need Alias to run perf and top or something X)

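If the detector range does turn out to matter, a simple mitigation on the mod side would be to clamp the radius before querying. A minimal sketch, assuming the detector ultimately calls `minetest.get_objects_inside_radius()`; the cap value below is made up:

```lua
-- hypothetical clamp on the detector radius before querying for objects
local MAX_DETECTOR_RADIUS = 16 -- assumed cap, not an actual moremesecons setting

local function find_entities_capped(pos, radius)
	radius = math.min(radius, MAX_DETECTOR_RADIUS)
	return minetest.get_objects_inside_radius(pos, radius)
end
```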

Not sure about the specific implementation of the Lua GC, but garbage collectors often run in background threads or intermittently. So it could be just coincidence, although one of the symptoms of getting close to the memory consumption limit is that the GC runs often and sometimes takes relatively long, trying hard to free some memory.

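One way to tell collectible garbage apart from live data on the Lua side is to compare the heap size before and after forcing a full collection. A minimal sketch using the standard `collectgarbage()` API; where and how this gets triggered (chat command, console, etc.) is left open:

```lua
-- hypothetical check: how much of the Lua heap is actually reclaimable?
local before = collectgarbage("count") -- KiB currently used by Lua
collectgarbage("collect")              -- force a full GC cycle
local after = collectgarbage("count")
minetest.log("action",
	("lua heap: %.1f MB -> %.1f MB after full collect"):format(before / 1024, after / 1024))
```
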
Member

We spent a whole day running an array of almost 2000 entity detectors on the test server. It caused dtime to depend heavily on the number of active entities (as expected), but did not seem to cause any leaks - memory usage also just corresponded to the number of mobs.

Kinda obvious results, but at least that was checked.

Author
Owner

I'd like to add "the release of unused mapblocks" to the list of suspects.

The release of unused blocks is governed by this setting:

```
# How long the server will wait before unloading unused mapblocks, stated in seconds.
# Higher value is smoother, but will use more RAM.
# type: int min: 0 max: 4294967295
# server_unload_unused_data_timeout = 29
```

Our setting is `server_unload_unused_data_timeout = 900`, that's 15 minutes. So (after a restart) I teleported two accounts to random locations with this script:

```lua
yl_events.timer = 0
yl_events.stopme = false
yl_events.teleport = function(dtime)
	if yl_events.stopme == true then return end
	yl_events.timer = yl_events.timer + dtime
	if yl_events.timer < 5 then
		return
	end
	yl_events.timer = 0
	local objs = core.get_connected_players()
	for _, obj in ipairs(objs) do
		local x = math.random(-30000, 30000)
		local y = math.random(-30000, 30000)
		local z = math.random(-30000, 30000)
		local v = vector.new(x, y, z)
		obj:set_pos(v)
	end
end
minetest.register_globalstep(yl_events.teleport)
```

This resulted in a growing number of blocks loaded into memory (and mapgen running wild) and also a growing memory usage, both in a linear fashion. Maximum memory was 21.5GB virtual and 19.4GB reserved memory for Minetest_test and maximum number of loaded blocks around 1 Million.

![grafik](/attachments/5f858a5a-4822-453d-8fc4-8d02b2a6154a)

While the number of loaded mapblocks stayed pretty much constant after the initial "filling phase", the memory did not return to anything below those maximum values, even though I logged off both accounts.

perf top:

```
  19.51%  minetestserver                   [.] MapSector::getBlocks
   5.77%  minetestserver                   [.] Map::timerUpdate
   5.54%  minetestserver                   [.] lj_tab_get
   1.97%  minetestserver                   [.] LuaEntitySAO::step
   1.93%  minetestserver                   [.] ServerMap::save
   1.81%  minetestserver                   [.] std::_Function_handler<void (ServerActiveObject*), ServerEnvironment::step(float)::{lambda(ServerActiveObject*
   1.78%  minetestserver                   [.] lj_vm_next
   1.62%  minetestserver                   [.] MapNode::serializeBulk
   1.62%  minetestserver                   [.] server::ActiveObjectMgr::step
   1.58%  minetestserver                   [.] lj_str_new
```

It is possible that the Testserver is still busy with generating mapblocks and storing them in the database. So until MapSector::getBlocks goes down to a more reasonable value, I'll wait and see what the memory does at that point.

Edit: 3 hours later, with no one on the server, the virtual memory is still at 21.3GB and the reserved memory at 6.8GB. For comparison, the Main server requires 14.5GB virtual and 10.9GB reserved memory.

Member

> This would be pure coincidence, but I learned that `moremesecons_entity_detector` does not have any limit on range and people use it for mob farms and such. Since it just allocates a list of all entities it found, and people use it with ranges like `5000` and in the nether mob farms, could that be the culprit?

such large ranges shouldn't have a meaningful effect on long-term memory usage, but the performance considerations aren't great. see #3723 for relevant discussion.

Member

> server_unload_unused_data_timeout = 900

perhaps we should ask the engine folks to add a maximum memory usage limit for unused mapblock storage?


There is `client_mapblock_limit` with a default of 7500, regulating how many blocks a client can keep in memory. Implementing a similar measure on the server should be straightforward ... or maybe not. The client can afford to just discard a block, as the server will simply send it again. But the server must save 'dirty' blocks to the database properly, so block discarding may cause a lot of disk writes ...

Member

> There is `client_mapblock_limit` with a default of 7500, regulating how many blocks a client can keep in memory. Implementing a similar measure on the server should be straightforward ... or maybe not. The client can afford to just discard a block, as the server will simply send it again. But the server must save 'dirty' blocks to the database properly, so block discarding may cause a lot of disk writes ...

the # of mapblocks is not the same thing as the memory they use, which can vary wildly. w/ all the metadata that gets stored, single mapblocks can become performance bombs.
