Race condition at ChunkProviderer.getChunk?

hagish

New Member
I read through your code over the last few hours, and it is nice. Quite readable and "clean".
And now I have a question ;)

As far as I can tell there is at most one Runnable per Simulator Type holding its own _activeBlocks List.
But GrowthSimulator.executeSimulation and LiquidSimulator.executeSimulation can run concurrently and
access ChunkProviderer.getChunk concurrently (via LocalWorldProvider.getBlock).
So theoretically
Code:
        // ChunkProviderer.getChunk        
        c = _farChunkCache.get(id);
        if (c == null) {
            c = new Chunk(_parent, x, y, z);
        }

        _nearChunkCache.put(id, c);
can run in parallel and create the same Chunk multiple times. I could not find any mechanism to prevent this.

Am I getting this wrong, or is it guaranteed that chunks and blocks exist prior to running a simulation on them?
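If the race is real, one common way to make the lookup-and-create step atomic would be something like the following. This is only a sketch: I'm assuming the cache could be backed by a ConcurrentHashMap (the real _nearChunkCache/_farChunkCache types may differ), and I've dropped the _parent constructor argument for brevity.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: make chunk lookup-and-create atomic so that two
// simulator threads calling getChunk with the same id can never
// construct the same Chunk twice.
class ChunkCache {
    private final ConcurrentHashMap<Integer, Chunk> nearChunkCache = new ConcurrentHashMap<>();

    Chunk getChunk(int id, int x, int y, int z) {
        // computeIfAbsent runs the factory at most once per key,
        // even under concurrent calls with the same id.
        return nearChunkCache.computeIfAbsent(id, key -> new Chunk(x, y, z));
    }
}

// Minimal stand-in for the real Chunk class (no _parent argument here).
class Chunk {
    final int x, y, z;
    Chunk(int x, int y, int z) { this.x = x; this.y = y; this.z = z; }
}
```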
 

Cervator

Org Co-Founder & Project Lead
Contributor
Design
Logistics
SpecOps
Nice level of analysis! Kudos. I haven't tinkered with that part of the code, so I couldn't say, but I do know that we have a bug in that exact area of the code. We have it recorded as a memory issue, but maybe that area could use some more polish in general :geek:

https://github.com/MovingBlocks/Terasology/issues/231

There was also a liquid simulator crash-bug on exit, tho that was probably related to game state and fixed by implementing the Entity System - haven't seen it since

https://github.com/MovingBlocks/Terasology/issues/220

And welcome - thanks for the effort! Feel free to dig deeper in that area, we could use the review and any tweaks you'd like to do :)
 

Cervator

Org Co-Founder & Project Lead
Contributor
Design
Logistics
SpecOps
That is some really fancy looking code, and especially the math! Seems stable too, flew around for quite a while trying to provoke stuff. But then I haven't hit the cache failures some have experienced, and we don't have any solid per-build performance metrics, so it's hard to really get a feel for what changed :)

It looks great tho, so many thanks! We can always use more optimizations. You could feel the difference on your machine, with the spikes going away?
 

hagish

New Member
Yes, after the changes I got less lag flying around. But it is more gut instinct than scientific comparison.
Because my changes were quite near the core, you (= team Terasology) should double-check them.
 

Immortius

Lead Software Architect
Contributor
Architecture
GUI
Starting to look at implementing multiplayer, I'm going to do a refactor of the Chunk/ChunkProvider/WorldProvider/ChunkGenerator system. At the moment Chunk depends on LocalWorldProvider, which is a bit of an issue for introducing a RemoteWorldProvider.

A couple of things I've noticed
  • The chunk position hashcode is used to identify the chunk, especially when saving and loading it. This means the change to Vector3i hashcode broke existing saved games. I'm going to change it to just use the position itself, which avoids the issue with hashcode collisions as well (previously it was possible to load a previous chunk in a new position - if you walked around far enough).
  • Tested the improved hashcode, seems quite good. No collisions in the -500 <= x,y,z < 500 region, some collisions starting when that is extended to +/- 700. I would suggest using an extra bit from x and z to get the full 32-bits of the hash used.
  • There is probably a bit too much going on in Chunk. I think some of the activities like light calculation and setting neighbors dirty when necessary should be moved out into the WorldProvider or similar, which is at the correct level to know about multiple chunks.
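For illustration, a packed position hash along the lines suggested above might look like this. The exact bit widths (11 bits of x, 10 of y, 11 of z) are my assumption, not the actual Vector3i code - the point is just that using all 32 bits makes the hash collision-free over a larger region.

```java
// Illustrative sketch of a packed position hash: 11 bits of x, 10 bits
// of y, and 11 bits of z fill all 32 bits, so positions in roughly a
// -1024..1023 (x, z) by -512..511 (y) box hash without any collision.
// Not the actual Vector3i code; the bit widths are an assumption.
final class ChunkPos {
    static int packedHash(int x, int y, int z) {
        return ((x & 0x7FF) << 21) | ((y & 0x3FF) << 11) | (z & 0x7FF);
    }
}
```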
 

hagish

New Member
Immortius said:
The chunk position hashcode is used to identify the chunk, especially when saving and loading it. This means the change to Vector3i hashcode broke existing saved games. I'm going to change it to just use the position itself, which avoids the issue with hashcode collisions as well (previously was possible to load a previous chunk in a new position - if you walked about far enough).
I suspected something like this, but since I did not experience loading problems during testing, I did not dig deeper into it.
Depending only on the position, rather than on some not obviously connected Vector3i hash code, sounds more logical. Yes.

Immortius said:
Tested the improved hashcode, seems quite good. No collisions in the -500 <= x,y,z < 500 region, some collisions starting when that is extended to +/- 700. I would suggest using an extra bit from x and z to get the full 32-bits of the hash used.
Sounds reasonable. I will check this and adjust the code.
 

Immortius

Lead Software Architect
Contributor
Architecture
GUI
Looking through the world code, in general there appears to be a lack of concurrency safety. There's locks around chunk creation, and locks around mesh and rigid body updates, but nothing around reads or updates to blocks or lighting - and there are simulator threads interacting with these.

There's a few different tactics that could be used - force all writes through a single thread while allowing background threads to read and plan changes? Per chunk read-write locks? Full concurrency support built around check-and-swap mechanisms? Considerations include how lighting updates will be done (they go across multiple chunks, and propagate cell by cell), ensuring chunks are marked dirty if they're different from the mesh (even if it is still being generated), the far caching of chunks and having newly generated chunks safely attached to the world in the potential presence of changing lighting...
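For illustration, the per-chunk read-write lock option might look something like this (hypothetical names and a flat block array, not the actual Chunk class): many readers can proceed concurrently, but a writer excludes everyone.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of per-chunk read-write locking (hypothetical names, not the
// real Chunk class): simulator threads may hold the read lock together,
// while block writes take the exclusive write lock.
class LockedChunk {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final byte[] blocks = new byte[16 * 256 * 16];

    byte getBlock(int index) {
        lock.readLock().lock();
        try {
            return blocks[index];
        } finally {
            lock.readLock().unlock();
        }
    }

    void setBlock(int index, byte type) {
        lock.writeLock().lock();
        try {
            blocks[index] = type;
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

One consideration: lighting updates that propagate cell by cell across chunks would need a consistent lock ordering to avoid deadlock between neighboring chunks.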

It is going to be fun messing with all of this. :)
 

hagish

New Member
Yes, the lack of concurrency safety is quite general in the world code.

Just some thoughts:

Another solution would be to do some kind of chunk transactions (like optimistic-locking multiversion chunks).
A writing transaction would locally buffer its modifications during write access, while all other readers still see the unmodified chunks.
So everything could run in parallel, but each task must be capable of rerunning its transaction in case of a collision.
Knowing about the running transactions would be useful for chunk storing and unloading.
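A rough sketch of what such a transaction could look like (all names are hypothetical): writes are buffered locally, and commit succeeds only if no other transaction bumped the chunk version in the meantime - otherwise the caller reruns.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of an optimistic chunk transaction (hypothetical names).
// Reads see the committed chunk plus the transaction's own pending
// writes; commit is all-or-nothing and fails on a version conflict.
class VersionedChunk {
    volatile int version;
    final byte[] blocks = new byte[4096];
}

class ChunkTransaction {
    private final VersionedChunk chunk;
    private final int versionSeen;
    private final Map<Integer, Byte> pendingWrites = new HashMap<>();

    ChunkTransaction(VersionedChunk chunk) {
        this.chunk = chunk;
        this.versionSeen = chunk.version; // snapshot for conflict detection
    }

    byte read(int index) {
        Byte pending = pendingWrites.get(index);
        return pending != null ? pending : chunk.blocks[index];
    }

    void write(int index, byte type) {
        pendingWrites.put(index, type); // invisible to others until commit
    }

    boolean commit() {
        synchronized (chunk) {
            if (chunk.version != versionSeen) {
                return false; // collision: another transaction committed first
            }
            for (Map.Entry<Integer, Byte> e : pendingWrites.entrySet()) {
                chunk.blocks[e.getKey()] = e.getValue();
            }
            chunk.version++;
            return true;
        }
    }
}
```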

Another approach would be to partition by chunks and serialize all tasks on one chunk. But then a slow simulation task could block the mesh updates. It could be necessary to differentiate between general simulation stuff (liquid, erosion, light?) and "faster" low level operations (destroying a block, light?).
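The partition-by-chunk idea, sketched (hypothetical names): all tasks for one chunk go through that chunk's own single-thread executor, so they run in submission order and never overlap, while different chunks proceed in parallel.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch of per-chunk task serialization (hypothetical names): one
// single-thread "lane" per chunk id serializes all work on that chunk.
class ChunkTaskSerializer {
    private final Map<Long, ExecutorService> lanes = new ConcurrentHashMap<>();

    void submit(long chunkId, Runnable task) {
        lanes.computeIfAbsent(chunkId, id -> Executors.newSingleThreadExecutor())
             .submit(task);
    }

    void shutdownAndWait() throws InterruptedException {
        for (ExecutorService lane : lanes.values()) {
            lane.shutdown();
            lane.awaitTermination(5, TimeUnit.SECONDS);
        }
    }
}
```

This shows the drawback mentioned above directly: one slow simulation task delays everything queued behind it on the same chunk, which is why splitting slow simulation work from fast low-level operations might be necessary.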

Is there currently anything to prevent "cascading" simulations from spawning the whole world?
 

Cervator

Org Co-Founder & Project Lead
Contributor
Design
Logistics
SpecOps
Not sure about cascading, I'd think not for light as that would be pretty dramatic, while liquid is probably too limited to trigger anything like that.

On IRC the other day regions came up, I understand Spout is using that to sub-divide the world further. We might want to consider that for future-proofing, and I figure that would also impact the locking setup since different regions could run independently?

One region being some number of chunks in a square / cube (the latter if we change the number of y chunks from 1)
 

Immortius

Lead Software Architect
Contributor
Architecture
GUI
Well, as long as regions are connected I don't see any immediate benefit to further divisions beyond chunks. Maybe within the far cache regions could be used to determine which chunks are cached in memory and which are left on disk. It might be useful to assemble temporary "regions" of chunks for light propagation as well, I guess.

hagish said:
Is there currently anything to prevent "cascading" simulations from spawning the whole world?
The current system has a concept of "fresh" chunks. These are chunks that exist but haven't been generated yet. While anything can create a new chunk, only the renderer currently can cause a chunk to be generated. I'm kind of tossing up replacing fresh chunks with semi-generated chunks - ones that have had any local features generated, but not features that can overlap chunks (like trees) or their lighting connected outwards.

So the chunk lifecycle would be something like:

Created as Semi-Generated - has basic terrain, internal lighting. Generated disconnected from the world by generators working directly on the chunk. Not eligible for rendering or communicating to clients. Can be updated through the world though.

-> Fully Generated - has (potentially) chunk-boundary crossing features generated, and lighting has been propagated out of it. The features are generated by generators working against the world rather than directly on the chunk (so using the same interfaces as any other block changing process).

-> Fully Connected - surrounded by fully generated chunks. Eligible for rendering and communication. Only at this point is a chunk correctly lit. Surrounded means all 8/26 chunks (2D/3D) around it. This means that to get a single renderable chunk requires a layer of 8 fully generated, surrounded by a layer of 16 semi generated chunks - or if we add vertical chunks, 26 fully generated and 98 semi generated. Although if we have vertical chunks we would drop the vertical height of each chunk I guess?
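The lifecycle above could be captured as an explicit state enum, roughly like this (the names are illustrative, not from the actual codebase):

```java
// Sketch of the proposed chunk lifecycle as explicit states
// (names are illustrative, not from the actual codebase).
enum ChunkState {
    SEMI_GENERATED,   // basic terrain + internal lighting, disconnected
    FULLY_GENERATED,  // cross-chunk features done, lighting propagated out
    FULLY_CONNECTED;  // surrounded by fully generated chunks; renderable

    // A chunk is eligible for rendering/communication only when
    // fully connected - and only then is it correctly lit.
    boolean eligibleForRendering() {
        return this == FULLY_CONNECTED;
    }

    // A chunk may become fully connected only when all of its
    // 8 (2D) / 26 (3D) neighbors are at least fully generated.
    static boolean canConnect(ChunkState[] neighbors) {
        for (ChunkState n : neighbors) {
            if (n == SEMI_GENERATED) {
                return false;
            }
        }
        return true;
    }
}
```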

Preferably simulations should be set up to work only on fully connected chunks, and retain information for simulating unloaded chunks until they are next loaded, or reanalyze them then. So generators should be hooked into the chunk lifecycle and receive events as chunks are loaded and unloaded.

I do like that changes are done through a World interface in the current system, that automatically handles lighting propagation and hides the existence of chunks from clients. Basically it is just changing blocks and I wouldn't want to lose that. Simplest fix I can think of is to change from setBlock(Vector3i pos, Block type) to boolean setBlock(Vector3i pos, Block newType, Block oldType), with it failing cleanly if oldType doesn't match the existing block. Oh, and something like changeBlocks(Vector3i[] positions, Block[] newBlockTypes, Block[] oldBlockTypes) to handle multiple at once. An alternative that may be good for longer running background processes is a transaction proxy through which state can be queried and changes requested, before being committed to the world. The proxy can keep track of chunk versions or do whatever magic is necessary for optimistic concurrency.
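For illustration, the check-and-set interface could behave like this sketch. Strings stand in for the real Vector3i and Block classes; the point is the signatures: a write fails cleanly when the caller's expected old type no longer matches.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the proposed check-and-set block interface. Strings stand
// in for the real Vector3i/Block classes; "air" is the assumed default.
class BlockWorld {
    private final Map<String, String> blocks = new HashMap<>();

    private String typeAt(String pos) {
        return blocks.getOrDefault(pos, "air");
    }

    synchronized String getBlock(String pos) {
        return typeAt(pos);
    }

    // Fails cleanly if oldType no longer matches the existing block.
    synchronized boolean setBlock(String pos, String newType, String oldType) {
        if (!typeAt(pos).equals(oldType)) {
            return false; // lost the race: someone changed the block first
        }
        blocks.put(pos, newType);
        return true;
    }

    // Multi-block variant: applies all changes only if every old type
    // still matches, otherwise changes nothing.
    synchronized boolean changeBlocks(String[] positions, String[] newTypes, String[] oldTypes) {
        for (int i = 0; i < positions.length; i++) {
            if (!typeAt(positions[i]).equals(oldTypes[i])) {
                return false;
            }
        }
        for (int i = 0; i < positions.length; i++) {
            blocks.put(positions[i], newTypes[i]);
        }
        return true;
    }
}
```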

Begla, if you're about, any thoughts/comments? Am I missing something?
 

Cervator

Org Co-Founder & Project Lead
Contributor
Design
Logistics
SpecOps
Sounds good to me. My mention of regions was mostly thinking multiplayer, when you might have players far from each others, on different unconnected regions. Seems nice to be able to have the server process them separately :)
 

hagish

New Member
Yes, chunk life-cycles sound good, and keeping the simple get/setBlock interface would be very nice.
The multi-block get/set with the old-type check is a good way to abstract away the locking issue.
Theoretically there could be an ABA update problem, but I think it is completely unimportant, since this is a game, not a bank.
 

Immortius

Lead Software Architect
Contributor
Architecture
GUI
Cervator said:
Sounds good to me. My mention of regions was mostly thinking multiplayer, when you might have players far from each others, on different unconnected regions. Seems nice to be able to have the server process them separately :)
Ah yes. I was thinking of doing something in this space - at the moment the system maintains a set of chunks around the camera's location, but in general it should maintain a number of regions of chunks, around the positions of multiple entities (players in a multiplayer game).
 

Immortius

Lead Software Architect
Contributor
Architecture
GUI
I'm going to try a slightly different chunk lifecycle - while the one I suggested above was ok, it is apparent from testing that it is important to do as much as possible before generating the initial lighting - sunlight updates are a bit slow (maybe my algorithm is less than ideal, maybe there don't really need to be quite so many thousands of block updates per tree). A batch update system would also help, but it's easier to just skip it entirely during generation.

So instead I've switched to:
* Single chunk block generation
* Adjacency group block generation (involving a chunk and all adjacent chunks)
* Single chunk light generation
* Patch chunk lighting to adjacent
* Await surrounded by properly lit chunks
* Complete

These are all running in background threads, with locking while a chunk is modified.
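The batched light update mentioned above could work roughly like this (an illustrative 2D flood-fill sketch, not the actual algorithm): instead of re-propagating after every single block change, all changed cells are queued as seeds and spread in one breadth-first pass, with light dropping by one per step.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Illustrative sketch of batched light propagation (2D for brevity, not
// the actual algorithm): seeds are the cells whose light changed; one
// breadth-first pass spreads light outwards, stopping wherever it
// cannot brighten a cell.
class LightBatch {
    static void propagate(int[][] light, Queue<int[]> seeds) {
        int h = light.length, w = light[0].length;
        int[][] dirs = {{1, 0}, {-1, 0}, {0, 1}, {0, -1}};
        while (!seeds.isEmpty()) {
            int[] c = seeds.poll();
            int next = light[c[0]][c[1]] - 1; // light decays by 1 per cell
            if (next <= 0) {
                continue;
            }
            for (int[] d : dirs) {
                int y = c[0] + d[0], x = c[1] + d[1];
                if (y >= 0 && y < h && x >= 0 && x < w && light[y][x] < next) {
                    light[y][x] = next;         // brighten, then keep spreading
                    seeds.add(new int[]{y, x});
                }
            }
        }
    }
}
```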
 

hagish

New Member
And simulations run on completed chunks?
So a multi-chunk simulation step (bleeding into the other chunks) requests adjacent chunks and operates on itself and its direct neighbors? Similar to light updates, then, but after all the "low-level" loading and creation work is done.
 

Immortius

Lead Software Architect
Contributor
Architecture
GUI
In my mind, simulations and anything else changing the world should be limited to only complete chunks, and work either through the world, or possibly subviews of the world (congruent subregions). So they shouldn't be chunk aware, except in needing to know whether an area of the world is available to be worked on, and getting it loaded/created if it is not.
 