Per-block data storage

Cervator

Org Co-Founder & Project Lead
Contributor
Design
Logistics
SpecOps
Name: Per Block Data Storage
Summary: What core data do we store at a per-block level? Block ID, light values, extras
Scope: Engine
Current Goal: Refactored central data storage, including support (but not necessarily functionality) for Block dynamics and secondary block resources (oil, gas, moisture, plant growth resources...). Possibly making light storage more multi-purpose (for solid blocks). Possibly framework for mods to register new per-block data values (expensive in memory usage)
Phase: Design
Curator: Cervator
Related: #137, #139, #140

Making an issue for this in GitHub - https://github.com/MovingBlocks/Terasology/issues/139 - and leaving details and discussion here :)

Current per-block arrays in Chunk.java:

* One-byte Block ID - this may fall short of capacity in the long-term, especially if Custom Blocks (special shape, possible composite texture - created by a player) need unique block IDs
* Half-byte Sunlight - lighting calculations
* Half-byte Light - like-wise (I figure, anyway!)
* Half-byte States - currently used for tracking how far liquid can flow before it stops

Begla on IRC at some point brought up an issue about how Java deals with memory storage for small stuff, and without wanting to put details here since I don't remember exactly there was something about ending up using 4 bytes anyway. So under some circumstances we might want to just use an int and do something fancy to store several things in it at once. For reference a 16x16x256 chunk with a single int per block would use 262,144 bytes of memory - which is both scary and sooo tempting. Mix in the possibility to alter the chunk size (perhaps even by the up/down axis) and who knows...

Either way, we'll likely end up changing the current setup, so thread here to discuss what we need and how to do it! My personal stance / wish list:

* Possibly using a short for block ID to not get cramped by the byte limit. Or if parsing an int, 12 bits would probably be enough (4096 values)
* Changing the state concept into a secondary "resource" value per block, exact meaning depending on context. Making an issue for this system too. Examples: Visible liquid height in liquid blocks (or partial blocks like stairs / slopes), moisture in air (weather potential), moisture in dirt (growth system / fun with aquifers), oil and gas in deep minerals (bring your mining canary). Really looking forward to this one - and it makes perfect sense as it does something in every block type.
* Unsure about this next one, but - block integrity persisted? viewtopic.php?f=4&t=171&p=977#p977 (EDIT way later: This was either this or that thread) explains the concept, but not sure yet whether the integrity values should be calculated as needed (rarely) or stored (as they could be left on disk most the time). Liquid nibble + integrity byte = mudslides :D

Latter point brings up another part of the puzzle. Do we have two types of data - the always-loaded in memory (chunk block IDs) and some completely separate block meta-data only loaded as needed? Also keeping in mind the potential to store some stuff (creatures & concepts) as entities in an ES instead.
 

Immortius

Lead Software Architect
Contributor
Architecture
GUI
Are you thinking something along the lines of

n bits - block type (probably > 8)
2 bits - liquid content determinator (none, water, magma, oil?)
n bits - liquid level and/or flow info
1 bit - light determinator (0 means solid translucent, 1 means can be lit)
if light
4 bits sky lit level
4 bits block lit level
else
8 bits to use for something interesting but only for solid blocks?

Edit: Although some of those bits are derivable from the block type and can be skipped.
 

Cervator

Org Co-Founder & Project Lead
Contributor
Design
Logistics
SpecOps
Yeah something like that. So I guess we might actually be looking at declaring one "primary" block data variable that is always going to be loaded and available for fast look-up, and any potential secondaries (like block integrity, that isn't always needed)

I suppose it is also a bit about balancing the convenience of accessing simple types (like the current TeraArray + TeraSmartArray) vs building a bigger (TeraMegaArray?) one with more functions to retrieve specific pieces out of one variable? Is it faster to use separate nibble arrays?
 

Kai Kratz

Witch Doctor
Contributor
Architecture
Logistics

Cervator

Org Co-Founder & Project Lead
Contributor
Design
Logistics
SpecOps
Bumping this as I'm posting Block Dynamics stuff. At first to support that it will be easiest to use the existing data structures (TeraArray + TeraSmartArray), tho I do believe protobuf has popped up elsewhere - might be good with a refresher here on what we're starting to use it for and how it could be used per-block?

Also, I might have posted this elsewhere, but I wanted to drop in logs of a suggestion T3 made on IRC, might work as a front-end to the protobuf stuff? Are we thinking we can skip bit manipulation at this point?

[11] <t3hk0d3> Cervator: you just store data separately
[11] <t3hk0d3> it would consume more memory
[11] <t3hk0d3> but would be faster than shifting bits each time
[11] <t3hk0d3> like
[11] <t3hk0d3> Metadata metadata = worldProvider.getBlockMetadata(x, y, z);
[11] <t3hk0d3> int orientation = metadata.get(Metadata.ORIENTATION);
[11] <t3hk0d3> lookups is minimal because of offsets and hashmaps
 

dei

Java Priest
Contributor
Design
Art
Seems like this thread is missing from the old board, or am I too dumb to use the search-engine?
http://board.movingblocks.net/viewtopic.php?f=4&t=173

The current architecture, as far as I understand it, uses simple byte-arrays and saves everything in the world including air-blocks. (its always 256 blocks high)

I fully understand that octrees and the like are way to far to implement in such an early phase, but I have a simple AND effective proposition:

As arrays use their full allocated memory it also uses 256 bytes * 16 * 16 for each chunk. The ChunkGenerator generates in average a world which is filled up to around 128 (didn't count exactly) with real blocks. The Array elements from 128 to 256 are mostly left at their default value = 0 = air. So above y-coordinate 128 we have a very sparse array of some blocks which waste the same amount of memory as all the other blocks use.

If we could replace the datastructure above 128 with something more memory efficient we could nearly save 1/2 of overall memory (we simply only save blocks other than air). I developed a SparseArray structure which could drop in for the upper half of the world posted under (http://pastebin.com/HQgeHjWR). It has (compared to a HashMap) faster read times but slower random-write-time (continous-write is fast too for map-generation).

What do you think of it (idea and rough implementation)? Did I understand something wrong and its already optimized in trunk? Is it worth to integrate into my Terasology-Branch (could make it switchable)?

EDIT: Forgot to mention. Uses a presorted array with binary search. My performance and memory calculations will follow tomorrow, but are looking really good...
 

Cervator

Org Co-Founder & Project Lead
Contributor
Design
Logistics
SpecOps
Found it! Old thread was hiding on page 3 in "General Dev" :)

*activates magic thread merge powers*

In the future I suspect we'll break with the 256 tall high world, so basing it on the top half of the current world probably would go funny. However, it does seem likely that there'll be big spaces of air in places. If we end up with smaller chunks (say 32x32x32) you might have whole chunks of pure air. There is probably some optimization potential there if we really want to worry about it - at the cost of complexity, however. That ends up sounding a bit like octrees.

Non-air examples might be trickier, as I expect we'll be moving to a geological setup where "dirt" and "stone" are classifications rather than blocks (using layers of fancier minerals etc instead). So I don't jump on "Hey we could use this for dirt and stone too!" right away.

In conclusion: I dunno? I like the idea of doing something like that but am not really on top of advanced performance tweaks like that vs cost in the form of more sophisticated code.

A switchable integration might well be worth adding so we have the option to use it and experiment. Kudos for trying it out and offering it up :)

Edit: Begla popped up on IRC for a bit - seemed he was happy to see a sparse array thingie pop up!

Edit2: Reminding myself you're cephos on GitHub with a linky to the issue you hit: https://github.com/MovingBlocks/Terasology/issues/139#issuecomment-5886341
 

Immortius

Lead Software Architect
Contributor
Architecture
GUI
You might have missed this but because you have a boolean[] the size of the chunk, and a boolean uses a byte in memory, you are using at a minimum the same amount of memory as the current simple byte[] solution. :) You should replace that array with a BitSet or similar.

Other than that looks good, although I would have to see metrics comparing it against the current TeraArray. While reducing memory size is nice, read speeds are particularly crucial for the in memory structure.

What would be nice is if we had a common interface that both this and TeraArray implemented, so we can mix and match implementations easily (and develop new ones against it).

There is a semi-bug in your bounds checking btw - (18,5,1) will result in a valid index even though x is out of bounds.
 

woodspeople

Member
Contributor
Design
Note: if my tree growth generator is ever to work in TS, non-air blocks will need:

for every vertical block column:
- index of light falling from sun (should vary by biome/chunk and be impacted by cloud layer if any)

for every non-air surface (not below non-air block) block:
- index of shading above block (calculated something like daily by evaluating number/types of blocks above it)

for every non-air block:
- index of water content
- index of mineral content, of whatever kind of mineral is in that block (assume no more than one possible mineral per block)

Ideally each of these indexes would need a range of 0-100, or 0-256, to have interesting variation. 0-64 would be enough, if slightly less fun.

Also each tree block will need to be linked to an object (tree part) in a tree it belongs to. The link could work one way (tree objects know, tree blocks don't). That would require just-in-time processing but not much. If a user whacks at a tree block it would probably not take long to do a part lookup as long as there was a sort of tree-trunk listing per chunk to start with.
 

dei

Java Priest
Contributor
Design
Art
thanks for the hint to use bitset, it seems to differ which jvm you use, (or which oppinion you believe). On http://stackoverflow.com/questions/1907318/java-boolean-primitive-type-size they write that boolean[] really uses 1 bit per boolean + a little overhead of course. Perhaps the writer of your article confused something with single booleans which in fact use 32/62 bits but i'll test it using a bitset fits better..

The amount of water would be the same as humidity for the world generator, right? I don't see any problems inserting this additional data into my structure. Requires the same amount of memory per block that is saved as long as air blocks have no properties.

As a side note: Blockindices always should use a power of 2 for that the compiler or JVM is able to optimize address calculations to much more efficient (on most processors) bitshifts. So 128 would fit best in your case

Defining a clean interface for persistence seems reasonable. I'll define it while including my sparseArray.
 

Immortius

Lead Software Architect
Contributor
Architecture
GUI
for every vertical block column:
- index of light falling from sun (should vary by biome/chunk and be impacted by cloud layer if any)
We currently have sunlight index of 0-15, that is 15 anywhere directly under the sky and fading to zero with distance (so into a cave for instance). A per biome modifier can be applied in the tree system if desired, we don't really have clouds at the moment but obviously can be applied as a modifier as well. Day/night cycle also applies a modifier, which is why an area with a sunlight value of 15 appears dark at night.

for every non-air surface (not below non-air block) block:
- index of shading above block (calculated something like daily by evaluating number/types of blocks above it)
In addition to the sunlight index we have a light index, also between 0-15. This covers all other sources of light. This is also needed for above ground blocks due to night.

Both of these indexes only change in relation to a block change - if a placed block obstructs light, or provides light, or a removed block allows light to pass for instance. There's no need to recalculate these otherwise, can just apply situational modifiers. Together they form a byte of data per block.

Ideally each of these indexes would need a range of 0-100, or 0-256, to have interesting variation. 0-64 would be enough, if slightly less fun.
Those ranges are kind of large. 0-7 or 0-15 or even less would be preferable. DF for instance gets away with a range of 0-7 for water level (and probably uses another bit to switch between water and lava). Is there really enough difference in behavior between a mineral content of 24 and 25?

That said we should make it possible for mods to add whatever data they want at the per-block level. Just needs to be recognized this comes with a cost.

Actually, something that might be worth considering is just having a reasonable number of different types of blocks (dirt, clay, gravel, sand, rocky dirt, etc) with various data on their mineral content, how porous they are and so forth, and dispensing with mineral levels entirely. One benefit is when a player is constructing a farm or garden, they have a very visible indicator of the quality of their soil (also avoids the issue of having to track mineral levels in held blocks). And if the soil should change its nature for some reason, this would be done through a block swap and thus also be very visible.

Also each tree block will need to be linked to an object (tree part) in a tree it belongs to. The link could work one way (tree objects know, tree blocks don't). That would require just-in-time processing but not much. If a user whacks at a tree block it would probably not take long to do a part lookup as long as there was a sort of tree-trunk listing per chunk to start with.
Yes. At the moment this sort of thing is handled by creating a look up index of block position -> entity, just accounting for blocks with entities, but some sort of sparse spatial index would be preferable.
 

Immortius

Lead Software Architect
Contributor
Architecture
GUI
On http://stackoverflow.com/questions/1907318/java-boolean-primitive-type-size they write that boolean[] really uses 1 bit per boolean + a little overhead of course.
The top answer pretty clearly states one byte per boolean in an array. :) Certainly can vary per jvm, but considering the performance cost of access values outside of standard memory alignments I would imagine they would use a byte per boolean and leave it to the user to choose when they want per bit storage, via bitset.
 

begla

Project Founder and Lead Developer
Contributor
Architecture
Logistics
Keeping some values as half bytes is one of the key concepts that keeps the overall memory usage down. Storing light values as bytes and all other values too, would result in a far too high memory usage even using a sparse array like data structure.

At least Java uses bytes to store boolean values by an internal optimization of the JVM. Using C++ even booleans are mostly kept as 32-bit values. But it makes total sense when thinking about the architecture of todays CPUs. But as Immortius said... Hail the bitmasks and bitfields! ;) Bitmasks are especially nice because up to 32 or 64 different bits can be kept in the close registers of the CPU. I'm not familiar how much Java-sided optimizations are going on there, but using C++ this can be an significant optimization in the wider context.

Otherwise the sparse array looks okay to me, but I'm concerned that the performance could have an serious impact. Could you post some small benchmarks in comparison to the default data structures?

Regarding the range-check-bug in the TeraArray... Maybe removing range checks in total might be a wise decision. Those are a lot of checks for each access to the array. I was just too lazy at the time to make sure that each of the high-level systems keeps its fingers out of Oblivion. :)
 

Immortius

Lead Software Architect
Contributor
Architecture
GUI
Regarding the range-check-bug in the TeraArray... Maybe removing range checks in total might be a wise decision. Those are a lot of checks for each access to the array. I was just too lazy at the time to make sure that each of the high-level systems keeps its fingers out of Oblivion. :)
Yeah, that is probably a good idea. It is kind of nice to have some checks there during development though - maybe they could be switched to an assert so they can be disabled.

With the world refactor I'm doing at the moment, I've added a "WorldView" class that provides a view into a small region of chunks and takes care of delegating gets and sets to the correct chunk. One advantage is that the view can be offset - a common usage is to get a region of 9 chunks, and treat the middle chunk as the origin chunk for the purpose of processing.
 

begla

Project Founder and Lead Developer
Contributor
Architecture
Logistics
Sounds very nice! Hopefully I'll find the time to dig through the massive amount of changes soon. Caught myself searching for the TeraArray earlier for a little while honestly. :eek:
 

woodspeople

Member
Contributor
Design
Immortius said:
Those ranges are kind of large. 0-7 or 0-15 or even less would be preferable. DF for instance gets away with a range of 0-7 for water level (and probably uses another bit to switch between water and lava). Is there really enough difference in behavior between a mineral content of 24 and 25?
There is a difference if you are sufficiently nerdy ;) But 0-15 is plenty, oh great lord of bit allocation.

Immortius said:
Actually, something that might be worth considering is just having a reasonable number of different types of blocks (dirt, clay, gravel, sand, rocky dirt, etc) with various data on their mineral content, how porous they are and so forth, and dispensing with mineral levels entirely. One benefit is when a player is constructing a farm or garden, they have a very visible indicator of the quality of their soil (also avoids the issue of having to track mineral levels in held blocks). And if the soil should change its nature for some reason, this would be done through a block swap and thus also be very visible.
I had hoped to have trees DEPLETE minerals in an area. However, this COULD be done by converting blocks from IDs with high mineral content (for whatever that tree species considers mineral content) to blocks with low mineral content. With 4096 block types that could be very cool. For example you could imagine that a golden tree depletes the high-grade gold ore blocks it draws gold from into medium-grade, then low-grade gold ore blocks, then high-grade something else, and on down to dirt, all while it fruits solid gold blocks. So yes I like this idea. Not only is it easier to implement, but it might be more fun to play with as well.
 

Cervator

Org Co-Founder & Project Lead
Contributor
Design
Logistics
SpecOps
There's also the potential to hitch a ride on other block data - like that secondary resource value I keep hawking (replacement for the current "state" liquid nibble). That could be oil, gas, etc in lower layers, but moisture in soil near the top and plant-usable mineral content in other blocks around there. While water would eventually soak into soil, then eventually seep down to replenish moisture or arrive at an aquifer non-water resources near the top could provide that depletion functionality or even be the base for when a block flips from a rich type to a poor type, and so on. Player could even repair it with fertilizer :)

Main trick would be mixing and matching water and mineral content - data efficiency would dictate an either/or approach per block type
 
Top