Serializing QGraphicsScene, and JSON
-
@SGaist
Yep, I think I get that. I have only ever needed to support one serialization format at a time, and this time I know it will be JSON and JSON only, so it's not worth coding for flexible alternatives here. The point nonetheless is that the JSON way of serializing, at least via the standard Python `json.dump`, requires you to produce a copy of your object containing just the bits you want serialized before you can proceed. That is different from the other approaches, where a custom object simply writes its serialization to the stream.
If I'm not boring you yet :), you wrote that you would expect me to be doing:

```python
scene_dump = my_method_to_get_the_data(my_scene)
with open("some_file.json", "wt") as json_file:
    json.dump(scene_dump, json_file)
```

That is the way I have now done it for JSON. But it's not the way I'm used to for serialization, which would be:

```python
with open("some_file.json", "wt") as json_file:
    method_to_walk_objects_serializing_what_they_want_to_stream(my_scene, json_file)
```
Anyway, I have a direction to proceed for now.
-
You can go with the second method too. You have `json.dumps`, which takes a dictionary and dumps the content to a string. Since each widget will be its own "entity", you can then append it to the file content.
-
@SGaist
I'm not quite sure what you mean. I use `json.dump()` to dump to a stream. The docs for that state:

> Note: Unlike `pickle` and `marshal`, JSON is not a framed protocol, so trying to serialize multiple objects with repeated calls to `dump()` using the same `fp` will result in an invalid JSON file.

I'm not sure what they mean by (not) a "framed" protocol, but aren't they saying I cannot call `dump()` recursively at each level of the recursive descent? Or perhaps they just mean I cannot do a second, separate `dump()` to append another object to the final output, because there isn't a top-level, enclosing node?

Meanwhile, I have raised https://stackoverflow.com/questions/59103160/python-json-serialize-excluding-certain-fields. There are suggestions there that I can achieve this via `json.dump(object, default=serialization_handler)`, but I don't get how yet....

I found no problem with C# or C++ serialization. I'm finding Python/JSON wilfully obscure, with no examples for something I would have thought I would not be the only person to want... :(
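For reference, a minimal sketch of how that `default=` hook behaves (the `Point` class and handler below are invented for illustration): `json.dump`/`dumps` call the handler for any object they cannot serialize natively, and serialize whatever JSON-compatible value the handler returns.

```python
import json

class Point:
    def __init__(self, x, y, scratch=None):
        self.x = x
        self.y = y
        self.scratch = scratch  # a field we do NOT want serialized

def serialization_handler(obj):
    # json.dump()/dumps() call this for every object they cannot
    # serialize natively; return a JSON-compatible stand-in.
    if isinstance(obj, Point):
        return {"x": obj.x, "y": obj.y}
    raise TypeError(f"{type(obj).__name__} is not JSON serializable")

print(json.dumps({"origin": Point(0, 0, scratch=object())},
                 default=serialization_handler))
# prints: {"origin": {"x": 0, "y": 0}}
```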
-
Warning: sneaky one-char difference: `json.dumps` <- see the `s`. It's not the same functionality. This one returns a valid JSON string from your dictionary.

I currently don't know how the dump is implemented in the JSON module, but my guess would be that if you call dump twice you will have:

```json
{"first_object": "test"}{"second_object": "other_test"}
```

which is not valid JSON. You will have to write the start and end of the document yourself, as well as the proper separation between the different dumped objects.
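A minimal sketch of that manual framing (assuming each object already reduces to a dict):

```python
import json

objects = [{"first_object": "test"}, {"second_object": "other_test"}]

with open("some_file.json", "wt") as json_file:
    json_file.write("[")                  # start of document
    for i, obj in enumerate(objects):
        if i:
            json_file.write(",")          # separator between objects
        json_file.write(json.dumps(obj))  # one valid fragment per object
    json_file.write("]")                  # end of document

# some_file.json now holds one valid JSON array:
# [{"first_object": "test"},{"second_object": "other_test"}]
```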
-
@SGaist
Yeah, I get that bit, if that is what they mean by "so trying to serialize multiple objects with repeated calls to `dump()`". As for `dumps()` vs `dump()`: that should just serialize to a string instead of a stream; I take it to be a special case which just uses a string stream instead of a file.

I do not have one single dictionary for my hierarchy. I have a whole hierarchy of various classes which needs to be descended to produce the serialization. The top-level caller of `dump()` does not even know what classes will be encountered, can't `import` all the classes, and is not the place to write the code for serializing each class anyway. At each class level I need some method in that class to be called which does know how to serialize that class's properties, and to recurse into its sub-objects. Like, I don't know, say some `__toJson__()` method in each class. And have `dump`/`dumps()` know to call that as it goes. Like C++ would know to call `<<`/`>>` on each class object as it serializes.
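Something close to that hook can be had with a `json.JSONEncoder` subclass; a sketch (the `__to_json__` name and `Node` class are invented for illustration, not a real protocol):

```python
import json

class WalkingEncoder(json.JSONEncoder):
    # json.dumps(..., cls=WalkingEncoder) calls default() for every
    # object it cannot serialize natively, at every level of the descent.
    def default(self, obj):
        if hasattr(obj, "__to_json__"):  # invented per-class hook
            return obj.__to_json__()
        return super().default(obj)      # raises TypeError otherwise

class Node:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

    def __to_json__(self):
        # Each class picks the properties it wants serialized; the
        # encoder recurses into the returned children by itself.
        return {"name": self.name, "children": self.children}

tree = Node("root", [Node("a"), Node("b", [Node("c")])])
print(json.dumps(tree, cls=WalkingEncoder, indent=2))
```

-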
I know that you don't have one dict for your whole hierarchy.

What I was suggesting was that you could build that dict up while traversing your hierarchy, and at the end call `json.dump` on the resulting object. That way, if you need to change the output at some point, you don't have to re-implement the traversal, just the "dict to output" part.
-
@SGaist
And that is precisely what I implemented on Friday, because I don't know any other way of doing it!

I require all my serializable classes to offer a `def json_dump_obj(self) -> dict` method. And the recursive descent walker goes `if hasattr(obj, 'json_dump_obj'): serialized_obj = obj.json_dump_obj()`, and uses that in the serialized object it returns.

So I do a pre-pass complete traversal, returning a "shadow", serializable tree in a single object which at the end can be passed to `json.dump(shadow_obj)`. My "uneasiness" is that this does not scale nicely (memory-wise) when my object tree has 1,000,000 nodes!
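A rough sketch of that pre-pass walker as described (my reconstruction of the shape, not the actual code):

```python
import json

def to_shadow(obj):
    # Pass 1: build the "shadow", JSON-serializable copy of the tree.
    if hasattr(obj, "json_dump_obj"):
        obj = obj.json_dump_obj()  # class chooses its serialized fields
    if isinstance(obj, dict):
        return {key: to_shadow(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_shadow(item) for item in obj]
    return obj                     # plain str/int/float/bool/None

def serialize(root, stream):
    shadow = to_shadow(root)   # whole shadow tree held in memory...
    json.dump(shadow, stream)  # ...then dumped in a second pass
```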
In C++, serialization could have worked this way too. But it does not. For `<<`/`>>` you would override the operator in each class, and each object would serialize itself directly to the archiving stream, not return some "shadow" object for later serialization in one go. There is no "one pass to build a serialization representation in 'shadow' objects (`json_dump_obj()`), then a second call/pass (`json.dump()`) to serialize that to the stream". This is the nub of my question about the approach....
-
The dict is nothing JSON specific. Just call that method `to_dict` or something like that; that will keep its purpose generic.

Since you may have that many items, it might indeed be unfriendly. What about writing a small serialiser/marshaller class that would manage the file and its content? During the traversal you would pass that object along to a `serialize` method of your class. That would follow your original design more closely.

Note that with a small example we could devise a nice way to do that more easily. Can you write down a dummy small version of your current use case?
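A rough sketch of such a serialiser/marshaller (names are illustrative; this variant writes one flat JSON array incrementally, rather than building any intermediate tree):

```python
import json

class JsonMarshaller:
    # Owns the output stream and writes JSON incrementally, so no
    # complete "shadow" tree is ever held in memory.
    def __init__(self, stream):
        self._stream = stream
        self._first = True

    def __enter__(self):
        self._stream.write("[")
        return self

    def __exit__(self, *exc):
        self._stream.write("]")

    def write(self, data):
        if not self._first:
            self._stream.write(",")
        self._first = False
        json.dump(data, self._stream)

class Node:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

    def serialize(self, marshaller):
        # Each object writes itself, then recurses; the stream sees a
        # flat sequence of records rather than a nested document.
        marshaller.write({"name": self.name})
        for child in self.children:
            child.serialize(marshaller)

with open("scene.json", "wt") as f:
    with JsonMarshaller(f) as m:
        Node("root", [Node("a"), Node("b")]).serialize(m)
```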
-
@SGaist
The `dict` may not be JSON specific, but it is serialization specific, since it is only populated with the subset of properties which are to be serialized. And to work for JSON, all of its properties/sub-objects must be JSON-serializable. That's why I have at least named the required serialization method `json_dump_obj`.

I don't really have a million items, or classes :) I will have, say, a dozen items. But there are various classes for the nodes, and various other classes for their sub-objects. I want to keep the serialization of each class inside each class.

When I have time, I will show what I have. Thank you kindly for looking; I will reply to your name here when available :)
-
@SGaist
OK, I believe I have finally achieved what I wanted/expected for the serialization approach. The key is the (optional) `default=global_serialization_method` argument to `json.dump()` or `dumps()`.

Remember that, for my one million items :), I want an approach which serializes as it descends, rather than a first pass which returns some complete object hierarchy, followed by a call to `json.dump()` to dump that produced hierarchy.

Briefly, the code outline is now like:
```python
class ModelScene(QGraphicsScene):

    # Serialize whole scene to JSON into stream
    def json_serialize(self, stream) -> None:
        # Get `json.dump()` to call `ModelScene.json_serialize_dump_obj()`
        # on every object to be serialized
        json.dump(self, stream, indent=4,
                  default=ModelScene.json_serialize_dump_obj)

    # Static method to be called from
    # `json.dump(default=ModelScene.json_serialize_dump_obj)`.
    # This method is called on every object to be dumped/serialized
    @staticmethod
    def json_serialize_dump_obj(obj):
        # If object has a `json_dump_obj()` method, call that...
        if hasattr(obj, "json_dump_obj"):
            return obj.json_dump_obj()
        # ...else just allow the default JSON serialization
        return obj

    # Return dict object suitable for serialization via `json.dump()`
    # This one is in the `ModelScene(QGraphicsScene)` class
    def json_dump_obj(self) -> dict:
        return {
            "_classname_": self.__class__.__name__,
            "node_data": self.node_data,
        }


class CanvasModelData(QAbstractListModel):

    # Return dict object suitable for serialization via `json.dump()`
    # This one is in the `CanvasModelData(QAbstractListModel)` class
    def json_dump_obj(self) -> dict:
        _data = {}
        for key, value in self._data.items():
            _data[key] = value
        return {
            "_classname_": self.__class__.__name__,
            "data_type": self.data_type,
            "_data": _data,
        }
```
The point here is:
- Every "complex" class defines a `def json_dump_obj(self) -> dict:` method.
- That method returns just the properties/sub-objects wanted in the serialization.
- The top-level `json.dump(self, stream, default=ModelScene.json_serialize_dump_obj)` causes every node visited to be incrementally serialized to `stream`, via the static method `ModelScene.json_serialize_dump_obj`. That calls my `obj.json_dump_obj()` if available, else falls back to the default JSON serialization for basic object types.
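With that in place, a usage sketch (the file name is invented; `ModelScene` is the class from the outline above):

```python
scene = ModelScene()
with open("scene.json", "wt") as stream:
    scene.json_serialize(stream)  # dumps the whole scene as it descends
```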
Interestingly, I came across someone with the same concerns as me. From "What is the difference between json.dump() and json.dumps() in python?", solution https://stackoverflow.com/a/57087055/489865:

> In memory usage and speed.
>
> When you call `jsonstr = json.dumps(mydata)` it first creates a full copy of your data in memory, and only then do you `file.write(jsonstr)` it to disk. So this is a faster method, but it can be a problem if you have a big piece of data to save.
>
> When you call `json.dump(mydata, file)` -- without the `s` -- new memory is not used, as the data is dumped by chunks. But the whole process is about 2 times slower.
>
> Source: I checked the source code of `json.dump()` and `json.dumps()`, and also tested both variants, measuring the time with `time.time()` and watching the memory usage in htop.
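A quick way to reproduce that comparison yourself, mirroring the answer's method (file names invented; absolute numbers will vary by machine):

```python
import json
import time

mydata = {"items": list(range(1_000_000))}

start = time.time()
with open("via_dumps.json", "wt") as file:
    jsonstr = json.dumps(mydata)  # full string built in memory first
    file.write(jsonstr)
print("dumps + write:", time.time() - start)

start = time.time()
with open("via_dump.json", "wt") as file:
    json.dump(mydata, file)       # written out in chunks instead
print("dump:", time.time() - start)
```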