Serializing QGraphicsScene, and JSON

JonB

@SGaist said in Serializing QGraphicsScene, and JSON:

I think you might be over complicating things.

Moi? ;-)

Right, I'll have a QGraphicsItem-derived class instance ending up like this:

class MyGraphicsItem(QGraphicsItem):
    def __init__(self):
        super().__init__()
        self.saveThisProperty = 999
        self.dontSaveThisProperty = "SGaist"
        self.saveThisObject = ObjectType1()
        self.dontSaveThisObject = ObjectType2()

I would want the serialized XML to look like

<MyGraphicsItem>
    <saveThisProperty>999</saveThisProperty>
    <saveThisObject>
        ...
    </saveThisObject>
</MyGraphicsItem>

Note that I only serialize certain properties/sub-objects, and I do not want to serialize the whole QGraphicsItem from which my class instance is derived.

In XML I'm fine. I know how to output bits of XML as I go along for what I do want, either to a temporary in-memory document which I save at the end or by outputting individual nodes/elements to the stream incrementally. I don't have to serialize everything.

In JSON, I thought you basically call json.serialize(object) or similar on a whole object instance, like one of my MyGraphicsItem, and it goes off and produces the whole thing, obviously including all properties/sub-objects, plus the underlying base QGraphicsItem. Which I don't want!

I don't know how you construct a JSON output "one selective bit at a time", and how you nest sub-objects correctly inside each other, if you do not serialize a whole object to let JSON serialize produce the correct output. That's why I thought maybe you have to first create a standalone object, then copy just those bits you want from the original object into it, and then you can json.serialize(standaloneObject) that object to get what you want.

So how do you "incrementally" & "selectively" produce a JSON output from just desired bits of existing objects (including still getting nesting right)?

Meanwhile, I'm off to read up about JSON serialization now.... :)

SGaist

So, yes you are overcomplicating things quite a lot here.

JSON is just a data format; you have to do the same job as with XML, except it looks a bit different.

You'll have something along the lines of:

{
    "MyGraphicsItem": {
        "saveThisProperty": 999,
        "saveThisObject": {
            "thingToSave": "funnyValue",
            "quality": 99.0
        }
    }
}

JonB

@SGaist
I know, but how do you produce that? How do you do "add this node", "skip that node" to produce it? I thought you serialize just one object, i.e. the whole MyGraphicsItem, it does the recursion/inspection. How do I do bits myself as I go along, and selected nested sub-objects? OK, I need to read up, presumably this is nought to do with any available Qt function, I need to find some "Python JSON serialize" set of functions and then I'll understand, do you know what I need to look at?

SGaist

How are you doing it currently with XML ?

I'm guessing you are traversing your nodes, constructing a dictionary with everything you want in it and then generating the XML, correct?

It's the same for the json part.

import json

scene_dump = my_method_to_get_the_data(my_scene)

with open("some_file.json", "wt") as json_file:
    json.dump(scene_dump, json_file)

scene_dump will contain the dictionary I described in my previous post.
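
A minimal sketch of what such a data-gathering function could look like, using plain Python stand-ins so it runs without Qt. The function name item_to_dict and the fields of ObjectType1 are assumptions made up for illustration; only saveThisProperty, saveThisObject and the target JSON layout come from the posts above.

```python
import io
import json


class ObjectType1:
    """Hypothetical sub-object; field names are invented for illustration."""
    def __init__(self):
        self.thingToSave = "funnyValue"
        self.quality = 99.0
        self.cache = object()  # not wanted in the output


class MyGraphicsItem:
    """Stand-in for the QGraphicsItem subclass (no Qt needed to run this)."""
    def __init__(self):
        self.saveThisProperty = 999
        self.dontSaveThisProperty = "SGaist"
        self.saveThisObject = ObjectType1()


def item_to_dict(item):
    # Copy only the wanted attributes; nested dicts become nested JSON objects.
    return {
        "MyGraphicsItem": {
            "saveThisProperty": item.saveThisProperty,
            "saveThisObject": {
                "thingToSave": item.saveThisObject.thingToSave,
                "quality": item.saveThisObject.quality,
            },
        }
    }


scene_dump = item_to_dict(MyGraphicsItem())

stream = io.StringIO()  # stands in for an open file
json.dump(scene_dump, stream, indent=4)
json_text = stream.getvalue()
```

Anything that is not copied into the dictionary simply never reaches the output, which is how the "skip that node" part is expressed in this approach.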

JonB

@SGaist
Then if I understand right your my_method_to_get_the_data(my_scene) is precisely what I called creating my "standalone" semi-copied object, which I have to construct off copying what I actually do want out of each node and reproducing the desired nesting?

I'm guessing you are traversing your nodes, constructing a dictionary with everything you want in it and then generating the XML, correct?

No, not for this. I would (probably) do (imaginary language):

xmlStream.writeStartElement("MyGraphicsItem")
    xmlStream.writeElementWithValue("saveThisProperty", 999)
    xmlStream.writeStartElement("saveThisObject")
        ...
    xmlStream.writeEndElement("saveThisObject")
xmlStream.writeEndElement("MyGraphicsItem")

So you see I am doing it "incrementally" and "selectively" to stream. I do not bother to create a dictionary or in-memory object for what I want to save.

That is precisely what I am asking about JSON serialization: can I do it bit-by-bit like for an XML stream, or do I need to create a complete in-memory object with just what I want in it in order to call a single JSON serialize on the top-level object? If you could just answer that, I'll know how to proceed.

P.S.
For example, as I start my Googling, question without solution https://stackoverflow.com/questions/46895020/python-serialize-only-specific-fields-to-json

This gives me a json with all the fields.

For one operation I need this json with all the fields , for another I only need name and age .

Is there a way I can specify which fields to ignore and reuse the same class without having to recreate ?

Or this one https://groups.google.com/forum/#!topic/django-users/w7PINeiSAVE:

is there any way to serialize models and remove some fields? I.e. I
would like to serialize User for example, but I definitely don't want
the email to be there.

you can just create your own dict variable, and then using simplejson
to convert it to json

Of course, but that's exactly what I'm trying to avoid...

Sure enough, I'm finding plenty of examples of JSON serialization from Python which want to serialize the whole object, but no luck on how you go about serializing only some of its properties/sub-objects.... :(

SGaist

Well, if you have only a set of reduced properties, I'd go with having a method that returns said data that you feed to the selected serialiser so you separate your concerns.

JonB

@SGaist
I believe I understand you correctly, and this is the conclusion I came to earlier today having done some reading. So now each object type in the hierarchy has a serialization method which is responsible for returning a reduced, "shadow" object which is a copy of those properties it actually wants saved. A single top-level object, "shadow hierarchy", is produced which can be passed to json.dump(object, stream) to actually serialize at the end. Feels to me like a kludgy way to serialize, but that does seem to be the Python way to do it.

In C++ you have the << & >> archive/serialization operators. Am I right: for any class you can override them and write whatever you like to the stream for the object serialization? None of this "you have to produce another object containing just what you want serialized"? That's the approach I'm used to (C#/.NET serialization).

SGaist

If you are thinking about the streaming operator, you will have the same issue. Take the QTextStream operators, for example: you pass it what you want from the object, but if you want to serialise to various formats like XML or JSON, you'll have to do that yourself.

So what would be more generic is to define a serialiser base class and then implement serialisers for the different formats. You would then pass the serialiser to the class, which uses it to dump whatever data it wants to dump.
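
As a rough sketch of that idea, assuming a hypothetical Serialiser base class with a single render() method (none of these names come from Qt or from the posts above): the item decides which data to expose, and the serialiser it is handed decides the format.

```python
import json
from abc import ABC, abstractmethod
from xml.etree import ElementTree


class Serialiser(ABC):
    """Hypothetical base class: turns a name plus a dict into text."""
    @abstractmethod
    def render(self, name: str, data: dict) -> str:
        ...


class JsonSerialiser(Serialiser):
    def render(self, name, data):
        return json.dumps({name: data})


class XmlSerialiser(Serialiser):
    def render(self, name, data):
        root = ElementTree.Element(name)
        self._fill(root, data)
        return ElementTree.tostring(root, encoding="unicode")

    def _fill(self, parent, data):
        # Nested dicts become nested elements; everything else becomes text.
        for key, value in data.items():
            child = ElementTree.SubElement(parent, key)
            if isinstance(value, dict):
                self._fill(child, value)
            else:
                child.text = str(value)


class Item:
    """Stand-in for a graphics item; it only exposes what it wants saved."""
    def __init__(self):
        self.saveThisProperty = 999
        self.dontSaveThisProperty = "SGaist"

    def dump(self, serialiser):
        return serialiser.render(
            "MyGraphicsItem", {"saveThisProperty": self.saveThisProperty}
        )


item = Item()
as_json = item.dump(JsonSerialiser())
as_xml = item.dump(XmlSerialiser())
```

The traversal and the choice of fields live in the item; swapping the output format means passing a different serialiser, not rewriting the item.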

You can take inspiration from Django's REST Framework. You define your views, serialisers and decide what renderer to use. So the format used is independent of the data. Note that in the case of this project the "view + serialiser" would be your class and the renderer matches your serialiser.

JonB

@SGaist
Yep, I think I get that. I have only ever needed to support one serialization format at a time, and this time I know it will be JSON and JSON only. It's not worth coding for flexible alternatives here.

The point nonetheless is that the JSON way of serializing, at least via the standard Python json.dump, requires you to produce a copy of your object with just the bits you want serialized. That is different from the others, where a custom object simply writes its serialization to the stream.

If I'm not boring you yet :), you wrote that I would expect to be doing:

scene_dump = my_method_to_get_the_data(my_scene)

with open("some_file.json", "wt") as json_file:
    json.dump(scene_dump, json_file)

It is the way I have now done it for JSON. But that's not the way I'm used to for serialization. That would be:

with open("some_file.json", "wt") as json_file:
    method_to_walk_objects_serializing_what_they_want_to_stream(my_scene, json_file)

Anyway, I have a direction to proceed for now.

SGaist

You can go on with the second method too. You have json.dumps which takes a dictionary and dumps the content in a string. Since each widget will be its own "entity" you can then prepend it to the file content.

JonB

@SGaist
I'm not quite sure what you mean. I use json.dump() to dump to a stream. The docs for that state:

Note: Unlike pickle and marshal, JSON is not a framed protocol, so trying to serialize multiple objects with repeated calls to dump() using the same fp will result in an invalid JSON file.

I'm not sure what they mean by (not) a "framed" protocol, but aren't they saying I cannot call dump() recursively at each level in the recursive descent? Or, perhaps they just mean I cannot do a second, separate dump() to append another object to the final output, because there isn't a top-level, enclosing node?

Meanwhile, I have raised https://stackoverflow.com/questions/59103160/python-json-serialize-excluding-certain-fields. There are suggestions there that I can achieve via json.dump(object, default=serialization_handler), but I don't get how yet....

I found no problem with C# or C++ serialization. I'm finding Python/JSON wilfully obscure, and no examples for something I would have thought I would not be the only person to want... :(

SGaist

Warning: sneaky one char difference: json.dumps <- see the s. It's not the same functionality. This one returns a valid JSON string from your dictionary.

I currently don't know how the dump is implemented for the JSON module but my guess would be that if you call dump twice you will have:

{"first_object": "test"}{"second_object": "other_test"}

Which is not valid JSON. You will have to write the start and end of the document yourself, as well as proper separation between the different dumped objects.
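
A small sketch that checks this guess and shows one way to do the framing by hand. Wrapping the objects in a JSON array is an assumption chosen for illustration; an enclosing object with keys would work along the same lines.

```python
import io
import json

first = {"first_object": "test"}
second = {"second_object": "other_test"}

# Two dump() calls on the same stream: the objects are simply concatenated.
broken = io.StringIO()
json.dump(first, broken)
json.dump(second, broken)

try:
    json.loads(broken.getvalue())
    broken_is_valid = True
except json.JSONDecodeError:
    broken_is_valid = False

# Writing the enclosing array and the separators by hand keeps it valid.
framed = io.StringIO()
framed.write("[")
for index, obj in enumerate([first, second]):
    if index:
        framed.write(", ")
    json.dump(obj, framed)
framed.write("]")

framed_objects = json.loads(framed.getvalue())
```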

JonB

@SGaist
Yeah, I get that bit if that is what they mean by "so trying to serialize multiple objects with repeated calls to dump()".

The dumps() vs dump() should just serialize to string instead of stream, I take it to be a special case which just uses a string stream instead of a file.
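
That reading can be checked quickly with an in-memory stream; the data below is made up for the purpose.

```python
import io
import json

data = {"MyGraphicsItem": {"saveThisProperty": 999, "quality": 99.0}}

# dump() writes to any object that has a write() method...
buffer = io.StringIO()
json.dump(data, buffer, indent=4)

# ...while dumps() returns the text directly.
text = json.dumps(data, indent=4)

same_output = buffer.getvalue() == text
```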

I do not have one single dictionary for my hierarchy. I have a whole various-classes hierarchy which needs to be descended to produce the serialization. The top-level caller of dump() does not even know what classes will be encountered, can't import all classes, and is not the place to write the code for serializing each class anyway. At each class level I need some method in that class to be called which does know how to serialize that class's properties, and call to recurse into its sub-objects. Like, I don't know, say some __toJson__() method in each class. And have dump/dumps() know to call that as it goes. Like C++ would know to call <</>> on each class object as it serializes.

SGaist

I know that you don't have one dict for your whole hierarchy.

What I was suggesting was that you could build that dict up traversing your hierarchy and at the end call json.dump on the returned object. That way, if you need to change the output at some point you don't have to re-implement the traversal, just the "dict to output" part.

JonB

@SGaist
And that is precisely what I implemented on Friday, because I don't know any other way of doing it!

I require all my serializable classes to offer a def json_dump_obj(self) -> dict method. And the recursive descent walker goes if hasattr(obj, 'json_dump_obj'): serialized_obj = obj.json_dump_obj(), and uses that in the serialized object it returns.
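
A sketch of how such a walker might look. The function name to_serializable and the Node class are hypothetical; only the json_dump_obj convention and the hasattr check are taken from the description above.

```python
import json


def to_serializable(obj):
    """Hypothetical walker: builds the 'shadow' tree out of plain JSON types."""
    if hasattr(obj, "json_dump_obj"):
        # Let the class decide what to expose, then keep descending.
        return to_serializable(obj.json_dump_obj())
    if isinstance(obj, dict):
        return {key: to_serializable(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_serializable(value) for value in obj]
    return obj  # str, int, float, bool and None pass through unchanged


class Node:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.scratch = object()  # deliberately never serialized

    def json_dump_obj(self) -> dict:
        return {"name": self.name, "children": self.children}


tree = Node("root", [Node("a"), Node("b")])
shadow_obj = to_serializable(tree)
text = json.dumps(shadow_obj)
```

The walker never needs to import or know the concrete classes; it only relies on the presence of the method.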

So I do a pre-pass complete traversal, returning a "shadow, serializable" tree into a single object which at the end can be passed to json.dump(shadow_obj).

My "uneasiness" is this does not scale nicely (memory-wise) when my object tree has 1,000,000 nodes!

In C++ serialization could have worked this way too. But it does not. For <</>> you would override the operator in each class and each object would serialize itself direct to the archiving stream, not return some "shadow" object for later serialization in one go. No "one pass to get a serialization representation in "shadow" objects built (json_dump_obj()), and then a second call/pass (json.dump()) to serialize that to stream". This is the nub of my question about approach....

SGaist

The dict is nothing JSON specific. Just call that method to_dict or something like that; that will keep its purpose generic.

Since you may have that many items, it might indeed be unfriendly. What about writing a small serialiser/marshaller class that would manage the file and its content?

During the traversal you would pass that object along to a serialize method of your class. That would follow your original design more closely.

Note that with a small example we could devise a nice way to do that more easily. Can you write down a dummy small version of your current use case ?
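
One possible shape for such a helper, as a sketch only: JsonStreamWriter and its methods are hypothetical names, and the item classes are plain Python stand-ins. Each object writes itself straight to the stream, so no intermediate dictionary for the whole tree is built.

```python
import io
import json


class JsonStreamWriter:
    """Hypothetical helper that writes JSON tokens straight to a stream."""

    def __init__(self, stream):
        self._stream = stream
        self._needs_comma = [False]  # one flag per nesting level

    def start_object(self, key=None):
        self._prefix(key)
        self._stream.write("{")
        self._needs_comma.append(False)

    def end_object(self):
        self._needs_comma.pop()
        self._stream.write("}")

    def value(self, key, value):
        self._prefix(key)
        self._stream.write(json.dumps(value))  # intended for leaf values

    def _prefix(self, key):
        # Emit the separator for every member after the first one.
        if self._needs_comma[-1]:
            self._stream.write(", ")
        self._needs_comma[-1] = True
        if key is not None:
            self._stream.write(json.dumps(key) + ": ")


class SubObject:
    def __init__(self):
        self.quality = 99.0

    def serialize(self, writer, key):
        writer.start_object(key)
        writer.value("quality", self.quality)
        writer.end_object()


class Item:
    def __init__(self):
        self.saveThisProperty = 999
        self.dontSaveThisProperty = "SGaist"
        self.saveThisObject = SubObject()

    def serialize(self, writer, key):
        writer.start_object(key)
        writer.value("saveThisProperty", self.saveThisProperty)
        self.saveThisObject.serialize(writer, "saveThisObject")
        writer.end_object()


stream = io.StringIO()  # stands in for an open file
writer = JsonStreamWriter(stream)
writer.start_object()
Item().serialize(writer, "MyGraphicsItem")
writer.end_object()
streamed = stream.getvalue()
```

This mirrors the writeStartElement / writeEndElement style of the XML stream: the writer owns the braces and commas, each class owns its choice of fields.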

JonB

@SGaist
The dict may not be JSON specific but it is serialization specific, since it is only populated with some subset of properties which are to be serialized. And to work for JSON all its properties/sub-objects must be JSON-serializable. That's why I have at least named the required serialization method json_dump_obj.

I don't really have a million items, or classes :) I will have, say, a dozen items. But there are various classes for the nodes, and various other classes for their sub-objects. I want to keep the serialization of each class inside each class.

When I have time, I will show what I have. Thank you kindly for looking, I will reply to your name here when available :)

JonB

@SGaist
OK, I believe I have finally achieved what I wanted/expected for the serialization approach. The key is the (optional) default=global_serialization_method argument to json.dump() or dumps().

Remember that, for my one million items :), I want an approach which serializes as it descends, rather than a first pass which returns some complete object hierarchy followed by a call to json.dump() to dump that produced hierarchy.

Briefly, code outline is now like:

class ModelScene(QGraphicsScene):

  # Serialize whole scene to JSON into stream
  def json_serialize(self, stream) -> None:
    # Get `json.dump()` to call `ModelScene.json_serialize_dump_obj()` on every object it cannot serialize natively
    json.dump(self, stream, indent=4, default=ModelScene.json_serialize_dump_obj)

  # Static method to be called from `json.dump(default=ModelScene.json_serialize_dump_obj)`
  # `json` calls this only for objects it cannot serialize natively (not for dict, list, str, numbers, ...)
  @staticmethod
  def json_serialize_dump_obj(obj):
    # if object has a `json_dump_obj()` method call that...
    if hasattr(obj, "json_dump_obj"):
      return obj.json_dump_obj()
    # ...else report it as unserializable; returning `obj` unchanged would make `json` fail with a circular reference error
    raise TypeError(f"Object of type {obj.__class__.__name__} is not JSON serializable")

  # Return dict object suitable for serialization via json.dump()
  # This one is in `ModelScene(QGraphicsScene)` class
  def json_dump_obj(self) -> dict:
    return {
      "_classname_": self.__class__.__name__,
      "node_data": self.node_data
      }
  
class CanvasModelData(QAbstractListModel):

  # Return dict object suitable for serialization via json.dump()
  # This one is class CanvasModelData(QAbstractListModel)
  def json_dump_obj(self) -> dict:
    _data = {}
    for key, value in self._data.items():
      _data[key] = value
    return {
      "_classname_": self.__class__.__name__,
      "data_type": self.data_type,
      "_data": _data
      }

The point here is:

Every "complex" class defines a def json_dump_obj(self) -> dict: method.
That method returns just the properties/sub-objects wanted in the serialization.
The top-level json.dump(self, stream, default=ModelScene.json_serialize_dump_obj) serializes incrementally to stream as it descends: for each object it cannot serialize natively it calls the static method ModelScene.json_serialize_dump_obj, which returns my obj.json_dump_obj() if available. Basic types (dict, list, str, numbers) go through the default JSON serialization.
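
The same pattern reduced to a runnable sketch without Qt. Scene and NodeData are hypothetical stand-ins for ModelScene and CanvasModelData, and json_default plays the role of the static hook; here the hook raises TypeError for objects it does not know, which is what the json module documents for a default function.

```python
import io
import json


def json_default(obj):
    # json calls this only for objects it cannot serialize natively.
    if hasattr(obj, "json_dump_obj"):
        return obj.json_dump_obj()
    raise TypeError(f"{type(obj).__name__} is not JSON serializable")


class NodeData:
    def __init__(self):
        self.data_type = "points"
        self._data = {"x": 1, "y": 2}
        self._cache = object()  # not wanted in the output

    def json_dump_obj(self) -> dict:
        return {
            "_classname_": type(self).__name__,
            "data_type": self.data_type,
            "_data": dict(self._data),
        }


class Scene:
    def __init__(self):
        self.node_data = NodeData()
        self.selection = object()  # not wanted in the output

    def json_dump_obj(self) -> dict:
        # node_data is returned as-is; json calls json_default on it in turn.
        return {"_classname_": type(self).__name__, "node_data": self.node_data}


stream = io.StringIO()  # stands in for an open file
json.dump(Scene(), stream, indent=4, default=json_default)
result = json.loads(stream.getvalue())
```

Each json_dump_obj() only has to describe its own level; sub-objects can be handed back untouched because the encoder comes back to the hook for them.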

Interestingly, I came across someone with the same concerns as me. From What is the difference between json.dump() and json.dumps() in python?, solution https://stackoverflow.com/a/57087055/489865:

In memory usage and speed.

When you call jsonstr = json.dumps(mydata) it first creates a full copy of your data in memory and only then you file.write(jsonstr) it to disk. So this is a faster method but can be a problem if you have a big piece of data to save.

When you call json.dump(mydata, file) -- without 's', new memory is not used, as the data is dumped by chunks. But the whole process is about 2 times slower.

Source: I checked the source code of json.dump() and json.dumps() and also tested both the variants measuring the time with time.time() and watching the memory usage in htop.
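
The chunked behaviour can be observed through json.JSONEncoder.iterencode(), which yields the output piece by piece. The sketch below, with a hypothetical Leaf class and a call log, suggests that the default hook is invoked lazily as the output is consumed rather than for the whole tree up front; this is based on how CPython's encoder behaves as far as I can tell, so treat it as an observation, not a guarantee.

```python
import json

calls = []  # records the order in which objects are converted


class Leaf:
    def __init__(self, n):
        self.n = n

    def json_dump_obj(self) -> dict:
        calls.append(self.n)
        return {"n": self.n}


def json_default(obj):
    if hasattr(obj, "json_dump_obj"):
        return obj.json_dump_obj()
    raise TypeError(f"{type(obj).__name__} is not JSON serializable")


encoder = json.JSONEncoder(default=json_default)
chunks = encoder.iterencode([Leaf(1), Leaf(2), Leaf(3)])

first_chunk = next(chunks)            # only the start of the output so far
calls_after_first_chunk = list(calls)
rest = "".join(chunks)                # consuming the rest triggers the hook
decoded = json.loads(first_chunk + rest)
```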