[PROPOSAL] Extensible Database Backend for Retroplayer
#1
Name: Cole

Summary: Database For Retroplayer

How will I achieve this: I will use SQLAlchemy + protobuf + ZeroMQ to create a language-agnostic API for Retroplayer data.

What will the project focus on:

Extensibility. Most programmers very much dislike SQL and do whatever they can to stay away from it. By abstracting the database away from the developer, I will create an easily extensible interface. The API will be defined by the protobuf definitions: if you change a protobuf definition or add a new message, the database will change to meet the new specification. The protobuf messages define the database structure.

Benefits:

A backend such as this will allow other developers to focus on the core logic of Retroplayer without having to worry about the database backend. This segmentation and toolchain also has the built-in ability to operate across TCP, with any of the numerous languages supported by both protobuf and ZeroMQ. If successful, it can also serve as a model for improvements to the Kodi persistent storage model.

This backend will have the ability to build a database from scratch. No user or dev interaction with the database will be needed.

Goals:

1. Discuss implementation and design.
2. Draft the initial message design and relationship model; write protobuf messages.
3. Write SQLAlchemy metaclasses.
4. Testing/documentation.
5. Write the query API.
6. More testing/documentation, with a goal of 100% code coverage.
7. Implement security best practices and optimize speed.

What does it touch in Kodi: Nothing; it will provide an API for persistent storage that is agnostic of the database implementation.
Possible mentors: garbear

I have professional experience with both SQLAlchemy and ZeroMQ, and I know the pitfalls and gotchas of both APIs. I designed an API that used a message system similar to protobuf; it leveraged both SQLAlchemy and ZeroMQ to create a persistent storage solution used by a distributed system of robots at a major US research facility. I feel that this model, once implemented successfully in Retroplayer, can replace the current persistent data model in Kodi. Because the implementation can create new tables and relationships from a simple message definition, it will also be easy for add-ons to use the database.

While not in my realm of expertise, I can design the user-facing portions of the library as well. I do, however, want my focus to be on implementing a rock-solid, well-tested backend. I see this as a start to becoming an active developer of a project I have used for many years.
#2
Hey, thanks for your proposal.

I have used protobuf and ZeroMQ (between threads, over IPC, TCP and UDP) in the past as well but I've never heard of SQLAlchemy so it would be nice if you could elaborate on that a bit.
Is your plan to write a library that can be used in Kodi to communicate with an external tool/service (backend) that implements the database? Can you give a more detailed overview of how you would map the database interactions to protobuf messages and how we would handle different classes/objects to be stored in the database? And how would the backend be implemented (C++, Python, ...) and with what database engine/system?
#3
(2015-03-23, 22:44)Montellese Wrote: Hey, thanks for your proposal.

I have used protobuf and ZeroMQ (between threads, over IPC, TCP and UDP) in the past as well but I've never heard of SQLAlchemy so it would be nice if you could elaborate on that a bit.
Is your plan to write a library that can be used in Kodi to communicate with an external tool/service (backend) that implements the database? Can you give a more detailed overview of how you would map the database interactions to protobuf messages and how we would handle different classes/objects to be stored in the database? And how would the backend be implemented (C++, Python, ...) and with what database engine/system?

Quote:I have used protobuf and ZeroMQ (between threads, over IPC, TCP and UDP)
I have not worked with protobufs other than a test case to see if we could switch our proprietary message protocol over to them. It ended up being more trouble than it was worth for every developer on my team other than me. I am looking forward to working with them on this project.

I have done basic implementations of ZMQ, but would definitely seek some guidance on how to structure the message system. Right now I am thinking push/pull for most of the data. For anything that is done on a callback (e.g. media progress), pub/sub would probably be appropriate.
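
To make that concrete, here is a rough sketch of the two socket patterns (assuming pyzmq; the endpoints and message bytes are just placeholders, not a final design):

Code:
import time
import zmq

ctx = zmq.Context()

# PUSH/PULL for the bulk of the data: Kodi pushes serialized protobuf
# messages, the database backend pulls them off the wire and persists them.
pull = ctx.socket(zmq.PULL)
pull.bind("tcp://127.0.0.1:5557")           # placeholder endpoint
push = ctx.socket(zmq.PUSH)
push.connect("tcp://127.0.0.1:5557")
time.sleep(0.1)                             # give the connection a moment
push.send(b"<serialized protobuf bytes>")   # e.g. movie.SerializeToString()
print(pull.recv())

# PUB/SUB for callback-style events such as media progress.
pub = ctx.socket(zmq.PUB)
pub.bind("tcp://127.0.0.1:5558")
sub = ctx.socket(zmq.SUB)
sub.connect("tcp://127.0.0.1:5558")
sub.setsockopt(zmq.SUBSCRIBE, b"progress")  # only care about progress events
time.sleep(0.1)                             # slow-joiner: let SUB connect first
pub.send_multipart([b"progress", b"42%"])
print(sub.recv_multipart())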


Quote:I've never heard of SQLAlchemy so it would be nice if you could elaborate on that a bit.

SQLAlchemy is an ORM for Python. It manages the relationship between objects and the database, and gives us pooling, security, and many other features for free. It is used in many lightweight web frameworks. Think Django ORM, but with the ability to dig deep into the database structure and behavior. As long as no database-specific features (e.g. JSONB for PostgreSQL) are used, it is agnostic to the database engine. (A thought just popped into my head: this might be a good base for an HTML5 client, but I digress.) There is an awesome community supporting the project, and the documentation is spectacular once you become comfortable with it.
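
To give a feel for it, a tiny sketch -- the Game model and its columns below are made up purely for illustration:

Code:
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class Game(Base):                       # hypothetical Retroplayer table
    __tablename__ = 'games'
    id = Column(Integer, primary_key=True)
    title = Column(String)
    platform = Column(String)

engine = create_engine('sqlite:///:memory:')    # any supported engine works here
Base.metadata.create_all(engine)                # the ORM builds the schema for us

session = sessionmaker(bind=engine)()
session.add(Game(title='Super Metroid', platform='SNES'))
session.commit()
print(session.query(Game).filter_by(platform='SNES').count())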

Quote:Is your plan to write a library that can be used in Kodi to communicate with an external tool/service (backend) that implements the database

Yes and no. The GSoC project would not be the place to completely separate the persistent data in Kodi from the core application. However, this would be a start. Over version iterations, all non-client-specific persistent data could be moved to this model if that is the direction the project wants to go. I think Retroplayer would be a perfect test case.

Quote:Can you give a more detailed overview how you would map the database interactions to protobuf messages and how we would handle different classes/objects to be stored in the database?

This is an area where I have to do some more research and have a few nights with a full pot of coffee, but here is the basic idea.

When a message arrives it is turned into an instance of a class. If a SQLAlchemy mapping already exists for that message/instance, it is either inserted into or updated in the database -- the ORM takes care of this. If a mapping does not exist, I will construct the class, which will in turn construct the table in the database. The data can then be inserted into the appropriate table. There is some magic that will need to be done with abstract base classes and metaclasses to make sure all of the relationships map correctly. Python, being dynamically typed, is perfect for this type of behavior. I also plan on storing any image blobs in the database, as internal blobs tend to have better performance in SQLite.
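
Roughly like this -- a sketch only, and the field dictionary here is a hand-written stand-in for what would really be read out of the protobuf message descriptor:

Code:
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()
engine = create_engine('sqlite:///:memory:')

def mapped_class_for(message_name, fields):
    """Build a declarative class (and therefore a table) at runtime."""
    attrs = {'__tablename__': message_name.lower(),
             'id': Column(Integer, primary_key=True)}
    for name, col_type in fields.items():
        attrs[name] = Column(col_type)
    cls = type(message_name, (Base,), attrs)   # dynamic class == dynamic table
    Base.metadata.create_all(engine)           # only creates tables that are new
    return cls

# Pretend these field names/types came out of a protobuf descriptor:
Movie = mapped_class_for('Movie', {'title': String, 'year': Integer})

session = sessionmaker(bind=engine)()
session.add(Movie(title='Ted', year=2012))
session.commit()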

Quote:And how would the backend be implemented (C++, python, ...) and with what database engine/system?

It would be written in Python. I see this being an I/O-bound process that can be multiprocess, so Python should be fine; I had no issue with it in my previous work, where the data throughput requirements were much higher. Frankly, the database engine does not matter much. I have the most experience with PostgreSQL, but that seems way too heavy for this application. I will test different database engines (it only takes a change of one line of code) to figure out which has the best performance and suits the needs of the project best. Right now I am thinking SQLite would be the best fit due to its wide support in embedded systems and small binary size. The only drawback I can see is that SQLite does not allow concurrent writes -- I don't think this will be an issue for this application. Another bonus is that we can use in-memory databases for testing!
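
The "one line of code" is just the engine URL; everything else stays the same. A sketch (the file names and credentials are placeholders):

Code:
from sqlalchemy import create_engine

engine = create_engine('sqlite:///retroplayer.db')              # embedded default
# engine = create_engine('sqlite:///:memory:')                  # unit tests
# engine = create_engine('postgresql://user:pass@host/kodi')    # heavier setups
# engine = create_engine('mysql+pymysql://user:pass@host/kodi')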

These are the best APIs I know for the job. If anybody has any input or more questions let me know!
#4
Just wanted to add this here as well to show what could be possible. Using a replication protocol such as the one below could allow all of the clients to sync automatically. You could add a new Kodi client to your network simply by installing the binary. Network discovery and replication of the database would make everything just work... but again, this is an example of a future possibility and beyond the scope of GSoC.

http://www.philipotoole.com/replicating-...consensus/
#5
(2015-03-23, 23:49)colek42 Wrote:
Quote:I've never heard of SQLAlchemy so it would be nice if you could elaborate on that a bit.
SQLAlchemy is an ORM for Python. It manages the relationship between objects and the database, and gives us pooling, security, and many other features for free. It is used in many lightweight web frameworks. Think Django ORM, but with the ability to dig deep into the database structure and behavior. As long as no database-specific features (e.g. JSONB for PostgreSQL) are used, it is agnostic to the database engine. (A thought just popped into my head: this might be a good base for an HTML5 client, but I digress.) There is an awesome community supporting the project, and the documentation is spectacular once you become comfortable with it.

Quote:Is your plan to write a library that can be used in Kodi to communicate with an external tool/service (backend) that implements the database
Yes and no. The GSoC project would not be the place to completely separate the persistent data in Kodi from the core application. However, this would be a start. Over version iterations, all non-client-specific persistent data could be moved to this model if that is the direction the project wants to go. I think Retroplayer would be a perfect test case.
Though your idea is localized to only RetroPlayer, isn't this proposal really the start of a global code refactoring project, and therefore a bad fit as a GSoC idea (even if it might be good in the long run)?

Also, SQLAlchemy is written in Python, so how would that affect Kodi's database performance on slower embedded systems if this were used for very large music/video databases?

Since Kodi's application core is written in C++ and otherwise mostly only dependent on C libraries, would it not be best to begin by evaluating different ORM options and include ORMs written in C++ or C in that evaluation? See the discussion here about the suggestion to search around and evaluate several ORMs anyway: http://forum.kodi.tv/showthread.php?tid=221599
#6
Quote:Though your idea is localized to only RetroPlayer, isn't this proposal really the start of a global code refactoring project, and therefore a bad fit as a GSoC idea (even if it might be good in the long run)?

My project is limited in scope: the proposal is for the backend for Retroplayer. If it turns into something bigger in the future I will support it, but that is not part of this proposal.

Quote:Also, SQLAlchemy is written in Python, so how would that affect Kodi's database performance on slower embedded systems if this were used for very large music/video databases?

An example of insert/update performance, taken from the SQLAlchemy docs:

Code:
classics-MacBook-Pro:sqlalchemy classic$ /usr/local/src/pypy-2.1-beta2-osx64/bin/pypy test.py
SQLAlchemy ORM: Total time for 100000 records 5.88369488716 secs
SQLAlchemy ORM pk given: Total time for 100000 records 3.52294301987 secs
SQLAlchemy Core: Total time for 100000 records 0.613556146622 secs
sqlite3: Total time for 100000 records 0.442467927933 sec

For the numbers of records we are talking about, there are no performance implications. If there are situations where bulk inserts are needed, I can implement a method that inserts them via SQLAlchemy Core, which has very little overhead.
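
A sketch of what that bulk path could look like (the table name and columns are made up for illustration):

Code:
from sqlalchemy import Table, Column, Integer, String, MetaData, create_engine

engine = create_engine('sqlite:///:memory:')
metadata = MetaData()
games = Table('games', metadata,
              Column('id', Integer, primary_key=True),
              Column('title', String))
metadata.create_all(engine)

# Core bulk insert: a single executemany, no per-object unit-of-work bookkeeping.
rows = [{'title': 'Game %d' % i} for i in range(100000)]
with engine.begin() as conn:
    conn.execute(games.insert(), rows)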

The ORM will only have overhead for the records it touches, so if the GUI is only displaying 100 records, only 100 records need to be fetched. The response time for this, on any system, will be at most 100 ms, and would not differ much from a C++ ORM. A C++ library that directly accesses the database would be relatively much faster, but is very hard to debug and extend.

My previous work was in GIS, where one data structure might contain a thousand points, each with attributes, and with methods that had to be run on them upon insert and return. Performance was just fine on an i5.

Quote:Since Kodi's application core is written in C++ and otherwise mostly only dependent on C libraries, would it not be best to begin by evaluating different ORM options and in that evaluation include ORMs written in C++ or C? See discussion here about suggestion to search around and evaluate several ORMs anyway http://forum.kodi.tv/showthread.php?tid=221599

The problem with C++ in this application is that it is statically typed and compiled. In my implementation the database structure will be resolved from the protobuf message definitions, which will make things very easy for both Kodi core developers and add-on developers. With Python I can create a class type at run time; because a class defines the table structure and relations, this class factory will also create a table for persistent storage. I cannot think of a straightforward way to implement this pattern in C++.

While I have never worked with an ORM in C++, it certainly does not sound like fun. Would a C++ implementation be a bit faster? Absolutely -- but I doubt it would be noticeable. The trade-off would be a much higher development time: I can write/debug/test Python about 10x faster than C++, and about 5x faster than Java.

Like I said, I have written a similar solution to this, though much more complicated and with a much higher expected load. It is working extremely well.



I hope I addressed your concerns; let me know what you think.
#7
(2015-03-23, 23:49)colek42 Wrote: I have not worked with protobufs other than a test case to see if we could switch our proprietary message protocol over to them. It ended up being more trouble than it was worth for every developer on my team other than me. I am looking forward to working with them on this project.
My experience with protobuf is that it's very nice once your messages have reached their final state, but it's a lot of hassle during development if you forget to adjust the protobuf definition, etc.

(2015-03-23, 23:49)colek42 Wrote: I have done basic implementations of ZMQ, but would definitely seek some guidance on how to structure the message system. Right now I am thinking push/pull for most of the data. For something that is done on a callback (i.e. media progress) a pub/sub would probably be appropriate.
I have only really used PUB/SUB but the whole setup was a bit special and complicated.

(2015-03-23, 23:49)colek42 Wrote: Yes and no. The GSOC project would not be the place to completely separate the persistent data in Kodi from the core application. However, this would be a start. Over version iterations, the move could be made to move all non-client specific persistent data to this model if that is the direction the project wants to go. I think Retroplayer would be a perfect test case.
Yeah I didn't expect to replace the whole persistence logic in Kodi. That would be too much for a GSoC project.

(2015-03-23, 23:49)colek42 Wrote: It would be written in Python. I see this being an I/O-bound process that can be multiprocess, so Python should be fine; I had no issue with it in my previous work, where the data throughput requirements were much higher. Frankly, the database engine does not matter much. I have the most experience with PostgreSQL, but that seems way too heavy for this application. I will test different database engines (it only takes a change of one line of code) to figure out which has the best performance and suits the needs of the project best. Right now I am thinking SQLite would be the best fit due to its wide support in embedded systems and small binary size. The only drawback I can see is that SQLite does not allow concurrent writes -- I don't think this will be an issue for this application. Another bonus is that we can use in-memory databases for testing!

(2015-03-24, 00:38)colek42 Wrote: Just wanted to add this here as well to show what could be possible. Using a replication protocol such as the one below could allow all of the clients to sync automatically. You could add a new Kodi client to your network simply by installing the binary. Network discovery and replication of the database would make everything just work... but again, this is an example of a future possibility and beyond the scope of GSoC.

http://www.philipotoole.com/replicating-...consensus/

(2015-03-24, 16:12)colek42 Wrote:
Quote:Also, SQLAlchemy is written in Python, so how would that affect Kodi's database performance on slower embedded systems if this were used for very large music/video databases?

Example on performance.

(From SQL Alchemy Docs)
Insert/Update
classics-MacBook-Pro:sqlalchemy classic$ /usr/local/src/pypy-2.1-beta2-osx64/bin/pypy test.py
SQLAlchemy ORM: Total time for 100000 records 5.88369488716 secs
SQLAlchemy ORM pk given: Total time for 100000 records 3.52294301987 secs
SQLAlchemy Core: Total time for 100000 records 0.613556146622 secs
sqlite3: Total time for 100000 records 0.442467927933 sec

For the numbers of records we are talking about, there are no performance implications. If there are situations where bulk inserts are needed, I can implement a method that inserts them via SQLAlchemy Core, which has very little overhead.
Inserts/Updates aren't really of much interest in Kodi because we don't really write that much to the database. Most slow SQL queries we have right now are reads with a lot of JOINs (like getting all movies with all their actors etc).

A few more general points that come to mind from internal discussions we had in the team about how we should do the persistence logic in Kodi and what potential server/client setups could look like:
  • Most users won't have a dedicated server machine, separate from the clients, where the backend would run.
  • It should be possible to use Kodi standalone or in a server/client setup.
  • Users shouldn't have to manually set up a database unless they have a more advanced setup, in which case they usually know what they are doing.
  • Ideally users would be able to have a Kodi installation which is temporarily separated from the rest of the infrastructure (backend/server), watch something, come back and have their updated state synced with the backend/server. From what I remember about Raft, that would not be possible because it denies updates in a partitioned setup that doesn't have enough followers.
  • We use a lot of filtering with SQL queries (especially for smartplaylists), which is a problem when thinking towards NoSQL.
#8
There was some really good work that went into a Python add-on library replacement for Movies, Music, and TV Shows using Mediabrowser server as a backend. Even on powerful systems there is a not-insignificant amount of lag when moving through different menus. On RPi and ARM boxes there is a lot of lag; it is more or less unusable. Here are some relevant threads and posts on it:
http://mediabrowser.tv/community/index.p...oad-times/
http://mediabrowser.tv/community/index.p...ntry169090
http://mediabrowser.tv/community/index.p...ntry169202
http://mediabrowser.tv/community/index.p...ntry169208
http://mediabrowser.tv/community/index.p...odi-addon/

Not sure how relevant all this would be for your proposal, but they had to do a lot just to get it to perform well on reasonably powerful systems.
#9
Quote:Inserts/Updates aren't really of much interest in Kodi because we don't really write that much to the database. Most slow SQL queries we have right now are reads with a lot of JOINs (like getting all movies with all their actors etc).

SQLite is perfect for this use case then. It is also embedded in the Python distribution.

Quote:• Most users won't have a dedicated server machine where the backend would run on separated from clients.
• It should be possible to use Kodi standalone or in a server/client setup.

I have been looking into using a Raft consensus algorithm for distributing data in a swarm for another project I am working on, and I think it may be appropriate here. It would allow synchronizing the different XBMC nodes without having a server, and a headless node would be fairly trivial to implement with the persistent data separated out. The Raft consensus algorithm is a bit much to explain in this post, but this page --> https://raftconsensus.github.io/ explains it well. I have to do a bit more research to figure out the best way it could be implemented. It will add overhead to writes and updates, but will have no effect on reads, joins, etc.

Quote:• Users shouldn't have to manually setup a database unless they have a more advanced setup in which case they usually know what they are doing.

Developers not working on this portion of Kodi should never have to touch it, and if a user has to touch the database I have failed. (Fun fact: the first time I had to write a line of SQL was while setting up a multi-client XBMC installation years ago.)

Quote:• Ideally users would be able to have a Kodi installation which is temporarily separated from the rest of the infrastructure (backend/server), watch something, come back and have their updated state synced with the backend/server. From what I remember about Raft that would not be possible because it denies updates in a partitioned setup that doesn't have enough followers.

This is the split-brain problem. What happens when that person then moves to a connected agent and watches something that updates a row in the same table? It can be resolved with timestamps, but it is a complicated issue. It would definitely be useful for mobile devices, and it is something that will need to be researched to see what others have done.

Right now I am thinking that if the agent is completely disconnected from the other agents, it should know through network discovery that it is no longer part of a network. When it reconnects and sees the other agents, it should roll all of its updates onto them as long as the timestamp is greater than that of the current row. This is just a thought process; volumes of code and papers have been written on this problem, and I am going to defer to that research when it comes time to solve it.
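
As a very rough sketch of that last-writer-wins idea (the column name and row layout are hypothetical, and the real solution would follow the published research rather than this):

Code:
def merge_on_reconnect(local_rows, remote_rows):
    """Roll offline changes onto the network only if they are newer (sketch).

    Both arguments are dicts keyed by primary key; each row carries an
    'updated_at' timestamp set when the row was last written.
    """
    merged = dict(remote_rows)
    for key, row in local_rows.items():
        remote = merged.get(key)
        if remote is None or row['updated_at'] > remote['updated_at']:
            merged[key] = row          # our offline edit wins
    return merged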

Quote:• We use a lot of filtering with SQL queries (especially for smartplaylists) which is a problem when thinking towards NoSQL.

I don't think NoSQL makes sense in most situations. When I have something that works well as a key-value pair I just use either JSONB or HSTORE in Postgres and get the best of both worlds. I'm sure somebody could write an implementation that would work well in NoSQL, but I'm not the guy.

The way I implemented queries with filters before was with a query message. It would look something like this (abstracted to XML -- you would not send XML, but an instance of the MessageQuery protobuf class):


Code:
<MessageQuery>
  <ReqID>Some UUID</>
  <Message>Movies</>
  <Filter>Actor='Mila Kunis', Date<01012008, Genre='Comedy', Name<'Ted', Rating='PG-13'</>
  <Limit>100</>  // For large datasets we want to chunk the data as the user scrolls through it
</MessageQuery>

The return would be a message containing some data relating to the query and an array of protobuf Movie messages. All messages would extend a KodiMessage base class to make this possible.

I found in my last project that the majority of queries are simple like this. For more advanced queries, for which it does not make sense to write a parser, there would be another message you could send with a raw WHERE clause.
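
On the backend, a MessageQuery like the one above would be unpacked into ORM filters, roughly like this (the Movie model and filter handling are simplified sketches of my own; equality only here, range filters like Date< would need a little more work):

Code:
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class Movie(Base):                      # hypothetical mapped message
    __tablename__ = 'movies'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    genre = Column(String)
    rating = Column(String)

engine = create_engine('sqlite:///:memory:')
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

def handle_query(model, filters, limit=100):
    """filters/limit would be pulled out of the MessageQuery protobuf message."""
    return session.query(model).filter_by(**filters).limit(limit).all()

movies = handle_query(Movie, {'genre': 'Comedy', 'rating': 'PG-13'})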
#10
(2015-03-24, 22:31)Dark_Slayer Wrote: There was some really good work that went into a python add-on backend library replacement for Movies, Music, and TV Shows with Mediabrowser server as a backend. Even on powerful systems there is a not-insignificant amount of lag when moving through different menus. On RPi and arm boxes there is a lot of lag, more or less unusable. Here are some relevant threads and posts on it

Not sure how relevant all this would be for your proposal, but they had to do a lot just to get it to perform well on reasonably powerful systems

Quote:
Code:
NumbCalls    TotalTime   CumulativeTime    FunctionName
12139        5.334716    17.656464         JSONObject
165280       3.662913    17.656831         iterscan
11989        1.874261    2.646259          get
886          1.803213    1.803213          <method 'recv' of '_socket.socket' objects>
122831       1.707222    1.707222          <_json.scanstring>
52658        0.988096    2.020283          JSONString
12661        0.846247    0.846247          <method 'getProperty' of 'xbmcgui.Window' objects>
148540       0.788216    0.788216          <built-in method scanner>
371592       0.753044    0.753044          <built-in method end>
1103         0.585379    17.642762         JSONArray
216          0.533063    3.883612          fastItem
1            0.486818    24.753601         getContent
82640        0.438145    0.438145          <built-in method match>
5068         0.378452    0.378452          <method 'setProperty' of 'xbmcgui.ListItem' objects>
158479       0.31497     0.31497           <getattr>
7102         0.281647    0.417157          JSONNumber

1. It looks like their issues were with JSON. Protobufs have some speed issues with Python, but it looks like people have solved them.
2. Also, the front end should be asking for the data in chunks, not the entire contents of a library. They have a huge number of calls; it looks like they were parsing an entire library.
3. There will always be a local copy of the library.
4. I have a Raspberry Pi around here somewhere. I will run some test code on it and on my i7 and post the results. Am I right to assume that the Pi is about the slowest platform Kodi will run on?
5. Deployment will use PyPy, which should be 2-3x faster than CPython for this type of code.
#11
(2015-03-25, 00:38)colek42 Wrote: 4. I have a raspberryPi around here somewhere. I will run some test code on it and my i7 and post the results. Am I right to assume that the rPi is about the slowest platform Kodi will run on?
5. Deployment will use pypy which will be 2-3x faster than cPython for this type of code.

We generally don't recommend anything under the Pi2 now, so if it works reasonably fast on a Pi, that'll be pretty good news.
#12
This test is for inserts, which have much more overhead than reads, since SQLAlchemy uses a "unit of work" pattern that "goes far beyond a simple insert".

For most cases, the SQLAlchemy ORM model will be used.

On my i7 dev machine:
Code:
SqlAlchemy ORM: Total time for 1000 records 0.208664894104 secs
SqlAlchemy ORM pk given: Total time for 1000 records 0.183818101883 secs
SqlAlchemy Core: Total time for 1000 records 0.142248868942 secs
sqlite3: Total time for 1000 records 0.15887093544 sec

On a Raspberry Pi 1 Model B, no overclocking:
Code:
SqlAlchemy ORM: Total time for 1000 records 4.9611260891 secs
SqlAlchemy ORM pk given: Total time for 1000 records 3.21597981453 secs
SqlAlchemy Core: Total time for 1000 records 0.235366106033 secs
sqlite3: Total time for 1000 records 0.237324953079 sec

Running top shows ~100% CPU usage and 4% memory usage on the Pi. It also looks to be nearly perfect O(n) time complexity.


1. Note: this is without compiling the C extensions for SQLAlchemy, which will improve speed. It is also not using PyPy, which would also greatly improve the speed.

2. Test code is here: https://github.com/colek42/CSC205/blob/m...emyTest.py -- it is taken from the SQLAlchemy documentation.

3. It does not include the overhead of protobufs and ZMQ, both of which should be negligible.

4. The results are actually better than I expected on the old Pi. I certainly thought my i7 would perform more than ~20x faster than the Pi.

5. It might be a good idea to place a hard upper limit on the number of rows returned to prevent people from coding bad queries in add-ons.
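
Something as simple as this would do it (the cap itself is a made-up number that real testing would decide):

Code:
MAX_ROWS = 500   # hypothetical cap

def clamp_limit(requested):
    """Never hand an add-on more rows than the backend is willing to serve."""
    if not requested or requested < 1 or requested > MAX_ROWS:
        return MAX_ROWS
    return requested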
#13
(2015-03-24, 16:12)colek42 Wrote: SQLAlchemy ORM: Total time for 100000 records 5.88369488716 secs

For the numbers of records we are talking about, there are no performance implications.


I'm looking at a torrent right now named "26,000_Amiga_Roms" and another named "+25000 roms + emulators", both < 10gb, so 100000 is within an order of magnitude of numbers we could potentially see
#14
(2015-03-25, 04:44)garbear Wrote:
(2015-03-24, 16:12)colek42 Wrote: SQLAlchemy ORM: Total time for 100000 records 5.88369488716 secs

For the numbers of records we are talking about, there are no performance implications.


I'm looking at a torrent right now named "26,000_Amiga_Roms" and another named "+25000 roms + emulators", both < 10gb, so 100000 is within an order of magnitude of numbers we could potentially see


Reads will be an order of magnitude or so faster than writes.

On my i7
Code:
SqlAlchemy ORM: Total time for 1000 records 0.2338078022 secs
SqlAlchemy ORM: Total time for reading 1000 records 0.00899386405945 secs
SqlAlchemy ORM pk given: Total time for 1000 records 0.19210600853 secs
SqlAlchemy Core: Total time for 1000 records 0.150679111481 secs
sqlite3: Total time for 1000 records 0.150583982468 sec



Also, good practice will be to ask for records in chunks to maintain snappy performance. There is no way to display thousands of ROMs on the screen at once, and we should not ask the database to give us all of them. The GUI should ask for a certain number of records above and below what is currently displayed. If we need all of the titles at once and reads become a problem (I don't see this happening), I can always use SQLAlchemy Core and bypass the ORM to get some more speed.
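
A sketch of the windowed read I have in mind (the model and session are whatever the earlier sketches set up; the window size is arbitrary):

Code:
def fetch_window(session, model, center, window=50):
    """Return only the rows around what the GUI is currently showing."""
    start = max(center - window, 0)
    return (session.query(model)
                   .order_by(model.id)
                   .offset(start)
                   .limit(2 * window)
                   .all())

# e.g. fetch_window(session, Movie, center=250) returns rows 200-299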
#15
Pretty cool.
