How I intend to handle a library with 10,000 games
#46
(2015-09-16, 18:12)garbear Wrote: However, game X is independent of game Y. Heimdall's scheduler could exploit this knowledge and execute the rules in this order:

I think I coded a limiter into the scheduler at some point; not sure if I pushed it, though. But IIRC I limited web downloads to only X parallel downloads, which effectively forces the scheduler to behave just as you say. I need to dig through the code, but I definitely tried it successfully :)

And this part of the design was actually because of music and my long-time idea of not having to initiate scans but rather just always do them: go through all items and you have an initial library in seconds, and everything extra, like fanart, comes in when it comes in.
#47
(2015-09-16, 17:29)da-anda Wrote: At least my vision for a new content-related API for add-ons is primarily search capabilities and the possibility to add selected items to the library. With the search API we could show aggregated "recently added" or "trending" media across add-ons and finally make Kodi itself more usable. The days when all media was available offline are over, so browsing the local library is something that's barely done these days (I maybe did it once this year, while I watch Netflix almost daily). And browsing streaming services with millions of entries is also unlikely to be done - thus powerful search and filter options are required IMO and should become a main part of the Kodi UI, not be hidden in a side blade.
Also, the new API should make it easy to explore content, which means it would be good to define media-type-specific categories/tags that add-ons map their equivalents onto. That way we can browse games from several add-ons in one go, depending on the category we're after. Like if I feel like playing arcade stuff, I go into the Arcade category and get an aggregated overview. The main games window or even a category view could look like this:
  • My games (aggregated via a search query)
  • Most popular (aggregated via a search query)
  • New games (aggregated via a search query)
  • By category (list of categories defined by Kodi - could be extended by add-on categories that couldn't be mapped)
  • By platform (platforms aggregated via a search query from add-ons)

Each of these items wouldn't be a dumb VFS item but rather a smart playlist, presented in a FireTV-like way (or Netflix, Google TV, Apple TV; they all do basically the same).
So the worst thing we can do is stick with the VFS for presenting content. It's nice for browsing hierarchical stuff in add-ons, but it has way too many limitations for that and offers a bad user experience (like the "next page" VFS entries that add-on authors need to hack in, etc.).
Pretty much my "vision" as well, but I'm not sure the search API is a silver bullet. The real world is ugly: a lot of content providers just don't provide metadata, require too much work to get it, or provide things that don't fit anywhere other than in some plugin that needs a custom VFS or scripting. It would be great to see these things (aggregated search, scannable content, old plugins) unified as "content add-ons" instead of everyone creating their own separate thing.


(2015-09-16, 18:12)garbear Wrote:
(2015-09-16, 17:29)da-anda Wrote: I'm not sure if you ever tried to import a music library with ~10-20k songs while having the download of additional artist info enabled - it takes ages. On my Pi 2 it took a day or more, and even on my workstation it takes hours. Assuming Heimdall adds a similar load to fetching artist info, this is a very bad user experience.

Heimdall has amazing potential. It operates on first-order logic, which has clear, defined data dependencies and thus can be massively parallelized.

Let's say you have two games and three rules:

Code:
1. If X is a game, a CRC and Platform can be calculated for X.
2. If X has a CRC, the Title and Fanart URL can be found at <insert game web database>
3. If X has a fanart URL, then download and cache the image

Rule 1 might take 15ms for a large CRC. Rule 2 might take 1.5 seconds for a web request. Rule 3 might take 15 seconds for a high-definition image download.

However, game X is independent of game Y. Heimdall's scheduler could exploit this knowledge and execute the rules in this order:

Code:
If X is a game, a CRC and Platform can be calculated for X.
If Y is a game, a CRC and Platform can be calculated for Y.
If X has a CRC, the Title and Fanart URL can be found at <insert game web database>
If Y has a CRC, the Title and Fanart URL can be found at <insert game web database>
If X has a fanart URL, then download and cache the image
If Y has a fanart URL, then download and cache the image

You now have a basic game library in 2*15ms, a detailed game library in 2*1.5s, and a fanart-populated game library in 2*15s.
Is that parallel? Your 6-rule example takes exactly twice as long as the 3-rule one. I've barely looked at Heimdall and don't have any in-depth knowledge of it, but I imagine this is mainly going to be IO-bound. Looking at the source code, it appears single-threaded, with some option for thread pooling. That isn't going to be very fast, nor scale very well. It would have more potential if it addressed that, as there is absolutely no reason it should take anywhere near a day to import 10,000 songs. That is poor design in the current system (or rather, ignoring IO work altogether).
#48
(2015-09-17, 13:40)takoi Wrote: Your 6-rule example takes exactly twice as long as the 3-rule one.

Not to build a basic library. In the 6-rule example above, the rule for "platform" takes 15ms, so you can browse the 2-game library by platform after 2*15ms. In the 3-rule example, the platform for the second game isn't known until the first three rules have been executed, so the time until you can browse by platform is 2*15ms + 1.5s + 15s. By building the library in "stages", we can present a more limited library long before the entire scan is complete.
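
In pseudo-Python, the difference is just the loop order: iterate stage-by-stage across all items instead of item-by-item through all rules (the rule functions and publish_intermediate here are stand-ins for Heimdall tasks, not a real API):

Code:
# Stage-ordered scan: every item passes the cheap rule before any item
# starts an expensive one, so a sparse library is browsable almost at once.
rules = [calc_crc_and_platform,       # ~15 ms per item
         fetch_title_and_fanart_url,  # ~1.5 s per item
         download_and_cache_fanart]   # ~15 s per item

def scan(items):
    for rule in rules:         # outer loop over stages...
        for item in items:     # ...inner loop over every item
            rule(item)
        publish_intermediate() # stand-in: library is browsable after each stage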

(2015-09-17, 13:40)takoi Wrote: Looking at the source code, it appears single-threaded, with some option for thread pooling. That isn't going to be very fast, nor scale very well. It would have more potential if it addressed that, as there is absolutely no reason it should take anywhere near a day to import 10,000 songs. That is poor design in the current system (or rather, ignoring IO work altogether).

@topfs2 What sort of threading model do you envision? And should threads be managed by Heimdall or Kodi?
#49
(2015-09-17, 13:40)takoi Wrote: Is that parallel? Your 6-rule example takes exactly twice as long as the 3-rule one. I've barely looked at Heimdall and don't have any in-depth knowledge of it, but I imagine this is mainly going to be IO-bound. Looking at the source code, it appears single-threaded, with some option for thread pooling. That isn't going to be very fast, nor scale very well. It would have more potential if it addressed that, as there is absolutely no reason it should take anywhere near a day to import 10,000 songs. That is poor design in the current system (or rather, ignoring IO work altogether).

First I'd like to take a step back and consider how much CPU you actually want a background task to eat. Even if you have 4 cores, do you want all of them maxed out just to get your data faster? I had absolutely no problem getting the Python process to 100% (i.e. being GIL-blocked) with the scheduling paradigm, and I don't think it will be any problem scaling it beyond that if so desired.

Heimdall can use many main loops; it's written with a classical event-loop design. It can run everything one task at a time, waiting for each to finish, or in parallel via a thread pool or microthreads, whatever you wish. I opted for a thread pool as it was easiest and showcases the design well enough.

It's essentially the same event-loop design that drives web browsers and GTK apps, AFAIK. Note that Python is single-threaded when interpreting, much like JavaScript. This is not necessarily a problem: Node.js is JavaScript-based but massively parallelized even though the scripts are single-threaded. The reason it works is that everything heavy takes place in C++, which does not lock the interpreter.

To showcase: as you say, most of this is IO-bound, and that is what takes the most wall-clock time even if it uses little CPU. So you shove all IO reading off to either multiple threads (in C++) or a select thread, and they call back to Python when ready. During that time Python (and Heimdall) can continue with smaller tasks as it pleases.

The important nuance between Heimdall's scheduling and our current scraper is that Heimdall can start scraping 100 songs and have read the ID3 tags of all 100 before it fetches HTTP resources. It can do this with one or more threads. Our current scraper system, AFAIK, takes one item at a time, so if you have 5 threads (I'm fairly certain our scraper is single-threaded) it needs to finish those 5 before it can continue to the next 5. Heimdall can finish part of all 100 before continuing with heavier tasks.

There are basically three things needed for intermediate results and better async support, and none of them is really big to add (a rough sketch follows the list):
  1. Let Task.run accept a callback or be allowed to return a promise. Then Heimdall could continue while C++ works in a completely different thread or process
  2. Engine.run needs an intermediate callback; this would let Heimdall call back when ID3 tags are scanned, before starting the next schedule/execution
  3. Let Tasks return an estimated time so the scheduler lets faster tasks through first
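
For point 1, a minimal sketch using concurrent.futures (HttpResourceTask, fetch, and engine are illustrative names, not Heimdall's current API):

Code:
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=4)  # IO runs off the interpreter thread

class HttpResourceTask:
    def run(self, uri):
        # return a promise/future instead of blocking the scheduler
        return pool.submit(fetch, uri)  # fetch() stands in for the real IO

future = HttpResourceTask().run("http://example.org/metadata")
# the engine attaches a callback and keeps scheduling other tasks meanwhile
future.add_done_callback(lambda f: engine.on_intermediate_result(f.result()))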

I did trials on this when I worked on it: basically a for loop over all my music files calling engine.get. If you look at the logs you will notice it has ID3 tags before it fetches HTTP resources.
Basically what happens is this (let's say we have a single thread, but you can see how it would work with more threads):

Code:
get 1 -> Heimdall schedules ID3 task
get 2 -> Heimdall schedules ID3 task
...
get N -> Heimdall schedules ID3 task

Heimdall executes ID3 on 1 -> Heimdall schedules HTTP request for 1
Heimdall executes ID3 on 2 -> Heimdall schedules HTTP request for 2
...
Heimdall executes ID3 on N -> Heimdall schedules HTTP request for N

== Here ID3 tags are ready for all N items, and intermediate results should already have been sent out ==

Heimdall executes HTTP request for 1 -> Heimdall executes callback
Heimdall executes HTTP request for 2 -> Heimdall executes callback
...
Heimdall executes HTTP request for N -> Heimdall executes callback

== Now every file has all its metadata and every callback has executed ==

(2015-09-17, 18:11)garbear Wrote: @topfs2 What sort of threading model do you envision? And should threads be managed by Heimdall or Kodi?

For maximum performance I think the most important part is to let the IO be properly async, which basically means IO does not hold the GIL. So essentially, move the Resource task to C++.
EDIT: And if this isn't enough we could move parts of the engine and critical tasks (say ID3) to C++ so they can execute without disturbing Python. During GSoC I aimed for something portable, and the main focus was not necessarily maximum performance, but I don't think anything stops the main design from being extremely performant.
#50
(2015-09-18, 16:13)topfs2 Wrote: For maximum performance I think the most important part is to let the IO be properly async, which basically means IO does not hold the GIL. So essentially, move the Resource task to C++.
EDIT: And if this isn't enough we could move parts of the engine and critical tasks (say ID3) to C++ so they can execute without disturbing Python. During GSoC I aimed for something portable, and the main focus was not necessarily maximum performance, but I don't think anything stops the main design from being extremely performant.

I'm building the API now, and I'd like to consider performance as much as possible. For example, I'm turning the API calls that process an entire directory into calls that process individual files.

How do you think this API should look? Currently, I scan all the leaves (games) below

Code:
content://content.game.internet.archive/

and add them to the database. Then each game is available on the contentdb:// VFS (URL-encoded, obviously):

Code:
contentdb://game?q={_id:1}

Then, ideally, through an interface similar to xbmc.Monitor, I send this path to Heimdall. I'm not sure this will work. I could always call Python with "contentdb://game?q={_id:1}" as a parameter, similar to how plugins do it, but how would this communicate with the "master" instance?

Once Heimdall gets the path, it calls an API method (still not sure about the function name):

Code:
fileitem = xbmcvfs.getinfo("contentdb://game?q=%7B_id%3A1%7D")

This returns a single fileitem with all the properties known in the database filled out. Whenever Heimdall derives more metadata, it calls

Code:
xbmcvfs.setinfo("contentdb://game?q=%7B_id%3A1%7D", fileitem)

to write the metadata back to the database.
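
Put together, the round trip for one item in the metadata service would look roughly like this (heimdall.scrape is a placeholder for however Heimdall ends up being invoked, and getinfo/setinfo are just the proposed names above):

Code:
import xbmcvfs
import heimdall  # hypothetical binding of script.module.heimdall

def on_item_added(path):
    fileitem = xbmcvfs.getinfo(path)      # proposed call: read known properties
    fileitem = heimdall.scrape(fileitem)  # derive whatever metadata it can
    xbmcvfs.setinfo(path, fileitem)       # proposed call: write it back

on_item_added("contentdb://game?q=%7B_id%3A1%7D")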

Does it look like this could work? Do you see any improvements that could better suit this to Heimdall's needs?
#51
(2015-09-18, 16:13)topfs2 Wrote:
(2015-09-17, 13:40)takoi Wrote: Is that parallel? Your 6-rule example takes exactly twice as long as the 3-rule one. I've barely looked at Heimdall and don't have any in-depth knowledge of it, but I imagine this is mainly going to be IO-bound. Looking at the source code, it appears single-threaded, with some option for thread pooling. That isn't going to be very fast, nor scale very well. It would have more potential if it addressed that, as there is absolutely no reason it should take anywhere near a day to import 10,000 songs. That is poor design in the current system (or rather, ignoring IO work altogether).

First I'd like to take a step back and consider how much CPU you actually want a background task to eat. Even if you have 4 cores, do you want all of them maxed out just to get your data faster? I had absolutely no problem getting the Python process to 100% (i.e. being GIL-blocked) with the scheduling paradigm, and I don't think it will be any problem scaling it beyond that if so desired.

Heimdall can use many main loops; it's written with a classical event-loop design. It can run everything one task at a time, waiting for each to finish, or in parallel via a thread pool or microthreads, whatever you wish. I opted for a thread pool as it was easiest and showcases the design well enough.

It's essentially the same event-loop design that drives web browsers and GTK apps, AFAIK. Note that Python is single-threaded when interpreting, much like JavaScript. This is not necessarily a problem: Node.js is JavaScript-based but massively parallelized even though the scripts are single-threaded. The reason it works is that everything heavy takes place in C++, which does not lock the interpreter.

To showcase: as you say, most of this is IO-bound, and that is what takes the most wall-clock time even if it uses little CPU. So you shove all IO reading off to either multiple threads (in C++) or a select thread, and they call back to Python when ready. During that time Python (and Heimdall) can continue with smaller tasks as it pleases.

The important nuance between Heimdall's scheduling and our current scraper is that Heimdall can start scraping 100 songs and have read the ID3 tags of all 100 before it fetches HTTP resources. It can do this with one or more threads. Our current scraper system, AFAIK, takes one item at a time, so if you have 5 threads (I'm fairly certain our scraper is single-threaded) it needs to finish those 5 before it can continue to the next 5. Heimdall can finish part of all 100 before continuing with heavier tasks.

There are basically three things needed for intermediate results and better async support, and none of them is really big to add:
  1. Let Task.run accept a callback or be allowed to return a promise. Then Heimdall could continue while C++ works in a completely different thread or process
  2. Engine.run needs an intermediate callback; this would let Heimdall call back when ID3 tags are scanned, before starting the next schedule/execution
  3. Let Tasks return an estimated time so the scheduler lets faster tasks through first
*snip*
I'm familiar with how Python works :) Threads in Python are heavyweight and come with a lot of overhead, so the fact that you can reach 100% doesn't tell us much. And the 'Node.js magic' Python can do as well, via libraries such as gevent, Tornado, or asyncio in Python 3. Maybe something to try out before handwriting your own GIL-free code?

To answer your first "question": what I would want is for it to, at any time, keep N connections per site concurrently downloading (where N is high enough to max out the connection without hammering the server, of course), roughly as sketched below. If that maxes out the CPU, I don't see any problem with that; it's only one thread anyway. This is just my theory, but for local metadata I think the story is very different. Local reads are already very fast, and I think it's going to be hard to gain much, if anything, from doing any of that async or concurrently. On the contrary, I think the kind of sporadic reading from all over the place is one of the things that really hurts the performance of the current scanners.
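
Roughly like this, in Python 3.5 asyncio syntax (the per-site limit of 4 and fetch_url are arbitrary placeholders):

Code:
import asyncio
from collections import defaultdict
from urllib.parse import urlparse

limits = defaultdict(lambda: asyncio.Semaphore(4))  # one limiter per host

async def fetch(url):
    async with limits[urlparse(url).hostname]:  # at most 4 in flight per site
        return await fetch_url(url)             # placeholder download coroutine

async def scrape_all(urls):
    return await asyncio.gather(*(fetch(u) for u in urls))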
It's great to hear it wasn't an afterthought, though, and I'm certainly looking forward to seeing this in action! (PS: I meant to say I don't think anything here stops the design from being performant either. So that is said.)
#52
(2015-09-19, 00:56)garbear Wrote: Does it look like this could work? Do you see any improvements that could better suit this to Heimdall's needs?

Sounds like it could work. I honestly didn't have that much of a plan for how it would integrate with the library (didn't want to waste too much time during GSoC on such things :P)

I would probably put the library between Kodi and Heimdall though, so the API used by Kodi is not Heimdall but rather the library. Or Heimdall as an extra third party which is triggered on add and sets info from the sidelines. But perhaps this is how you intended it too? Or maybe you have a better idea and I misunderstood?

Basically, Heimdall was designed to gather metadata for a given resource; that resource can link to other resources, which Heimdall could also fetch, essentially building a subgraph for that resource. In general these links are useful for understanding how items relate. Say many tracks link to the same album: then they share that data. So I reckon that when storing, the library needs to consolidate all these links into some form of meta object.

Say a track links to one album on Spotify and (the same one) on TheAudioDB; before storing, it would make sense to make the track link to a kodi:albumID which in turn links to both Spotify and TheAudioDB. This obviously gets a whole lot more complicated if the track links to one Spotify album and a different one on TheAudioDB, or to multiple albums on Spotify but not on TheAudioDB. These cases Heimdall does not solve; it just tries to gather information. And this is a problem for the entire semantic web, not limited to Heimdall: universal IDs are extremely important.
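
Conceptually something like this (the kodi:albumID scheme is just the idea above; the URIs are made up):

Code:
# Consolidate a track's external album links into one meta object,
# so every track pointing at the same album shares it.
def consolidate(track, external_album_links):
    album = {"_id": "kodi:album/1",           # our own universal id
             "sameAs": external_album_links}  # e.g. Spotify + TheAudioDB URIs
    track["album"] = album["_id"]             # track now links only to our id
    return album

album = consolidate({"title": "Everlong"},
                    ["spotify:album:...", "http://www.theaudiodb.com/album/..."])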

Honestly I think we can limit these cases quite a bit by letting a user select only a single metadata site for each content type. The Universal Scraper could be of great value here though, as I think it has solved these things?

Heimdall is not designed to recursively fetch, sort, or filter data. It's just meant to gather data on the particular resource it's asked about. So, for example, I do not think it would make sense, nor would it work well, to have a folder task which recursively scans the folder for files and scrapes them. It would be fully valid to ask Heimdall to fetch data on a folder, though, where it could return the items it contains (which you could further scrape if you so desire) and the modified date, along with any extra data it could find. Or we could ask it for a meta resource, say the artist "Foo Fighters", and it would fetch links to sites fitting that description.

EDIT: Another thing I never solved with Heimdall is searching for possible matches; at the moment it's designed around the assumption that the metadata is correct. Say we search for the movie named X, but it turns out we wanted the movie named X from another year. It would be useful to let Heimdall fetch all possible resources it could link to, not just the one it thinks it is, and let the user pick the right one.

EDIT2: Another thing: even though Heimdall works with links, which is nice, it really needs to be able to add properties to those links. This is something I was never able to add; I didn't know how to do it nicely in Python. Here is an image of a graph database which contains edges that have data attached to them: https://en.wikipedia.org/wiki/Graph_data...yGraph.png
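
One simple way to do it in plain Python would be to store each link as a subject/predicate/object triple with a dict of properties on the edge itself (nothing Heimdall-specific, just an illustration):

Code:
class Edge:
    # a link between two resources, with data attached to the link itself
    def __init__(self, subject, predicate, obj, **props):
        self.subject, self.predicate, self.object = subject, predicate, obj
        self.props = props  # edge properties, e.g. provenance or confidence

e = Edge("track:1", "onAlbum", "kodi:album/1", source="theaudiodb", confidence=0.9)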
#53
(2015-09-20, 12:55)takoi Wrote: I'm familiar with how Python works :) Threads in Python are heavyweight and come with a lot of overhead, so the fact that you can reach 100% doesn't tell us much. And the 'Node.js magic' Python can do as well, via libraries such as gevent, Tornado, or asyncio in Python 3. Maybe something to try out before handwriting your own GIL-free code?

Awesome to know! I am honestly not extremely good at Python, so any tips on this are highly valued! I don't think it should be overly hard to move to one of those, but I'll try to find some time to read the documentation closely. Definitely agree with trying them out before doing it ourselves; it makes the code much, much more portable.

I have also pondered rewriting it in JavaScript, as I am better at that and it might fit better with the web stuff. Or simply rewriting the core again now that I've been away from it a while (perhaps I would solve things differently now). Moving to a promise-based scheduler would be the first thing I'd try; it usually makes the code so much nicer in JavaScript and forces everything to be neatly async.
#54
(2015-09-10, 16:28)garbear Wrote:
(2015-09-10, 15:50)RockerC Wrote: Does this mean that the effort to design a new DB structure worked on by m.savazzi and others will be abandoned?
sure, you might be addicted to torrenting like crack, but your libraries are never gonna grow to contain 20,000 items. Under the assumptions of this crazy formula, if the average video file runs around 40 minutes and the quality is slightly better than a typical cam, you're looking at 14 TERABYTES of video.

*cough* what's wrong with 14TB of video?

edit: I lie, it's only 9TB, of which 7TB is series (the rest is empty).

Stats:
Code:
Albums     : 2278
Artists    : 1483
Songs      : 6087
Movies     : 183
MovieSets  : 33
TVShows    : 98
Episodes   : 5423
#55
I've seen people with 10k movies and a multitude of episodes. Whether that's useful is another matter :)

Example:
http://forum.kodi.tv/showthread.php?tid=...pid2111201
"60TB of harddrives"
#56
30 TB, just 3.5 TB shy of full
#57
Garbear:
How far have you come with unified content scraping / unified content database? Is there anything I can do to help out in that area?
#58
(2015-09-24, 14:06)evilhamster Wrote: Garbear:
How far have you come with unified content scraping / unified content database? Is there anything I can do to help out in that area?

Here's my current progress: compare/retroplayer-15.1...retroplayer-game-library

What I've done for the game library:
  • Evaluated (and abandoned) LevelDB, Kyoto Cabinet, UnQLite, MongoDB legacy, and MongoDB master
  • Integrated EJDB into the build system
  • Abstracted EJDB as a generic document store
  • Implemented a NoSQL database on this generic document store
  • Added a virtual content:// protocol, e.g. content://content.game.internet.archive/, that acts just like the plugin:// protocol
  • Added the contentdb:// protocol, which allows reading and writing file items and directories
  • Added a simple content crawler that reads from content:// (breadth-first) and writes to contentdb:// (see the sketch after this list)
  • Created a schema for games: GameInfoTag.h
  • Added a game library based on XML nodes at special://xbmc/system/library/game/
  • TODO: Need a filter abstraction so that the game library can make queries to contentdb://
  • TODO: Skinning support
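
The crawler itself is C++, but in Python terms it amounts to this (xbmcvfs.listdir is the real VFS listing call; additem is a stand-in for the contentdb:// write path):

Code:
from collections import deque
import xbmcvfs

def crawl(root="content://content.game.internet.archive/"):
    queue = deque([root])
    while queue:
        url = queue.popleft()               # FIFO queue = breadth-first order
        dirs, files = xbmcvfs.listdir(url)
        for f in files:
            additem("contentdb://game", url + f)    # stand-in write call
        queue.extend(url + d + "/" for d in dirs)   # visit subfolders later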

What I've done for scrapers:
  • TODO: Need to implement service.metadata, which reads from contentdb://, adds metadata using Heimdall, then writes the updated file items back to contentdb://
  • TODO: Need to integrate Heimdall into kodi: script.module.heimdall
  • TODO: Need to develop Heimdall further
  • TODO: Need to integrate Hachoir - almost done thanks to Woerd88, see script.module.hachoir

I feel pretty good about the progress I've made this last month. All the basic infrastructure is in place. The only major C++ left is the filtering stuff for database queries.

It's been a fun month of learning NoSQL and epic build system battles, but I'm archiving my progress to return to more pressing matters (like input and GameStream). I plan to return to this project in the next few months, and hopefully have a working game library later this year or early next year.

I appreciate the willingness to help here, but unfortunately not much skinning or Python work can be done until the C++ is more mature. There are lots of other places to help out in the meantime, though!
#59
If you still require information for games, you could use the API of www.igdb.com

https://www.igdb.com/api/v1/documentation

I'm one of the people behind IGDB; we would love Kodi to use our API :)

Our API (and website) is still in early development, but we would love to improve it on your request! You can always contact us through the website or the forum.
#60
Thanks for the heads up, sharpless512.

One of the things I'd like to do is avoid string matching on game titles. For example, Levenshtein string distance fails when comparing "Super Mario Brothers" against "Super Mario Bros." and "Super Mario Bros. 2".

Most ROMs have unique IDs embedded in the ROM. It would be cool if we could look up a game by this ID. I wrote a small utility called PyRomInfo to extract these IDs. Might be of interest to you.
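
For example, hashing the ROM and querying by hash instead of by title; the lookup URL below is made up, IGDB would need to expose something like it:

Code:
import zlib

def rom_crc32(path):
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):  # stream in 64 KiB chunks
            crc = zlib.crc32(chunk, crc)
    return "%08X" % (crc & 0xFFFFFFFF)

# hypothetical lookup: GET https://www.igdb.com/api/v1/games?crc=<rom_crc32(path)>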
