WIP Merging Duplicate Movies/TV Episodes
#1
Hi

We have multiple drives with different owners, hence some duplicated content, and multiple clients using a shared database.
For a while now I have been using a patch I created to merge duplicate entries when browsing Kodi.

Duplicate movies and tv episodes are merged into a single directory, which when opened displays all the versions.
This works in a similar way to movie sets.

I'm interested in getting this into Kodi and I'm willing to spend some time working on this.

In the past I found a post requesting this feature, and a dev responded saying this should probably be handled as a plugin (or in the skin), but I couldn't find it again.
Is this on the right track or should this be an addon?

[UPDATE]
Latest Windows build (11th Jan 2016), based on Jarvis b2: here (source)
Previous builds

I'd appreciate if people gave this a try and let me know what you think.
Disclaimer: this is a work in progress and could break things. It does not attempt to modify the database, so it should work with an existing Isengard Alpha environment.

Gotchas/things to test:
* Discovering the best quality version to report (in skin as HD/SD) is loaded in the background. Fixed by 1ae04a1
* TV episodes are sorted incorrectly when a mix of duplicates and non-duplicates exists for a season, because directories are sorted first. - Fixed by 1ae04a1 - However, this breaks IgnoreArticle (ignoring the article). Movies seem to be fine.
* Merged TV items will display "TV show information" rather than "Episode Information" - Fixed by 59ce350
* Filtering was broken - Fixed by 9177e52
* Added into Settings->Video two settings to allow Best Video/Audio/Subtitle, Always Ask or Disabled - Fixed by b164a5b
* I broke IgnoreArticle, with sorting (see Gotcha). - Fixed in Jarvis rework
* Ordering of the web interface is messed up by the inclusion of folders - Fixed in Jarvis rework

Remaining Issues

* Stop using CFileItemList and fix copy changed to reference
* Rework the "list all duplicates" menu item into Kodi; currently it's stealing from the add-on buttons
* Resume info extra is overwritten when best is used by the best item (might be best to leave like this)
* When there are multiple equally good best choices, which one gets chosen is random. It is probably best to sort by path or something similar so results are consistent.
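
A minimal sketch of the consistent tie-break mentioned in the last item, assuming "best" means the highest video resolution. The `Version` struct and `PickBest` function are hypothetical stand-ins for illustration and are not part of Kodi or of the patch.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for one duplicate; this is not Kodi's CFileItem.
struct Version {
  std::string path;
  int videoHeight;  // e.g. 1080 or 720; 0 if unknown
};

// Returns the index of the best version, or -1 for an empty list.
// Higher resolution wins; ties are broken by the lexicographically
// smallest path, so the result does not depend on input order.
int PickBest(const std::vector<Version>& versions) {
  int best = -1;
  for (int i = 0; i < static_cast<int>(versions.size()); ++i) {
    if (best < 0) {
      best = i;
      continue;
    }
    const Version& a = versions[i];
    const Version& b = versions[best];
    if (a.videoHeight > b.videoHeight ||
        (a.videoHeight == b.videoHeight && a.path < b.path))
      best = i;
  }
  return best;
}
```

With two 1080p copies on driveA and driveB, this picks the driveA copy regardless of the order in which the items were loaded.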

Missing Features I would like to add:
* User supplied regex matching to each individual duplicate such that they can be distinguished - currently the path is added to the title
Also see conversations below

Comments, criticisms, feature requests are all welcome.

[UPDATE]
Git repo https://github.com/rsanger/xbmc

Thanks,
Richard Sanger
#2
Thank You
I believe this kind of feature should be integral part of official Kodi someday.
#3
A while ago I started working on something similar, see https://github.com/Montellese/xbmc/commi...ow_cleanup (some commits are not related). But I took a different approach where the metadata of an item (movie, episode, ...) could be linked to multiple files. That way you wouldn't have to deal with different metadata for the same item. The implementation in the GUI seems to be similar because when you click on such an item it will open like a movie set and show you all available items.

But I never finished it because there are so many special and edge cases that need attention like which version do you play when a user chooses "Play from here" and other such stuff.
Always read the online manual (wiki), FAQ (wiki) and search the forum before posting.
Do not e-mail Team Kodi members directly asking for support. Read/follow the forum rules (wiki).
Please read the pages on troubleshooting (wiki) and bug reporting (wiki) before reporting issues.
#4
(2015-03-15, 19:25)Montellese Wrote: A while ago I started working on something similar, see https://github.com/Montellese/xbmc/commi...ow_cleanup (some commits are not related). But I took a different approach where the metadata of an item (movie, episode, ...) could be linked to multiple files. That way you wouldn't have to deal with different metadata for the same item. The implementation in the GUI seems to be similar because when you click on such an item it will open like a movie set and show you all available items.

But I never finished it because there are so many special and edge cases that need attention like which version do you play when a user chooses "Play from here" and other such stuff.

Could you have a Duplicates node for managing them, similar to the Sets node? This would allow you to remove duplicates from the library, and also to set which of the duplicates is the default for actions such as "Play from here", where offering a dialogue to choose is not practical. If there is no user-set default, simply use the first item that was added to the library.

This would negate the need for any attempt at clever logic which will never satisfy everyone.
#5
(2015-03-15, 19:56)jjd-uk Wrote: Could you have a Duplicate node for managing similar to the Sets node, this would allow you to remove Duplicates from the Library, and also set which of the duplicates is the Default for actions such as "Play from here" where offering a dialogue to choose is not practical. If there is no user set Default simply use the first item that was added to the Library.

This would negate the need for any attempt at clever logic which will never satisfy everyone.

It doesn't really require a special node for that (but obviously the duplicates are easier to find there). But we can't expect every user to go through all of their duplicates all the time to set a sane default before running into any of the special/edge cases. It's also not necessarily about how to choose which item to play. The problem is more finding/catching all the edge cases, because they are all over the place and almost impossible to track. Obviously, what we could do is not even really try to find the special/edge cases, just add support for it, and then let the users point out the places that don't work ;)
#6
Thanks for all the feedback

(2015-03-15, 19:25)Montellese Wrote: A while ago I started working on something similar, see https://github.com/Montellese/xbmc/commi...ow_cleanup (some commits are not related). But I took a different approach where the metadata of an item (movie, episode, ...) could be linked to multiple files. That way you wouldn't have to deal with different metadata for the same item. The implementation in the GUI seems to be similar because when you click on such an item it will open like a movie set and show you all available items.

But I never finished it because there are so many special and edge cases that need attention like which version do you play when a user chooses "Play from here" and other such stuff.

I've taken a look at Montellese's branch and it looks very similar to what I have implemented currently.

The use of the database to keep lists of related files seems like a good idea and could reduce load times of a database with many duplicates, particularly on low-end devices.

I don't like the idea of merging video/sound and subtitles together independently in GroupUtils.cpp; different files are likely to be slightly different lengths, which would cause out-of-sync sound etc. NOTE: This actually doesn't work because the stream information is not yet being loaded (see Issue 1 below).

Here's a quick overview of the path through Kodi to load a file list.
  1. User clicks (Movies->titles) most logic handled in CGUIMediaWindow::Update()
  2. Background thread grabs items from database (VideoDatabase.cpp::GetMoviesByWhere()) along with media info -- main thread waits for completion updating spinner etc.
  3. Filters and grouping sorting etc. applied
    - Here we group identical items GroupUtils::Group()
  4. Load thumbs and stream info GetGroupedItems::Load() calls CBackgroundInfoLoader::Run() on a background thread
    - We have hooks/callback at start and the end of update
  5. All items are looped through on said background thread, calling CVideoThumbLoader::LoadItemCached() which will get the stream details via CVideoDatabase::GetStreamDetails
  6. For those items not found in cache CVideoThumbLoader::LoadItemLookup() loads stream info from the file directly and caches it for next time
    - For each item, once steps 5-6 are complete, the OnItemLoaded() callback hook fires. This doesn't seem to be used for the GUI update; I'm not 100% sure when the update gets pushed to the GUI.
  7. We are done; the skin now shows media info such as HD etc.
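
To make the grouping in step 3 concrete, here is a rough sketch of merging identical items into one entry. It is not how GroupUtils::Group() is written; `Item`, `Group` and `GroupDuplicates` are hypothetical names, and matching on title plus year is an assumption made only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a library item loaded in step 2.
struct Item {
  std::string title;
  int year;
  std::string path;
};

// One listing entry after grouping; it acts like a directory when
// it holds more than one version.
struct Group {
  std::string title;
  int year;
  std::vector<std::string> paths;  // one entry per version
  bool IsFolder() const { return paths.size() > 1; }
};

// Merge items that share the same (title, year) key, keeping the
// order in which each key was first seen.
std::vector<Group> GroupDuplicates(const std::vector<Item>& items) {
  std::vector<Group> groups;
  std::map<std::pair<std::string, int>, std::size_t> index;
  for (const Item& item : items) {
    auto key = std::make_pair(item.title, item.year);
    auto it = index.find(key);
    if (it == index.end()) {
      index.emplace(key, groups.size());
      groups.push_back({item.title, item.year, {item.path}});
    } else {
      groups[it->second].paths.push_back(item.path);
    }
  }
  return groups;
}
```

Note that nothing here knows about stream details yet, which is exactly the limitation described under Issue 1.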


Issue 1


One of the biggest problems at the moment is that the stream data is not loaded until after the grouping, which is the simplest place to find the best stream to report for HD/SD. This also means that "play" in the context menu could play this best stream, and if we could add back "play from here" (disabled because the items are directories) we could play the best streams we've found.


Solution (1 seems to be the best)
  1. I think jjd-uk's idea of creating a duplicate node as a subclass of CFileItem would be useful (This might already exist), then we could maintain a list of subnodes and hence update cached info for all children and select the best in the background step getting stream details. We could also re-implement "play from here" in a sane manner and other edge cases as they arise. Looking at Montellese's branch I think this might be as simple as moving this logic from Group to BackgroundInfoLoader.
  2. Delay grouping until after media streams are loaded on the callback. This seems like a bad way to go as it would likely confuse the user, as items suddenly disappear as they are merged.
  3. Get cached stream details when we initially load the items from the database (i.e. do step 5. in step 1.), however this would still require special handling for the slower non-cached case.

Look and feel

Should we have separate settings switches for TV shows, Movies and Music Videos?
A bit more complex for the user but could be faster on low-end devices, particularly when listing movies due to the larger number of items.

Should we 1) provide this as a directory if clicked upon as currently implemented, or 2) just default to the best stream and put alternative streams hidden in the context menu?
I like the idea of 2. But we could provide a setting to switch between these modes?

How should we differentiate files? Could we show the shortest difference in the path here? And/or let the users decide some form of regex matching?
- E.g. {/media/driveA/movies/mov.mkv , /media/driveB/mov.mkv} would become {driveA , driveB}
- E.g. {/media/driveA/movies/mov.mkv , /media/driveB/mov.mkv , /media/driveB/movies/mov.mkv} would become {driveA , driveB , driveB/movies }
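
One possible way to compute such labels, sketched under a few assumptions: paths use '/' separators, the file name is dropped, and the versions differ somewhere in their directories. `Split`, `SharesPrefix` and `ShortLabels` are hypothetical helpers written for this example, not existing Kodi functions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

using Parts = std::vector<std::string>;

// Split a '/'-separated path into its non-empty components.
Parts Split(const std::string& path) {
  Parts parts;
  std::stringstream ss(path);
  std::string part;
  while (std::getline(ss, part, '/'))
    if (!part.empty()) parts.push_back(part);
  return parts;
}

// True if the first k components of a are also the first k of b.
bool SharesPrefix(const Parts& a, std::size_t k, const Parts& b) {
  return b.size() >= k && std::equal(a.begin(), a.begin() + k, b.begin());
}

std::vector<std::string> ShortLabels(const std::vector<std::string>& paths) {
  std::vector<Parts> rest;
  for (const std::string& p : paths) {
    Parts parts = Split(p);
    if (!parts.empty()) parts.pop_back();  // drop the file name
    rest.push_back(parts);
  }
  // Drop leading directories shared by every path (e.g. "media"),
  // but always leave at least one component per path.
  while (!rest.empty()) {
    bool shared = true;
    for (const Parts& r : rest)
      if (r.size() < 2 || r[0] != rest[0][0]) { shared = false; break; }
    if (!shared) break;
    for (Parts& r : rest) r.erase(r.begin());
  }
  std::vector<std::string> labels;
  for (std::size_t i = 0; i < rest.size(); ++i) {
    // Use the shortest leading run of directories that no other path
    // shares; fall back to the whole remainder if there is none.
    std::size_t k = 1;
    for (; k < rest[i].size(); ++k) {
      bool clash = false;
      for (std::size_t j = 0; j < rest.size(); ++j)
        if (j != i && SharesPrefix(rest[i], k, rest[j])) { clash = true; break; }
      if (!clash) break;
    }
    k = std::min(k, rest[i].size());
    std::string label;
    for (std::size_t n = 0; n < k; ++n)
      label += (n ? "/" : "") + rest[i][n];
    labels.push_back(label);
  }
  return labels;
}
```

For the two examples above this yields {driveA, driveB} and {driveA, driveB, driveB/movies}. Files that sit in the same directory would get identical labels, which is where user-supplied names or regex matching would still be needed.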

I'm going to continue working on getting media details for all files (Issue 1). Most problems after that are tweaks to look and feel, plus edge cases which we can deal with as they are found.

Any more thoughts on this would be appreciated or things I've overlooked.
#7
One suggestion: integrate some features of the VideoExtras plugin, which merges movies with extras or bonus discs.

If you group different versions of the same file into a virtual directory and show all the files if selected, you may also show specials for this file in here.
So if you have

/media/driveA/movies/movie (2015).3D-TAB.iso
/media/driveB/movies/movie (2015).BLURAY.iso
/media/driveC/movies/movie (2015).DVD.iso
/media/driveC/movies/movie (2015).DVD.extras-BonusDisc.iso
/media/driveC/movies/movie (2015).extras-Making of.mkv

You could show in the selection what file to play for "Movie (2015)" something like this:

3D - driveA
BD - driveB
DVD - driveC
BonusDisc - driveC
Making of - driveC
#8
(2015-03-21, 05:48)rsanger Wrote: The use of the database to keep lists of related files seems like a good idea and could reduce load times of a database with many duplicates particularly on low-end devices.
Yes it makes grouping easier and more unified (e.g. also for other APIs like JSON-RPC). It however has the disadvantage that you e.g. can't store different artwork per version.

(2015-03-21, 05:48)rsanger Wrote: I don't like the idea of merging video/sound and subtitles together independently GroupUtils.cpp, different files are likely to be slightly different lengths and would cause out of sync sound etc.
I don't follow. When you would play such an item you wouldn't play the grouped item but one of the specific versions so the streamdetails in the grouped item wouldn't have any influence on playback but only on how it is displayed in the GUI listing.

(2015-03-21, 05:48)rsanger Wrote: NOTE: This actually doesn't work because the stream information is not yet being loaded (see Issue 1 below).
That's correct.

(2015-03-21, 05:48)rsanger Wrote: Issue 1

One of the biggest problems at the moment is that the stream data is not loaded until after the grouping, which is the simplest place to find the best stream to report for HD/SD. This also means that
"play" in the context menu could play this best stream and if we could add back "play from here" (disabled because the items are directories) we could play the best streams we've found.


Solution (1 seems to be the best)
  1. I think jjd-uk's idea of creating a duplicate node as a subclass of CFileItem would be useful (This might already exist), then we could maintain a list of subnodes and hence update cached info for all children and select the best in the background step getting stream details. We could also re-implement "play from here" in a sane manner and other edge cases as they arise. Looking at Montellese's branch I think this might be as simple as moving this logic from Group to BackgroundInfoLoader.
  2. Delay grouping until after media streams are loaded on the callback. This seems like a bad way to go as it would likely confuse the user, as items suddenly disappear as they are merged.
  3. Get cached stream details when we initially load the items from the database (i.e. do step 5. in step 1.), however this would still require special handling for the slower non-cached case.
IMO option 1 is the only way that really makes sense both from a development and a user point of view. Furthermore determining the actual item to be played is something that only really needs to happen after the user has chosen to play the item.

(2015-03-21, 05:48)rsanger Wrote: Should we 1) provide this as a directory if clicked upon as currently implemented, or 2) just default to the best stream and put alternative streams hidden in the context menu?
I like the idea of 2. But we could provide a setting to switch between these modes?
IMO the problem with simply playing one version is that it will be very hard to find a generic way to determine the version to be played. Some people prefer the version with highest resolution, others the version with best audio, others the version in a specific language and so on and so forth. Therefore I think that there's no generic way to do this and we'll have to provide users with the possibility to choose between different approaches and to allow them to override that approach for specific items by manually defining the best version.

(2015-03-21, 05:48)rsanger Wrote: How should we differentiate files, could we show the shortest difference in the path here? And/Or let the users decide some form of regex matching?
- E.g. {/media/driveA/movies/mov.mkv , /media/driveB/mov.mkv} would become {driveA , driveB}
- E.g. {/media/driveA/movies/mov.mkv , /media/driveB/mov.mkv , /media/driveB/movies/mov.mkv} would become {driveA , driveB , driveB/movies }
For a start I was just using the same title with the possibility for users to specify a specific name (e.g. "Director's Cut") per version.
#9
Thanks for addressing all those points

I've made some progress and have put this up on github.

(2015-03-25, 14:31)Montellese Wrote:
(2015-03-21, 05:48)rsanger Wrote: The use of the database to keep lists of related files seems like a good idea and could reduce load times of a database with many duplicates particularly on low-end devices.
Yes it makes grouping easier and more unified (e.g. also for other APIs like JSON-RPC). It however has the disadvantage that you e.g. can't store different artwork per version.

For the time being, I'll focus on my version which doesn't have the database changes. Primarily because this makes it easier for everyone to test. However if it is decided we should modify the database it should be simple enough at the end.

(2015-03-25, 14:31)Montellese Wrote:
(2015-03-21, 05:48)rsanger Wrote: I don't like the idea of merging video/sound and subtitles together independently GroupUtils.cpp, different files are likely to be slightly different lengths and would cause out of sync sound etc.
I don't follow. When you would play such an item you wouldn't play the grouped item but one of the specific versions so the streamdetails in the grouped item wouldn't have any influence on playback but only on how it is displayed in the GUI listing.
Yes you are correct, ignore that comment.

(2015-03-25, 14:31)Montellese Wrote:
(2015-03-21, 05:48)rsanger Wrote: Issue 1

One of the biggest problems at the moment is that the stream data is not loaded until after the grouping, which is the simplest place to find the best stream to report for HD/SD. This also means that
"play" in the context menu could play this best stream and if we could add back "play from here" (disabled because the items are directories) we could play the best streams we've found.


Solution (1 seems to be the best)
  1. I think jjd-uk's idea of creating a duplicate node as a subclass of CFileItem would be useful (This might already exist), then we could maintain a list of subnodes and hence update cached info for all children and select the best in the background step getting stream details. We could also re-implement "play from here" in a sane manner and other edge cases as they arise. Looking at Montellese's branch I think this might be as simple as moving this logic from Group to BackgroundInfoLoader.
  2. Delay grouping until after media streams are loaded on the callback. This seems like a bad way to go as it would likely confuse the user, as items suddenly disappear as they are merged.
  3. Get cached stream details when we initially load the items from the database (i.e. do step 5. in step 1.), however this would still require special handling for the slower non-cached case.
IMO option 1 is the only way that really makes sense both from a development and a user point of view. Furthermore determining the actual item to be played is something that only really needs to happen after the user has chosen to play the item.

I've been making some good progress towards this in my branch, now on GitHub. I've been using CFileItemList to hold all merged CFileItems. One annoyance is that the code assumes all items are CFileItems and often copies them as such, losing any subclass information, so there are some hacks to work around this.

(2015-03-25, 14:31)Montellese Wrote:
(2015-03-21, 05:48)rsanger Wrote: Should we 1) provide this as a directory if clicked upon as currently implemented, or 2) just default to the best stream and put alternative streams hidden in the context menu?
I like the idea of 2. But we could provide a setting to switch between these modes?
IMO the problem with simply playing one version is that it will be very hard to find a generic way to determine the version to be played. Some people prefer the version with highest resolution, others the version with best audio, others the version in a specific language and so on and so forth. Therefore I think that there's no generic way to do this and we'll have to provide users with the possibility to choose between different approaches and to allow them to override that approach for specific items by manually defining the best version.
I quite like the idea of being able to just hit play and have, say, the highest quality video selected; however, I understand this won't work for everyone. I think we could add a couple of modes of operation here: best video, best audio or always ask.
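
As a rough illustration of how those modes and a manual per-item override could fit together: the `PlayMode` enum, the `Version` struct and `ChooseVersion` below are hypothetical, and ranking audio by channel count is only an assumption for the sketch.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical setting values; the names are illustrative only.
enum class PlayMode { BestVideo, BestAudio, AlwaysAsk };

// Hypothetical stand-in for one version and its stream details.
struct Version {
  std::string path;
  int videoHeight;    // e.g. 1080
  int audioChannels;  // e.g. 6 for 5.1
};

// Returns the index to play, or -1 when the user should be asked.
// A manual per-item choice (manualDefault >= 0) overrides the mode.
int ChooseVersion(const std::vector<Version>& v, PlayMode mode,
                  int manualDefault = -1) {
  if (manualDefault >= 0 && manualDefault < static_cast<int>(v.size()))
    return manualDefault;
  if (mode == PlayMode::AlwaysAsk || v.empty())
    return -1;
  int best = 0;
  for (int i = 1; i < static_cast<int>(v.size()); ++i) {
    int a = mode == PlayMode::BestVideo ? v[i].videoHeight : v[i].audioChannels;
    int b = mode == PlayMode::BestVideo ? v[best].videoHeight : v[best].audioChannels;
    if (a > b) best = i;
  }
  return best;
}
```

Returning -1 leaves it to the caller to show a selection dialog, so the same function covers both the "just hit play" and the "always ask" behaviour.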

(2015-03-25, 14:31)Montellese Wrote:
(2015-03-21, 05:48)rsanger Wrote: How should we differentiate files, could we show the shortest difference in the path here? And/Or let the users decide some form of regex matching?
- E.g. {/media/driveA/movies/mov.mkv , /media/driveB/mov.mkv} would become {driveA , driveB}
- E.g. {/media/driveA/movies/mov.mkv , /media/driveB/mov.mkv , /media/driveB/movies/mov.mkv} would become {driveA , driveB , driveB/movies }
For a start I was just using the same title with the possibility for users to specify a specific name (e.g. "Director's Cut") per version.

That sounds like a good idea. I guess this would require some extra storage in the database, so this is probably one of the last aspects I'll focus on.
#10
(2015-03-31, 13:29)rsanger Wrote: I've been using the CFileItemList to hold all merged CFileItem. One annoyance is that it is assumed that all items will be CFileItems and are often copied as such losing any subclass information, so there are some hacks to work around this.

What is the advantage of holding the merged CFileItem instances in a CFileItemList instance? IMO this sounds like a major change in the CFileItemList class which is used all over the place and every one of these places probably assumes that every item is a CFileItem and not a CFileItemList and has quite some potential for weird bugs.
#11
(2015-03-31, 13:46)Montellese Wrote: What is the advantage of holding the merged CFileItem instances in a CFileItemList instance? IMO this sounds like a major change in the CFileItemList class which is used all over the place and every one of these places probably assumes that every item is a CFileItem and not a CFileItemList and has quite some potential for weird bugs.

The main reason for this is to store the items that made up the group such that the background thread loading streaminfo and thumbnail loading can access each and get the streaminfo. Keeping the entire item makes it simple to call LoadInfo...() on each and then pick the best etc. It also means we have the path if we want to default to playing the best item.

Another option for dealing with this would be to keep a list of the file paths against a CFileItem and recreate these items as needed from this.
#12
(2015-03-31, 14:08)rsanger Wrote:
(2015-03-31, 13:46)Montellese Wrote: What is the advantage of holding the merged CFileItem instances in a CFileItemList instance? IMO this sounds like a major change in the CFileItemList class which is used all over the place and every one of these places probably assumes that every item is a CFileItem and not a CFileItemList and has quite some potential for weird bugs.

The main reason for this is to store the items that made up the group such that the background thread loading streaminfo and thumbnail loading can access each and get the streaminfo. Keeping the entire item makes it simple to call LoadInfo...() on each and then pick the best etc. It also means we have the path if we want to default to playing the best item.

Another option for dealing with this would be to keep a list of the file paths against a CFileItem and recreate these items as needed from this.

IMO you only need the information about all the available versions when the user wants to play/queue directly from the grouped list. In all other cases that information is useless and for that the change seems rather risky. It should be enough to retrieve all the necessary streamdetails and file paths once the user initiates that action instead of having to keep it around all the time.
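
A small sketch of that on-demand approach, assuming the grouped item carries only a shared id and the versions are fetched when playback is requested. `GroupedItem`, `VersionLookup` and `ResolvePlayPath` are hypothetical; the lookup stands in for whatever database query would be used, and "highest resolution" is just one possible selection rule.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for one playable version.
struct Version {
  std::string path;
  int videoHeight;
};

// Stand-in for a database query keyed by a library id (an assumption).
using VersionLookup = std::function<std::vector<Version>(int id)>;

// What the listing keeps around: no paths and no stream details.
struct GroupedItem {
  int id;             // id of the shared metadata
  std::string title;  // what the listing shows
};

// Called only when the user plays or queues the grouped item.
// Returns an empty string if no version was found.
std::string ResolvePlayPath(const GroupedItem& item, const VersionLookup& lookup) {
  std::vector<Version> versions = lookup(item.id);
  std::string best;
  int bestHeight = -1;
  for (const Version& v : versions) {
    if (v.videoHeight > bestHeight) {
      bestHeight = v.videoHeight;
      best = v.path;
    }
  }
  return best;
}
```

The listing code never sees the per-version data, so nothing that already assumes plain items has to change; only the play/queue path calls the lookup.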
#13
A little OT, but also partly related to the CFileItemList discussion: I'd like to see the possibility of showing grouped items with expanded subitems, and CFileItemList seems to enable this, right? One use case would be to drop the separator item from our settings system, forward the settings groups to skins, and let skins decide how to display these groups (with separator, without, ...). Another use case would be an improved search feature that shows grouped results: a music search could have a CFileItemList for each content type (Artists, Albums, Songs, Genre, ...) containing the related search results, all displayed on the same search result page (similar to the Global Search add-on).
#14
(2015-04-02, 17:58)Montellese Wrote:
(2015-03-31, 14:08)rsanger Wrote:
(2015-03-31, 13:46)Montellese Wrote: What is the advantage of holding the merged CFileItem instances in a CFileItemList instance? IMO this sounds like a major change in the CFileItemList class which is used all over the place and every one of these places probably assumes that every item is a CFileItem and not a CFileItemList and has quite some potential for weird bugs.

The main reason for this is to store the items that made up the group such that the background thread loading streaminfo and thumbnail loading can access each and get the streaminfo. Keeping the entire item makes it simple to call LoadInfo...() on each and then pick the best etc. It also means we have the path if we want to default to playing the best item.

Another option for dealing with this would be to keep a list of the file paths against a CFileItem and recreate these items as needed from this.

IMO you only need the information about all the available versions when the user wants to play/queue directly from the grouped list. In all other cases that information is useless and for that the change seems rather risky. It should be enough to retrieve all the necessary streamdetails and file paths once the user initiates that action instead of having to keep it around all the time.

I will try to remove CFileItemList. I'll probably end up putting some extra fields into CFileItem to hold the paths (but not the full CFileItems). I don't know the code very well and wouldn't want to break things with this change.

(2015-03-31, 13:29)rsanger Wrote:
(2015-04-02, 17:58)Montellese Wrote:
(2015-03-31, 13:29)rsanger Wrote: How should we differentiate files, could we show the shortest difference in the path here? And/Or let the users decide some form of regex matching?
- E.g. {/media/driveA/movies/mov.mkv , /media/driveB/mov.mkv} would become {driveA , driveB}
- E.g. {/media/driveA/movies/mov.mkv , /media/driveB/mov.mkv , /media/driveB/movies/mov.mkv} would become {driveA , driveB , driveB/movies }
For a start I was just using the same title with the possibility for users to specify a specific name (e.g. "Director's Cut") per version.

That sounds like a good idea. I guess this would require some extra storage in the database, so this is probably one of the last aspects I'll focus on.
I've noticed that the existing Context Menu->Manage->Edit Title already does this.

PS: I've also added an updated Windows build if anyone wants to test it; see the updates to the original post. Check out Settings->Video to choose between Autoplay Best Video and Always Ask which duplicate to play.
#15
(2015-04-07, 09:36)rsanger Wrote: I will try to remove CFileItemList, I'll probably end up putting some extra fields into CFileItem to hold the paths(but not the full CFileItems). I don't know the code very well and wouldn't want to break things with this change.
I still don't see why you need all the paths in the CFileItem. You only really need them when you want to play/queue the item and then you can simply get all the paths that you need. The only disadvantage I can see with that approach is that it needs to be handled differently for video and music items.

(2015-04-07, 09:36)rsanger Wrote: I've noticed that the existing Context Menu->Manage->Edit Title already does this.

Yes, but that will not work without adjustments: it changes the title of the movie, which affects all versions, because we only store that information once and link the different files to it. So we'll need a way to store an additional version-specific title.
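
To illustrate the distinction, here is a minimal sketch of shared metadata linked to several files, each with an optional version-specific name. `MovieEntry`, `FileVersion` and `DisplayLabel` are hypothetical and say nothing about the actual database schema.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One file linked to the shared metadata (hypothetical layout).
struct FileVersion {
  std::string path;
  std::string versionTitle;  // e.g. "Director's Cut"; empty if unset
};

// Metadata stored once and shared by every version.
struct MovieEntry {
  std::string title;
  std::vector<FileVersion> versions;
};

// Label for one version: the shared title plus the version-specific
// name, if the user has set one.
std::string DisplayLabel(const MovieEntry& movie, std::size_t index) {
  const FileVersion& v = movie.versions.at(index);
  return v.versionTitle.empty() ? movie.title
                                : movie.title + " (" + v.versionTitle + ")";
}
```

Editing `title` changes every version's label at once, while `versionTitle` only affects the one file it belongs to.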