Release Universal Movie Scraper
(2021-06-14, 20:27)PatK Wrote: Left is universal, right is TMDB

I definitely need to have a look at the country, but for the rest, I think it is hit and miss depending on whether the scraper lands on the old or the new layout.

The previous scraper version scraped the old layout. It was hit and miss: if the scraper landed on the NEW layout it failed, but if it landed on the OLD one it was OK.
With the new scraper version it is the opposite. It is still hit and miss: if the scraper lands on the NEW layout it is OK, but if it lands on the OLD one it fails.

I switched over to the new layout because there were some reports that the previous scraper (supporting the old layout) was failing 90% of the time, which was somewhat confirmed by my tests yesterday.
...but the situation might be different today...

Sorry guys, I am not willing to support both layouts, because that would be a tremendous effort and also very difficult and time-consuming to maintain.
Reply
(2021-06-15, 07:53)T-LO Wrote:
(2021-06-14, 11:20)olympia Wrote: Can you all try scraper version v5.5.0 please and report back?

Kodi 16.1 crashes to desktop without any crash log.

Last line in Kodi.log is:
07:49:45 T:13016   DEBUG: scraper: GetIMDBDirectorsById returned <details><url cache="tt2039338-main.html" function="ParseIMDBDirectors">...
To be honest, I don't even understand this...
You are reporting a crash for a Kodi version on which this scraper cannot even be installed. UMS v5.5.0 is maintained in the Leia scraper repo on purpose.
Reply
(2021-06-15, 10:19)olympia Wrote:
(2021-06-15, 07:53)T-LO Wrote:
(2021-06-14, 11:20)olympia Wrote: Can you all try scraper version v5.5.0 please and report back?

Kodi 16.1 crashes to desktop without any crash log.

Last line in Kodi.log is:
07:49:45 T:13016   DEBUG: scraper: GetIMDBDirectorsById returned <details><url cache="tt2039338-main.html" function="ParseIMDBDirectors">...
To be honest, I don't even understand this...
You report a crash for a Kodi version for which this scraper cannot even be installed. UMS v5.5.0 is maintained in Leia scraper repo on purpose.
Oh sorry, I got an automatic update yesterday and I thought it came from you.

I just checked and I am on 2.9.13.

This is something else then, right?
Reply
(2021-06-15, 14:49)T-LO Wrote: I just checked and I am on 2.9.13.

Yes, 2.9.13 is the latest UMS version for Kodi Jarvis and it is EOL, thus won't get any fixes.
Reply
I just pushed imdb.common scraper version v3.2.1 with the intention of supporting both the old and the new layout (I got fed up with the testing; it resulted in busy, hard-to-follow code, but it should work).

Let me know your results once you've updated.
Reply
(2021-06-15, 16:58)olympia Wrote: I just pushed imdb.common scraper version v3.2.1 with the intention of supporting both old and new layout (I've got fed up with the testing - resulted in busy, hard to follow code, but should work).

Let me know your results once you've got updated.
Thank you for the update.

There is now some intermittent weirdness going on. The country of origin is sometimes scraped correctly and sometimes it is blank. When the country is scraped properly, the genre comes back duplicated, e.g. "Action \ Adventure \ Action \ Adventure". When the country is not scraped, the genre is correct.

The original title field returns the Japanese title 100% of the time.
Reply
Can someone tell me which skin shows the country?
Reply
(2021-06-15, 17:53)Zippy79 Wrote: The original title field returns the Japanese title 100% of the time.

This I cannot reproduce.
Reply
(2021-06-15, 22:06)olympia Wrote: Can someone help me with which skin is showing the country?

I'm not sure I follow. I'm just using the default Estuary skin, you can see the country on the movie information page. If you keep refreshing the movie then sometimes it will switch from scraped to not scraped and vice versa (it can take a lot of refreshes to change but it will eventually). I also verify whether it has been scraped by looking at column c21 in the movies table.
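
The c21 check described above can be sketched in Python. This is only an illustration against a toy in-memory table: the table name `movie`, the title column `c00`, and the sample rows are assumptions made for the example, not taken from a real Kodi database file.

```python
import sqlite3


def movies_missing_country(conn):
    """Return titles whose c21 (country) value is NULL or blank."""
    rows = conn.execute(
        "SELECT c00 FROM movie "
        "WHERE c21 IS NULL OR TRIM(c21) = '' "
        "ORDER BY c00"
    )
    return [title for (title,) in rows]


# Toy stand-in for the video database (assumed schema, made-up rows).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE movie (idMovie INTEGER PRIMARY KEY, c00 TEXT, c21 TEXT)"
)
conn.executemany(
    "INSERT INTO movie (c00, c21) VALUES (?, ?)",
    [("Movie A", "United States"), ("Movie B", ""), ("Movie C", None)],
)

print(movies_missing_country(conn))
```

Running the same query after each refresh would show whether the country flips between scraped and not scraped.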

EDIT: I've figured out why the country of origin is sometimes not scraped.

This is a snippet of the HTML from the new IMDB site:

html:
href="/search/title/?country_of_origin=US&amp;ref_=tt_dt_cn">United States</a>

and this is a snippet of the HTML from the old IMDB site:

html:
<a href="/search/title?country_of_origin=us&ref_=tt_dt_dt"
>USA</a>

Note the lack of a forward slash after the word title in the old layout. Because of this, the regex doesn't match the old site layout, so the country doesn't get scraped.
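
To make the difference concrete, here is a minimal Python sketch using the two snippets above. Both patterns are hypothetical stand-ins written for this illustration, not the expressions from the scraper XML; the only point is that making the slash optional (`/?`) lets one pattern cover both layouts.

```python
import re

# The two snippets quoted above (new layout, then old layout).
NEW_LAYOUT = 'href="/search/title/?country_of_origin=US&amp;ref_=tt_dt_cn">United States</a>'
OLD_LAYOUT = '<a href="/search/title?country_of_origin=us&ref_=tt_dt_dt"\n>USA</a>'

# Hypothetical strict pattern: requires the slash after "title".
STRICT = re.compile(r'/search/title/\?country_of_origin=[^>]*>([^<]+)</a>')

# Hypothetical tolerant pattern: the slash after "title" is optional.
TOLERANT = re.compile(r'/search/title/?\?country_of_origin=[^>]*>([^<]+)</a>')


def countries(pattern, html):
    """Return every country name the pattern captures in the HTML."""
    return [name.strip() for name in pattern.findall(html)]


for label, html in (("new", NEW_LAYOUT), ("old", OLD_LAYOUT)):
    print(label, countries(STRICT, html), countries(TOLERANT, html))
```

With the strict pattern the old layout yields no match, which corresponds to the blank country field.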
Reply
(2021-06-15, 22:06)olympia Wrote:
(2021-06-15, 17:53)Zippy79 Wrote: The original title field returns the Japanese title 100% of the time.

This I cannot reproduce.
I am in Japan, so it's possible that IMDB is doing something based on the country you access it from. However, before the site changed, the scraper always returned the correct original title for me and never just the Japanese title for every movie.

I'm trying to step through the scraper code to see why it's doing this, but it's pretty difficult to decipher. It's the ParseIMDBAKATitles function that scrapes the original title, right?


EDIT: So it's this regex that picks up the original title:

xml:
<expression fixchars="2">meta\sproperty=&quot;og:title&quot;\scontent=&quot;(IMDb\s-\s)?(?:&amp;#x22;)?([^&quot;]*?)(?:&amp;#x22;)? \([^\(]*?([0-9]{4})(?:–\s)?\)</expression>

I've confirmed that when the browser language is set to English this regex returns "Raiders of the Lost Ark" and when the browser language is set to Japanese the regex returns "Reidâsu/Ushinawareta âku". So somehow IMDB is feeding the scraper the Japanese version of the page instead of the English version (based on IP address I assume).
Reply
(2021-06-16, 02:52)Zippy79 Wrote:
(2021-06-15, 22:06)olympia Wrote:
(2021-06-15, 17:53)Zippy79 Wrote: The original title field returns the Japanese title 100% of the time.

This I cannot reproduce.
I am in Japan so it's possible that IMDB is doing something based on the country you access it from. However, before the site changes the scraper always used to return the correct original title for me and never just the Japanese title for every movie.

I'm trying to step through the scraper code to see why it's doing it but it's pretty difficult to decipher. It's the ParseIMDBAKATitles function that scrapes the original title right?


EDIT: So it's this regex that picks up the original title:

xml:
<expression fixchars="2">meta\sproperty=&quot;og:title&quot;\scontent=&quot;(IMDb\s-\s)?(?:&amp;#x22;)?([^&quot;]*?)(?:&amp;#x22;)? \([^\(]*?([0-9]{4})(?:–\s)?\)</expression>

I've confirmed that when the browser language is set to English this regex returns "Raiders of the Lost Ark" and when the browser language is set to Japanese the regex returns "Reidâsu/Ushinawareta âku". So somehow IMDB is feeding the scraper the Japanese version of the page instead of the English version (based on IP address I assume).

Replacing all instances of

Code:
|accept-language=en-us/

with

Code:
/|accept-language=en-us

(i.e. moving the forward slash to before the pipe) seems to fix this issue.
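
A rough way to see why the position of the slash matters is to model the pipe as the separator between the URL and its request options. The helper below is a hypothetical, simplified sketch written for this illustration (it is not Kodi's actual parser), and the IMDb URL is only an example.

```python
def split_pipe_options(raw):
    """Split 'url|name=value&name2=value2' into (url, options).

    Simplified model: everything before the first pipe is the URL,
    everything after it is a list of name=value request options.
    """
    url, _, option_text = raw.partition("|")
    options = {}
    for pair in option_text.split("&"):
        if "=" in pair:
            name, value = pair.split("=", 1)
            options[name] = value
    return url, options


# Slash after the option: it ends up inside the header value.
before_fix = split_pipe_options(
    "https://www.imdb.com/title/tt2039338|accept-language=en-us/"
)
# Slash before the pipe: it stays in the URL and the header value is clean.
after_fix = split_pipe_options(
    "https://www.imdb.com/title/tt2039338/|accept-language=en-us"
)

print(before_fix)
print(after_fix)
```

Under this model the first form sends `en-us/` as the language value, which would plausibly be ignored by the site, while the second form sends a clean `en-us`.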
Reply
Very nice catches @Zippy79, they helped me track down the issues.
Especially the one with accept-language. I am not sure I would've spotted this myself :)

Try imdb.common v3.2.2
Reply
(2021-06-15, 16:58)olympia Wrote: I just pushed imdb.common scraper version v3.2.1 with the intention of supporting both old and new layout (I've got fed up with the testing - resulted in busy, hard to follow code, but should work).

Let me know your results once you've got updated.

Ratings are scraped correctly with 3.2.1 again, great work. Thanks for the fixes.
Reply
(2021-06-16, 18:45)olympia Wrote: Very nice catches @Zippy79, they helped me track down the issues.
Especially the one with accept-language. I am not sure I would've spotted this myself :)

Try imdb.common v3.2.2

Thanks for the update. All fields are looking good except original title: there are four instances of "|accept-language=en-us/" that need changing in metadata.universal/universal.xml. With those changed, the original title is scraped correctly.

EDIT: I spoke too soon. This is an HTML snippet from the old IMDB layout:

html:
<meta property='og:title' content="Indiana Jones and the Raiders of the Lost Ark (1981) - IMDb" />

and this is from the new layout:

html:
<meta property="og:title" content="Raiders of the Lost Ark (1981) - IMDb"/>

Note the difference in the quotes around og:title.
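
As an illustration, a pattern can accept either quote style by capturing the opening quote and requiring the same character to close it. The expression below is a simplified Python sketch written against the two snippets above; it is an assumption for demonstration purposes, not the expression used in universal.xml.

```python
import re

OLD_LAYOUT = """<meta property='og:title' content="Indiana Jones and the Raiders of the Lost Ark (1981) - IMDb" />"""
NEW_LAYOUT = """<meta property="og:title" content="Raiders of the Lost Ark (1981) - IMDb"/>"""

# (["']) captures whichever quote opens the attribute; \1 and \2 require
# the same quote character to close it again.
OG_TITLE = re.compile(
    r"""<meta\s+property=(["'])og:title\1\s+content=(["'])(?P<title>.*?)\s\((?P<year>[0-9]{4})[^)]*\)\s-\sIMDb\2"""
)


def parse_og_title(html):
    """Return (title, year) from an og:title meta tag, or None if absent."""
    match = OG_TITLE.search(html)
    if match is None:
        return None
    return match.group("title"), int(match.group("year"))


print(parse_og_title(OLD_LAYOUT))
print(parse_og_title(NEW_LAYOUT))
```

Both layouts then parse with the same expression, although the old layout still returns the longer title it serves.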

(Also, I think the old site layout is factually incorrect; the original name was just Raiders of the Lost Ark, but that's beside the point :))
Reply
Right, thanks for testing!
I just pushed UMS version v5.5.1. Please see if it is any better.
Reply