2018-08-15, 21:39
The ParseIMDBOutline function of the v3.0.7 IMDB metadata scraper does not return a clean result for web pages whose outline contains a "See full summary..." link: the link text ends up in the returned outline.
Example: for the listing of "A Fork in the Road (2009)", https://www.imdb.com/title/tt1117386/?ref_=nv_sr_1
the function returns:
12:22:17.193 T:524 DEBUG: scraper: ParseIMDBOutline returned <details><outline>Will, an escaped convict, inadvertently takes refuge in a barn the same night the owners, April and Martin, get into a terrible fight. A gun shot goes off inside the house. April drags ...
See full summary »</outline></details>
Adding a new regex to the function, so that it looks like this:
Code:
<ParseIMDBOutline dest="5">
    <RegExp input="$$2" output="&lt;details&gt;\1&lt;/details&gt;" dest="5">
        <RegExp input="$$1" output="&lt;outline&gt;\1&lt;/outline&gt;" dest="2">
            <expression fixchars="1" trim="1">&lt;div class="summary_text"&gt;(.+?)&lt;div\sclass</expression>
        </RegExp>
        <RegExp input="$$1" output="&lt;outline&gt;\1&lt;/outline&gt;" dest="2">
            <expression fixchars="1" trim="1">&lt;div class="summary_text"&gt;(.+?)&lt;a\shref="[^"]*"\s*&gt;Add\sa\sPlot</expression>
        </RegExp>
        <!-- Stop the outline at a link whose href ends in =tt_ov_pl -->
        <RegExp input="$$1" output="&lt;outline&gt;\1&lt;/outline&gt;" dest="2">
            <expression fixchars="1" trim="1">&lt;div class="summary_text"&gt;(.+?)&lt;a\shref="(.+?)=tt_ov_pl"</expression>
        </RegExp>
        <expression noclean="1" />
    </RegExp>
</ParseIMDBOutline>
returns a more readable result:
12:31:42.499 T:4648 DEBUG: scraper: ParseIMDBOutline returned <details><outline>Will, an escaped convict, inadvertently takes refuge in a barn the same night the owners, April and Martin, get into a terrible fight. A gun shot goes off inside the house. April drags ...</outline></details>
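For anyone who wants to check the effect outside Kodi, here is a rough Python sketch comparing the generic expression with the one that stops at the =tt_ov_pl link. The sample markup is a simplified, hypothetical stand-in for the IMDb page (the real HTML differs), and the helper name extract_outline, the tag stripping, and the use of re.DOTALL are my own assumptions meant only to approximate what the scraper engine does.

```python
import re

# Hypothetical, simplified stand-in for the IMDb page markup (assumption).
SAMPLE_HTML = (
    '<div class="summary_text">\n'
    '    Will, an escaped convict, inadvertently takes refuge in a barn. '
    'April drags ...\n'
    '    <a href="/title/tt1117386/plotsummary?ref_=tt_ov_pl">See full summary</a> »\n'
    '</div>\n'
    '<div class="credit_summary_item">'
)

# The two expressions from the scraper, with XML entities decoded.
GENERIC = r'<div class="summary_text">(.+?)<div\sclass'
SUMMARY_LINK = r'<div class="summary_text">(.+?)<a\shref="(.+?)=tt_ov_pl"'


def extract_outline(html, pattern):
    """Return the first capture group with tags removed and whitespace collapsed."""
    match = re.search(pattern, html, re.DOTALL)
    if match is None:
        return None
    text = re.sub(r'<[^>]+>', '', match.group(1))  # drop any remaining tags
    return ' '.join(text.split())                  # trim and collapse whitespace


generic_result = extract_outline(SAMPLE_HTML, GENERIC)
link_result = extract_outline(SAMPLE_HTML, SUMMARY_LINK)

print(generic_result)  # still carries the link text
print(link_result)     # ends at the truncated outline
```

With this sample, the generic expression keeps the "See full summary »" text, while the link-based one stops right after "April drags ...".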
Regards,
AT2010