ScraperEdit for XBMC (Java)

beamer145 Offline
Junior Member
Posts: 14
Joined: Jan 2013
Reputation: 0
Post: #16
Sorry for the delay in responding.
I think I had "jre-6u29-windows-x64.exe" before (as that is the only other installer I have in my Java dir, unless this thing automatically updates itself, but then I would have had 7u11 to start with).

Any suggestions about why the editor stops after the 2nd nested regexp?
(Not really critical for me at the moment, as my construction seems to work fine; just wondering.)
UsagiYojimbo Offline
Member
Posts: 91
Joined: Feb 2010
Reputation: 2
Location: Debrecen, Hungary
Post: #17
(2013-01-21 21:30)beamer145 Wrote:  After upgrading to jdk-7u11 the problem seems solved.
JRE 1.6 includes JAXB 2.0 or 2.1, while JRE 1.7 includes JAXB 2.2, and somehow these versions (un)marshal XML differently.
As of 0.2.55 I added a workaround to correctly unmarshal in both JREs. If anyone finds a JRE/JDK that still produces this kind of error, please report it to me!

(2013-01-21 21:30)beamer145 Wrote:  When I select my CreateSearchUrl and run Scrape or Debug, he seems to stop after my second nested regexp. No errors are reported and the dest buffer of this second regexp is not yet filled in (button remains grayed out). I am not sure what is going on, I suppose it should have continued all the way ?
It was caused by a minor bug in logging, I corrected this as well.
There is a log file in the user's home folder (on Win7 it is: c:\Users\<user name>\ScraperEdit.0.log).
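
If you are not on Win7, a small Java sketch like the one below should tell you where to look. It assumes the log always sits directly in the directory Java reports as `user.home` and is always named `ScraperEdit.0.log`, as in the path above; the class and method names are made up for illustration.

```java
import java.io.File;

class LogLocation {
    // Assumption: the log lives directly in the user's home folder
    // under the file name mentioned in the post.
    static File logFile() {
        return new File(System.getProperty("user.home"), "ScraperEdit.0.log");
    }

    public static void main(String[] args) {
        File log = logFile();
        System.out.println(log.getPath() + (log.exists() ? " (found)" : " (not found)"));
    }
}
```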

(2013-01-21 21:30)beamer145 Wrote:  ( This one converts MovieNameInCamelCase_YEAR formatted folders to %20 separated words, it works but unfortunately you need to rebuild XBMC with the unwanted ToLower operation on the file name commented out for it to work, but this is not a problem for your scraper debugger)
Wouldn't it be simpler to do something like this:
Code:
<FunctionName dest="2">...
    <RegExp dest="2" output="%20\1" input="$$1">
        <expression clear="yes" repeat="yes" cs="true">([A-Z][a-z0-9]*)</expression>
    </RegExp>
    <RegExp dest="2+" output="%20\1" input="$$1">
        <expression>_([0-9]+)</expression>
    </RegExp>...
</FunctionName>
E.G: "MovieNameInCamelCase_2013" -> "%20Movie%20Name%20In%20Camel%20Case%202013"
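
To try the idea outside the editor, here is a minimal Java sketch of what the two RegExp elements above are meant to do. It only mimics `repeat="yes"` (apply the output to every match) and the `2+` append, always matches case sensitively, and the class and method names are invented for illustration; it is not the ScraperEdit or XBMC implementation.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CamelCaseToQuery {
    // Same expressions as in the scraper snippet above.
    private static final Pattern WORD = Pattern.compile("([A-Z][a-z0-9]*)");
    private static final Pattern YEAR = Pattern.compile("_([0-9]+)");

    static String convert(String folderName) {
        StringBuilder dest = new StringBuilder();
        // repeat="yes": emit the output "%20\1" once for every match.
        Matcher words = WORD.matcher(folderName);
        while (words.find()) {
            dest.append("%20").append(words.group(1));
        }
        // dest="2+": append to the same buffer; no repeat, so first match only.
        Matcher year = YEAR.matcher(folderName);
        if (year.find()) {
            dest.append("%20").append(year.group(1));
        }
        return dest.toString();
    }

    public static void main(String[] args) {
        // prints %20Movie%20Name%20In%20Camel%20Case%202013
        System.out.println(convert("MovieNameInCamelCase_2013"));
    }
}
```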
(This post was last modified: 2013-01-30 01:26 by UsagiYojimbo.)
beamer145 Offline
Junior Member
Posts: 14
Joined: Jan 2013
Reputation: 0
Post: #18
Quote:It was caused by a minor bug in logging, I corrected this as well.

Thanks, I will check it out later today

Quote:Wouldn't it be simpler to do something like this:
Code:
<FunctionName dest="2">...
    <RegExp dest="2" output="%20\1" input="$$1">
        <expression clear="yes" repeat="yes" cs="true">([A-Z][a-z0-9]*)</expression>
    </RegExp>
    <RegExp dest="2+" output="%20\1" input="$$1">
        <expression>_([0-9]+)</expression>
    </RegExp>...
</FunctionName>
E.G: "MovieNameInCamelCase_2013" -> "%20Movie%20Name%20In%20Camel%20Case%202013"

Yes, for that example it would work.
But for others you start to run into problems:
2001ASpaceOdyssey_1968 -> will not match at all because it does not start with a capital
BackInTheUSSR_1992 -> Back%20In%20The%20U%20S%20S%20R, where you would prefer Back%20In%20The%20USSR
TheNumber23_2007 -> The%20Number23 instead of The%20Number%2023

This required making an OR of regexps, which I probably did not solve in the most elegant way, but it seems to work :)
But I am always interested in a better way; the use of the _'s is super dirty.
UsagiYojimbo Offline
Member
Posts: 91
Joined: Feb 2010
Reputation: 2
Location: Debrecen, Hungary
Post: #19
(2013-01-31 12:07)beamer145 Wrote:  Yes, for that example it would work.
But for others you start to run into problems
2001ASpaceOdyssey_1968 -> will not match at all because it does not start with a capital
BackInTheUSSR_1992 -> Back%20In%20The%20U%20S%20S%20R, you would prefer Back%20In%20The%20USSR
TheNumber23_2007 -> The%20Number23 instead of The%20Number%2023

Which required making an OR of regexp's which I probably did not solve in the most elegant way, but it seems to work :)
But I am always interested in a better way, the use of the _'s is superdirty.

Well, CamelCase and acronyms are an interesting combination. Most CamelCase rulesets require that acronyms are also subject to CamelCase, so only the first letter should be upper case...
It is just hard for a program to correctly detect an all-upper-case acronym in a CamelCase sentence. See: "USAHereIAm"...

As for the first example, it matches, but the number 2001 is left out of the result. Try:
Code:
<FunctionName dest="2">...
    <RegExp dest="2" output="%20\1" input="$$1">
        <expression clear="yes" repeat="yes" cs="true">([0-9]+|[A-Z][a-z]*)</expression>
    </RegExp>
    <RegExp dest="3" output="\2" input="$$2">
        <expression>^(%20)?(.*)$</expression>
    </RegExp>...
</FunctionName>
These RegExps will match correctly, except for the acronyms. They also match the year, so you should remove the year beforehand...
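
A rough Java equivalent of these two steps, in case someone wants to check a few folder names quickly. It is only a sketch: the class and method names are hypothetical, matching is always case sensitive, and the year is assumed to have been removed from the input already.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CamelCaseWithDigits {
    // First RegExp: a run of digits OR a capital followed by lower-case letters.
    private static final Pattern TOKEN = Pattern.compile("([0-9]+|[A-Z][a-z]*)");
    // Second RegExp: drop the leading %20 produced by the first step.
    private static final Pattern LEADING = Pattern.compile("^(%20)?(.*)$");

    static String convert(String name) {
        StringBuilder buffer2 = new StringBuilder();
        Matcher tokens = TOKEN.matcher(name);
        while (tokens.find()) {                 // repeat="yes"
            buffer2.append("%20").append(tokens.group(1));
        }
        Matcher strip = LEADING.matcher(buffer2);
        return strip.matches() ? strip.group(2) : buffer2.toString();
    }

    public static void main(String[] args) {
        System.out.println(convert("2001ASpaceOdyssey")); // prints 2001%20A%20Space%20Odyssey
        System.out.println(convert("TheNumber23"));       // prints The%20Number%2023
        System.out.println(convert("USAHereIAm"));        // prints U%20S%20A%20Here%20I%20Am
    }
}
```

The last call shows the acronym problem: every capital starts a new word, so USA is split into single letters.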
beamer145 Offline
Junior Member
Posts: 14
Joined: Jan 2013
Reputation: 0
Post: #20
Hi,

I wanted to show that my dirty construction is able to deal with USAHereIAm, but I ran into a strange issue.
Your regexp tester and debugger seem to give different results. The regexp tester is OK but the debugger is not.


http://imgur.com/jb221bg

(Slot 6 contains %20USAHereIAm%20%20, the same value I entered in the regexp tester. If I enter the values I get in the regexp tester in the output string, then I do not get what is shown in slot 7 of the debugger.)

EDIT: Ah, I think I found the problem. Your tool does not show the case sensitive flag as set, while I have cs="yes" in my regexp.
And in your previous post you use cs="true", so I am assuming that is what you are expecting.
If I look at the XBMC sources though, it says:

const char* sensitive = pExpression->Attribute("cs");
if (sensitive)
  if (stricmp(sensitive,"yes") == 0)
    bInsensitive=false; // match case sensitive

Maybe it was different in an old version, but now it is "yes" instead of "true" that you should be looking for.
(And apparently the regexp tester is always case sensitive, even if the flag of the expression you are testing is not set -> confusing?)
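
To make the point concrete, this is roughly how I would expect the same check to look on the Java side. It is a sketch only: the class and method names are mine, and it assumes (as the XBMC snippet above suggests) that matching is case insensitive unless cs is exactly "yes", compared without regard to case.

```java
import java.util.regex.Pattern;

class CaseFlag {
    // Mirrors the XBMC logic quoted above: insensitive by default,
    // sensitive only when the cs attribute equals "yes" (any letter case).
    static Pattern compile(String expression, String csAttribute) {
        boolean insensitive = true;
        if (csAttribute != null && csAttribute.equalsIgnoreCase("yes")) {
            insensitive = false;
        }
        return Pattern.compile(expression, insensitive ? Pattern.CASE_INSENSITIVE : 0);
    }

    public static void main(String[] args) {
        System.out.println(compile("[A-Z]+", "yes").matcher("usa").find());  // prints false
        System.out.println(compile("[A-Z]+", "true").matcher("usa").find()); // prints true
        System.out.println(compile("[A-Z]+", null).matcher("usa").find());   // prints true
    }
}
```

With this rule cs="true" silently behaves like no flag at all, which would explain a difference between a tester and a debugger that read the attribute differently.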
(This post was last modified: 2013-02-02 20:08 by beamer145.)
UsagiYojimbo Offline
Member
Posts: 91
Joined: Feb 2010
Reputation: 2
Location: Debrecen, Hungary
Post: #21
(2013-02-02 19:58)beamer145 Wrote:  EDIT: Ah, I think I found the problem. Your tool does not show the case sensitive flag as set, while I have cs="yes" in my regexp.
And in your previous post, you use cs="true", so I am assuming that is what you are expecting.
My version of the undocumented features (such as the cs attribute) is based on the output of Scraper Editor.
It was interesting that while all other attributes used "yes", this one was set to "true"... I thought it was deliberate.

I have made up my mind to read through the Scraper Editor thread to find new hints and features.
The first I have found is the include tag... This will be done by loading functions from the included libraries just before scraping/debugging starts, and dropping them after one closes the debugger.
Marx1 Offline
Fan
Posts: 367
Joined: Jan 2011
Reputation: 3
Post: #22
I've tried opening this scraper
https://filmweb-lite.googlecode.com/svn/...b-lite.xml
but no luck. No error, nothing at all, just as if I had not clicked "open". I use ScraperEdit-0.1.2-55.zip on Linux with Java 1.6.
UsagiYojimbo Offline
Member
Posts: 91
Joined: Feb 2010
Reputation: 2
Location: Debrecen, Hungary
Post: #23
(2013-02-06 15:01)Marx1 Wrote:  I've tried opening this scraper
https://filmweb-lite.googlecode.com/svn/...b-lite.xml
but no luck. No error, completely nothing, just like I couldn't click "open" at all. I use ScraperEdit-0.1.2-55.zip on Linux and Java 1.6.

There was an error message in the log file. (There is a log file in the user's home folder; on Win7 it is: c:\Users\<user name>\ScraperEdit.0.log.)
The next release will pop up a message box on this kind of error. Thanks for spotting it.

As for the scraper you are trying to load, it contains invalid XML. Here is the error message from the log file:
Code:
org.xml.sax.SAXParseException; systemId: file:/T:/Scrapers/filmweb-lite.xml; lineNumber: 216; columnNumber: 185; The reference to entity "language" must end with the ';' delimiter.
The line in question is:
Quote: <RegExp conditional="sets" input="$$14" output="&lt;url function=&quot;SETS_TMDB&quot;&gt;http://api.themoviedb.org/3/movie/\1?api_key=1009b5cde25c7b0692d51a7db6e49cbd&language=pl&lt;/url&gt;" dest="8+">
As you can see, there is an ampersand (&) that is not escaped as &amp;.
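
You can reproduce this kind of failure with a few lines of plain Java and the XML parser that ships with the JRE. This is only a sketch with a made-up one-line document and invented class and method names, not the code ScraperEdit runs; the exact wording of the error depends on the parser.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class AmpersandCheck {
    /** Returns null if the text is well-formed XML, otherwise the parser's message. */
    static String parseError(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return null;
        } catch (SAXParseException e) {
            return e.getMessage();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical attribute value with a bare '&' in the URL.
        String bad = "<RegExp output=\"http://example.invalid/movie?api_key=KEY&language=pl\"/>";
        String good = bad.replace("&language", "&amp;language");
        System.out.println("bad:  " + parseError(bad));
        System.out.println("good: " + parseError(good));
    }
}
```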
Marx1 Offline
Fan
Posts: 367
Joined: Jan 2011
Reputation: 3
Post: #24
It's a working scraper anyway :) so XBMC probably uses a less strict library to parse XML.
I am thinking about an NFO generator and art downloader for movies. It would be possible to freely exchange scrapers (taking them from XBMC). Do you think I can use your library to do that?
(This post was last modified: 2013-02-06 22:58 by Marx1.)
UsagiYojimbo Offline
Member
Posts: 91
Joined: Feb 2010
Reputation: 2
Location: Debrecen, Hungary
Post: #25
(2013-02-06 22:53)Marx1 Wrote:  It's working scraper anyway :) so probably xbmc uses less strict library to parse xml.
Yeah, XBMC uses somewhat different XML and RegExp engines.
I am using JAXB (Java Architecture for XML Binding, built into the Java runtime) and Java's built-in RegExp engine.
BTW, many scrapers have this kind of error...

(2013-02-06 22:53)Marx1 Wrote:  I think about nfo generator and art downloader for movies. It would be possible to freely exchange scraper (taking it from xbmc). Do you think I can use your library to do that?
This whole thing is under heavy development and still in its alpha/beta state. You are free to use it, as long as you can call Java code from your project.
However, please consider the license (Creative Commons Attribution-ShareAlike 3.0).
Marx1 Offline
Fan
Posts: 367
Joined: Jan 2011
Reputation: 3
Post: #26
I'm trying to discuss it with another developer of a media manager in Java: http://forum.xbmc.org/showthread.php?pid...pid1238230
It would be nice if you could refactor the code, separating the GUI and editing code from the launching part to form a scraper library. Such a library would be able to open a scraper file (from a stream), execute a search and return what it finds in a usable/readable form.
Anyway, I tried to use the universal scraper (some error while executing), tvdb and filmweb (nothing found; I was searching for Avatar, later Friends).
(This post was last modified: 2013-02-07 22:38 by Marx1.)
UsagiYojimbo Offline
Member
Posts: 91
Joined: Feb 2010
Reputation: 2
Location: Debrecen, Hungary
Post: #27
(2013-02-07 22:35)Marx1 Wrote:  It would be nice if you could refactor code separating GUI and editing code from launching part forming scraper library. Such library would be able to open scraper file (from stream), execute search and return what it find in usable/readable form.

You may have noticed that there is a JAR archive called XbmcAddons.jar. It contains the JAXB bindings and most of the scraping engine.
Please check out the sources from the SVN repository.
Marx1 Offline
Fan
Posts: 367
Joined: Jan 2011
Reputation: 3
Post: #28
Your repository has a strange configuration. Why isn't "xml" in the "src" subdirectory in XbmcAddons?

I've run tests with the built-in XBMC tmdb scraper and a tmdb HTML page. It works, however it doesn't seem to scrape properly; here is scrape-result.xml:

Code:
<details><title></title><chain function="GetTMDBTitleByIdChain"></chain><originaltitle></originaltitle>
<url function="ParseFallbackTMDBRuntime" cache="tmdb-en-.json">http://api.themoviedb.org/3/movie/?api_key=57983e31fb435df4df77afb854740ea9&amp;language=en</url>
<runtime>



  The Birds &mdash; The Movie Database







      var language = "en";
      var locale = "us";
      var image_url = "http://d3a8mw37cqal2z.cloudfront.net/images/";
      var cdn_url = "http://d3gtl9l2a4fn1j.cloudfront.net/";
      var cdn_path = "t/p/"


(here is almost full page)


          
0</runtime>
<chain function="GetTMDBStudioByIdChain"></chain>
<chain function="GetTMDBCountryByIdChain"></chain>
<chain function="GetTMDBDirectorsByIdChain"></chain>
<chain function="GetTMDBWitersByIdChain"></chain>
<chain function="GetTMDBCertificationsByIdChain"></chain>
<chain function="GetTMDBSetByIdChain"></chain>
<chain function="GetTMDBPlotByIdChain"></chain>
<chain function="GetTMDBTaglineByIdChain"></chain>
<chain function="GetTMDBCastByIdChain"></chain>
<chain function="GetTMDBGenresByIdChain"></chain>
<chain function="GetTMDBThumbsByIdChain"></chain>
<chain function="GetTMDBFanartByIdChain"></chain>
<chain function="GetTMDBTrailerByIdChain"></chain>
</details>
(This post was last modified: 2013-02-08 11:22 by Marx1.)
UsagiYojimbo Offline
Member
Posts: 91
Joined: Feb 2010
Reputation: 2
Location: Debrecen, Hungary
Post: #29
(2013-02-08 10:17)Marx1 Wrote:  Your repository has strange configuration. Why "xml" isn't in "src" subdirectory in XbmcAddons?
That is because the project has two different source folders, src and xml: the first holds the engine and some tools, while the latter holds the JAXB bindings. If you open it in NetBeans (I use v7.2), you will see what I mean.

(2013-02-08 10:17)Marx1 Wrote:  I've run tests with built-in XBMC tmdb scraper and tmdb html page. It works, hovewer it doesn't seem to scrape properly, here is scrape-result.xml:
The scraper engine currently does not run the functions referenced in the url tags of the result. It is a future feature.
(This post was last modified: 2013-02-09 08:20 by UsagiYojimbo.)
UsagiYojimbo Offline
Member
Posts: 91
Joined: Feb 2010
Reputation: 2
Location: Debrecen, Hungary
Post: #30
(2013-02-08 10:17)Marx1 Wrote:  I've run tests with built-in XBMC tmdb scraper and tmdb html page. It works, hovewer it doesn't seem to scrape properly, here is scrape-result.xml:
Code:
...
<details><title></title><chain function="GetTMDBTitleByIdChain"></chain><originaltitle></originaltitle>
<url function="ParseFallbackTMDBRuntime" cache="tmdb-en-.json">http://api.themoviedb.org/3/movie/?api_key=57983e31fb435df4df77afb854740ea9&amp;language=en</url>
<runtime>

  The Birds &mdash; The Movie Database
...

I just looked at this example this morning, and I have to say it would be hard to get this through JAXB...
JAXB expects its input to be a complete and correct (and usually Schema-described) XML document. XBMC scrapers do not really comply with the Schema-described part...

For instance, this runtime tag is not declared anywhere, so JAXB will choke on it.
Also, the entity &mdash; is an HTML entity and is invalid in XML unless it is declared (XML does this in a DTD). As there is no such declaration, this entity stays undefined.
(One could just quote it as &amp;mdash; so that it passes as XML, but this is something XBMC and scraper developers do not seem to bother with...)
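
If someone wants to experiment with that quoting, a blunt text pre-pass along these lines might do before handing the result to a strict parser. It is a hypothetical helper, not part of ScraperEdit: it escapes every ampersand that does not start one of the five predefined XML entities or a numeric character reference, and it assumes the text contains no CDATA sections or DTD-declared entities that must stay untouched.

```java
import java.util.regex.Pattern;

class EntityQuoter {
    // '&' NOT followed by amp; lt; gt; quot; apos; or a numeric reference (&#123; / &#x1F;).
    private static final Pattern BARE_AMP =
            Pattern.compile("&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)");

    static String quote(String text) {
        return BARE_AMP.matcher(text).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        // prints The Birds &amp;mdash; The Movie Database
        System.out.println(quote("The Birds &mdash; The Movie Database"));
        // prints api_key=KEY&amp;language=pl
        System.out.println(quote("api_key=KEY&language=pl"));
    }
}
```

Note that this only makes the text well-formed; the HTML entity arrives as the literal text &mdash; and is not turned into a dash.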