2009-06-07, 17:18
Me again,
There are some issues with HTML decoding. You can test this by (re)scraping "Four Rooms (1995)": Studio ends up containing undecoded HTML entities. (The same happens for Director and Writer in other movies.)
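To show what I mean, here is a minimal standalone sketch. The HTML fragment, the company name and the simplified pattern are all made up for illustration (the real HREF_PATTERN in clsScrapeIMDB.vb is not reproduced here), and it assumes the project references System.Web:

```vb
Imports System
Imports System.Text.RegularExpressions
Imports System.Web

Module DecodeDemo
    Sub Main()
        ' Hypothetical fragment, not copied from a real IMDb page.
        Dim fragment As String = "<a href=""/company/co0000000/"">Foo &amp; Bar Pictures</a>"

        ' Simplified stand-in for HREF_PATTERN (assumption): captures the link text as "name".
        Dim pattern As String = "<a[^>]*>(?<name>.*?)</a>"

        ' Without decoding, the entity is stored as-is.
        Dim raw As String = Regex.Match(fragment, pattern).Groups("name").Value
        Console.WriteLine(raw)

        ' Decoding turns the entity back into the plain character.
        Dim decoded As String = HttpUtility.HtmlDecode(raw)
        Console.WriteLine(decoded)
    End Sub
End Module
```

The first line printed still contains `&amp;`, the second one contains a plain `&`, which is what should end up in the Studio field.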
Diff patch:
Code:
Index: clsScrapeIMDB.vb
===================================================================
--- clsScrapeIMDB.vb (revision 258)
+++ clsScrapeIMDB.vb (working copy)
@@ -481,7 +481,7 @@
'got any director(s) ?
If D > 0 AndAlso Not W <= 0 Then
'get only the first director's name
- Dim rDir As MatchCollection = Regex.Matches(HTML.Substring(D, W - D), HREF_PATTERN)
+ Dim rDir As MatchCollection = Regex.Matches(Web.HttpUtility.HtmlDecode(HTML.Substring(D, W - D)), HREF_PATTERN)
Dim Dir = From M As Match In rDir Where Not M.Groups("name").ToString.Contains("more") _
Select Web.HttpUtility.HtmlDecode(M.Groups("name").ToString)
@@ -593,7 +593,7 @@
If D > 0 Then W = HTML.IndexOf("</ul>", D)
If D > 0 AndAlso W > 0 Then
'only get the first one
- Dim Ps = From P1 As Match In Regex.Matches(HTML.Substring(D, W - D), HREF_PATTERN) _
+ Dim Ps = From P1 As Match In Regex.Matches(Web.HttpUtility.HtmlDecode(HTML.Substring(D, W - D)), HREF_PATTERN) _
Where Not P1.Groups("name").ToString = String.Empty _
Select Studio = P1.Groups("name").ToString Take 1
IMDBMovie.StudioReal = Ps(0).ToString.Trim
@@ -602,7 +602,7 @@
D = HTML.IndexOf("<h5>Company:</h5>")
If D > 0 Then W = HTML.IndexOf("</div>", D)
If D > 0 AndAlso W > 0 Then
- IMDBMovie.StudioReal = Regex.Match(HTML.Substring(D, W - D), HREF_PATTERN).Groups("name").ToString.Trim
+ IMDBMovie.StudioReal = Regex.Match(Web.HttpUtility.HtmlDecode(HTML.Substring(D, W - D)), HREF_PATTERN).Groups("name").ToString.Trim
End If
End If
End If
@@ -616,7 +616,7 @@
D = HTML.IndexOf("<h5>Writer")
If D > 0 Then W = HTML.IndexOf("</div>", D)
If D > 0 AndAlso W > 0 Then
- Dim q = From M As Match In Regex.Matches(HTML.Substring(D, W - D), HREF_PATTERN) _
+ Dim q = From M As Match In Regex.Matches(Web.HttpUtility.HtmlDecode(HTML.Substring(D, W - D)), HREF_PATTERN) _
Where Not M.Groups("name").ToString = "more" _
AndAlso Not M.Groups("name").ToString = "(more)" _
AndAlso Not M.Groups("name").ToString = "(WGA)" _
@@ -636,7 +636,7 @@
D = HTML.IndexOf("Directed by</a></h5>")
If D > 0 Then W = HTML.IndexOf("</body>", D)
If D > 0 AndAlso W > 0 Then
- Dim qTables As MatchCollection = Regex.Matches(HTML.Substring(D, W - D), TABLE_PATTERN)
+ Dim qTables As MatchCollection = Regex.Matches(Web.HttpUtility.HtmlDecode(HTML.Substring(D, W - D)), TABLE_PATTERN)
For Each M As Match In qTables
'Producers
BTW: do you want me to report issues here or on the Google issue tracker?