bous All American 11215 Posts
I am working on a project for my company...
I want to be able to use PHP to login to a few different websites and then go to specific pages and parse the HTML. What would be the best way to do this? Once I get the HTML i'm good to go.
They both use https and login using a small FORM on the login page. From there, once the session is set i'd like to browse to already known specific webpages.
PHP on Linux 5/1/2007 3:20:26 PM |
BigMan157 no u 103354 Posts
http://us2.php.net/manual/en/ref.curl.php ?
[Edited on May 1, 2007 at 3:46 PM. Reason : then http://us.php.net/dom to parse through it] 5/1/2007 3:43:50 PM |
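For anyone following along, here is a minimal sketch of the cURL approach suggested above: POST the login form, then reuse the same handle (and its session cookie) to fetch a known page. The function names, the form field names, and the example.com URLs are all hypothetical; the real field names have to be read out of each site's login FORM.

```php
<?php
// Hypothetical helper: cURL options shared by the login request and the page request.
function scrape_curl_options(): array
{
    return [
        CURLOPT_RETURNTRANSFER => true, // return the response body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow the redirect many login forms issue
        CURLOPT_COOKIEFILE     => '',   // empty string turns on in-memory cookie handling
        CURLOPT_SSL_VERIFYPEER => true, // keep certificate checks on for https
        CURLOPT_TIMEOUT        => 30,   // seconds
    ];
}

// Hypothetical helper: log in, then fetch one page using the same session.
function fetch_after_login(string $loginUrl, array $fields, string $pageUrl): string
{
    $ch = curl_init();
    curl_setopt_array($ch, scrape_curl_options());

    // Step 1: submit the login form as a normal urlencoded POST.
    curl_setopt($ch, CURLOPT_URL, $loginUrl);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($fields));
    if (curl_exec($ch) === false) {
        throw new RuntimeException('login request failed: ' . curl_error($ch));
    }

    // Step 2: switch back to GET; the handle still holds the session cookie.
    curl_setopt($ch, CURLOPT_URL, $pageUrl);
    curl_setopt($ch, CURLOPT_HTTPGET, true);
    $html = curl_exec($ch);
    if ($html === false) {
        throw new RuntimeException('page request failed: ' . curl_error($ch));
    }
    return $html;
}

// Usage (assumed URLs and field names, not from any real site):
// $html = fetch_after_login(
//     'https://example.com/login',
//     ['username' => 'me', 'password' => 'secret'],
//     'https://example.com/reports/daily'
// );
```

To keep the session between separate script runs, CURLOPT_COOKIEJAR can point at a writable file and CURLOPT_COOKIEFILE at the same path, instead of the in-memory setting used here.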
bous All American 11215 Posts
curl! that's what i was trying to remember.
thanks! 5/1/2007 3:49:03 PM |
qntmfred retired 40726 Posts
i've never used the DOM functions to parse the html. i've always just formed my own regular expressions. i knew there were various html parsing libraries and functions out there, but they just seemed under-developed and kinda boxed you into what you could do. are they that much better now? are they easy to work with, are they flexible enough to correctly handle malformed html?
5/1/2007 4:10:55 PM |
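On the malformed-HTML question, a small sketch may help. PHP's DOMDocument::loadHTML() uses libxml's forgiving HTML parser, which generally repairs unclosed tags rather than rejecting the page. The markup below is made up for illustration, and how well any particular broken page is repaired is not guaranteed.

```php
<?php
// Made-up broken markup: the cells, the row, and the table body are never closed properly.
$html = '<table><tr><td>Price<td>42.50</table>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();            // discard the warnings we chose to ignore

// Collect the text of every cell the parser recovered.
$cells = [];
foreach ($doc->getElementsByTagName('td') as $td) {
    $cells[] = trim($td->textContent);
}

print_r($cells);
```

If the parser's repair is not what you expect on a given page, libxml_get_errors() (called before libxml_clear_errors()) lists what it complained about.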
bous All American 11215 Posts
i'm gonna parse using regexp 5/1/2007 8:01:01 PM |
scud All American 10804 Posts
pagescraping is almost impossible to maintain once you're done... unless this is some sort of one-off tool or something of the sort, I would highly suggest finding another solution to your problem. Perhaps there isn't one, I'm just warning you that it can be a real PITA 5/1/2007 8:03:05 PM |
bous All American 11215 Posts
it will be about 4 websites that don't really change except for certain numbers on the page 5/1/2007 10:33:17 PM |
rynop All American 829 Posts
I'd use PHP's HTTP request extension.
http://www.php.net/manual/en/ref.http.php
and for what you're doing, specifically the HttpRequest class (http://www.php.net/manual/en/http.HttpRequest.php)
gonna need to install it through PECL tho (it's the pecl_http extension). 5/2/2007 4:22:48 PM |
qntmfred retired 40726 Posts
anybody else use DOM packages to parse html? 6/29/2007 10:37:16 AM |
30thAnnZ Suspended 31803 Posts
ha
i was pagescraping espn.com and tsn.ca for hockey scores and news updates for a LONG time, until my server was hitting them about a billion times a minute (didn't know enough php/mysql at the time to cache that shit) and they blocked my traffic
much better ways to accomplish this stuff 6/29/2007 1:30:51 PM |
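The caching mentioned in that post could look roughly like the sketch below: keep the last fetched copy in a file and only hit the remote site again once the copy is older than some number of seconds. The function name, the file naming scheme, and the use of the system temp directory are all assumptions for illustration; a MySQL table would work the same way.

```php
<?php
// Hypothetical helper: return the cached copy if it is younger than $ttl seconds,
// otherwise call $fetch once, store the result, and return it.
function cached_fetch(string $key, int $ttl, callable $fetch, ?string $dir = null): string
{
    $dir  = $dir ?? sys_get_temp_dir();
    $path = $dir . '/scrape_' . md5($key) . '.cache';

    // Serve from the cache file while it is still fresh.
    if (is_file($path) && (time() - filemtime($path)) < $ttl) {
        $cached = file_get_contents($path);
        if ($cached !== false) {
            return $cached;
        }
    }

    // Cache miss or stale copy: do the real (expensive) request exactly once.
    $body = (string) $fetch();
    file_put_contents($path, $body, LOCK_EX);
    return $body;
}

// Usage (the URL is a placeholder): refetch at most once every five minutes.
// $html = cached_fetch('scores', 300, function () {
//     return file_get_contents('https://example.com/scores');
// });
```

With something like this in front of the scraper, page views on your own site read the cache file, and the remote server only sees one request per TTL window.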
qntmfred retired 40726 Posts
such as? 6/29/2007 4:50:45 PM |
qntmfred retired 40726 Posts
^^ did you have a better method?
Quote: "anybody else use PHP DOM packages to parse html?"
7/20/2007 2:07:59 PM |
philihp All American 8349 Posts
a regular expression. 7/20/2007 2:22:27 PM |
qntmfred retired 40726 Posts
yeah, i've always used regular expressions, but the page i'm currently scraping has 7-deep nested tables with no IDs or anything and it's a PITA.
javascript style getElementsByTagName et al are so much easier to use 7/20/2007 2:38:30 PM |
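As a rough illustration of that last point, PHP's DOM extension offers the same getElementsByTagName() call, plus DOMXPath for picking cells out of nested tables that have no IDs. The two-level table below is a made-up stand-in for the 7-deep page, and the XPath expression is just one possible way to select the innermost cells.

```php
<?php
// Made-up markup: a table nested inside a table cell, no ids or classes anywhere.
$html = '<table><tr><td>'
      . '<table><tr><td>Goals</td><td>3</td></tr></table>'
      . '</td></tr></table>';

libxml_use_internal_errors(true); // keep parser warnings quiet
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Same call as in JavaScript: every <table> in the document, at any depth.
$tableCount = $doc->getElementsByTagName('table')->length;

// XPath: only the cells that do not contain another table, i.e. the innermost ones.
$xpath = new DOMXPath($doc);
$leafCells = [];
foreach ($xpath->query('//td[not(.//table)]') as $td) {
    $leafCells[] = trim($td->textContent);
}

print_r($leafCells);
```

Compared with a regular expression, the nesting depth stops mattering here: the query stays the same whether the value sits two tables deep or seven.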