[RFC] URI::URL::Detail...

moduler... (Guest)
Posted: Fri Aug 07, 2009 11:12 am

Hi,

I am hoping to add a module to CPAN and would like some feedback,
comments and ideas. The functionality of the module is detailed below.
I am not sure what to call it. So far I have the following options in
mind:

    WWW::Spider::URI_Detail
    URI::Detail
    URI::URL::Detail

The module is intended to be used as part of a web crawler, although I
have found myself using parts of it elsewhere.

The basic functionality of the proposed module will include:

  - Given the HTML of a page:
      - Find all anchor elements, broken into "this domain links" and
        "other domain links".
      - Find the title, description and other such metadata.
      - Split up an anchor tag into the URL, the alt text and the
        anchor text.
  - Given a potentially relative URL and the current URL, return the
    absolute URL.
  - Given a potentially redirecting URL, return the final destination
    URL.
  - Break up a URL into protocol, domain and URI.
  - "Clean" a URL so it can be used as a string (in, say, regular
    expressions or MySQL insert statements).
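
To make the first few items concrete, here is a rough sketch of how link finding, absolute-URL resolution and URL splitting could be built on the existing URI and HTML::LinkExtor modules. The function names contained_urls and break_url, the hash keys, and the example.com addresses are placeholders invented for illustration; they are not a settled interface for the proposed module.

```perl
use strict;
use warnings;
use URI;
use HTML::LinkExtor;

# Hypothetical helper: collect the <a href> links in $html as absolute URLs,
# split into links on the same host as $current_url and links elsewhere.
sub contained_urls {
    my ($html, $current_url) = @_;
    my $base_host = URI->new($current_url)->host;
    my (@this_domain, @other_domain);

    my $parser = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        return unless $tag eq 'a' && defined $attr{href};

        # Resolve a possibly relative link against the page URL.
        my $uri  = URI->new_abs($attr{href}, $current_url)->canonical;
        my $host = $uri->can('host') ? $uri->host : undef;
        return unless defined $host;    # skips schemes without a host, e.g. mailto:

        if (lc($host) eq lc($base_host)) {
            push @this_domain, $uri->as_string;
        }
        else {
            push @other_domain, $uri->as_string;
        }
    });
    $parser->parse($html);
    $parser->eof;

    return (\@this_domain, \@other_domain);
}

# Hypothetical helper: split an absolute URL into protocol, domain and the rest.
sub break_url {
    my ($url) = @_;
    my $uri = URI->new($url);
    return {
        protocol => $uri->scheme,
        domain   => $uri->host,
        uri      => $uri->path_query,
    };
}

my $html = <<'HTML';
<a href="/about.html">About</a>
<a href="../news/today.html?id=7">News</a>
<a href="http://other.example.org/">Elsewhere</a>
<a href="mailto:someone@example.com">Mail</a>
HTML

my ($this, $other) = contained_urls($html, 'http://www.example.com/dir/page.html');
print "this domain:  $_\n" for @$this;
print "other domain: $_\n" for @$other;

my $parts = break_url($this->[1]);
print "$parts->{protocol} | $parts->{domain} | $parts->{uri}\n";
```

Comparing hosts exactly is only one possible definition of "this domain"; a crawler may prefer to treat subdomains of the same site as local.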
I intend to make these functions available both independently and
together in an object-oriented structure.
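
As an example of one such standalone function, the "clean for regular expressions" step might be little more than a wrapper around Perl's built-in quotemeta. The name clean_for_regex is hypothetical. For the MySQL case, DBI placeholders are generally a safer route than escaping the string by hand, so that case is not sketched here.

```perl
use strict;
use warnings;

# Hypothetical helper: escape a URL so that it matches literally in a regex.
sub clean_for_regex {
    my ($url) = @_;
    return quotemeta $url;
}

my $url  = 'http://www.example.com/a.b?c=d+e';
my $text = 'see http://www.example.com/a.b?c=d+e for details';
my $safe = clean_for_regex($url);

# Unescaped, "?" and "+" act as quantifiers, so the raw URL fails to match itself.
print "raw URL matches:     ", ($text =~ /$url/  ? 'yes' : 'no'), "\n";
print "cleaned URL matches: ", ($text =~ /$safe/ ? 'yes' : 'no'), "\n";
```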
The OO part would look something like this:

    my $b = foo::bar->new({
        CURRENT_URL => 'www.site_i_am_crawling.com/page_i_am_crawling.html', ## new() will croak if this is not provided.
        FIND_CONTAINED_URLS     => 1,  ## Default 1
        BREAK_CONTAINED_URLS    => 1,  ## Default 1
        ABSOLUTE_CONTAINED_URLS => 1,  ## Default 1
        CLEAN_URLS              => 1,  ## Default 1
        CURRENT_URL_HTML        => "long string here",  ## Optional, will be extracted if this is not provided.
        'USER-AGENT'            => '',  ## Quoted because a bareword key cannot contain a hyphen.
        TIMEOUT                 => 5,
        DEBUG                   => 0,
    });

    $b->get_url_info(
        ## Can reset object parameters here.
        ## All processing will be performed only when this function is called.
    );

    my @array_of_urls = $b->get_contained_urls();

Also, for non-OO use:

    my @array_of_urls = get_contained_urls( URL => '', HTML => '' );
    ....

    my $all_results = $b->get_all_results();
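
A constructor with the behaviour described in the comments above (croak when CURRENT_URL is missing, fall back to defaults otherwise) could be sketched roughly as follows. The package name My::URLDetail is a stand-in, since the module name is still open, and treating TIMEOUT => 5 and DEBUG => 0 as defaults is an assumption based on the values shown in the listing.

```perl
package My::URLDetail;    # hypothetical name; the post has not settled on one
use strict;
use warnings;
use Carp qw(croak);

# Defaults as stated in the listing's comments; TIMEOUT and DEBUG are assumed.
my %DEFAULTS = (
    FIND_CONTAINED_URLS     => 1,
    BREAK_CONTAINED_URLS    => 1,
    ABSOLUTE_CONTAINED_URLS => 1,
    CLEAN_URLS              => 1,
    TIMEOUT                 => 5,
    DEBUG                   => 0,
);

sub new {
    my ($class, $args) = @_;
    $args ||= {};
    croak 'CURRENT_URL is required' unless defined $args->{CURRENT_URL};

    # Caller-supplied values override the defaults.
    my $self = { %DEFAULTS, %$args };
    return bless $self, $class;
}

package main;

my $obj = My::URLDetail->new({
    CURRENT_URL => 'http://www.example.com/page.html',
    CLEAN_URLS  => 0,
});
print "TIMEOUT=$obj->{TIMEOUT} CLEAN_URLS=$obj->{CLEAN_URLS}\n";

# Omitting CURRENT_URL makes new() croak; eval traps the exception.
my $ok = eval { My::URLDetail->new({}); 1 };
print "new() croaked: $@" unless $ok;
```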
The following is a list of existing CPAN modules that are similar to
the one proposed here.

  WWW::Spider - Far too advanced to be used in this context.

  Similar to "Find absolute":
    HTML::ResolveLink
    URI
    URI::URL
    URI::ImpliedBase
    URI::SmartURI
    URI::WithBase

  Get the HTML (and find elements):
    URI::Title::HTML - No POD, gets titles only.
    HTML::HeadParser - Parses only the HEAD.
    HTML::TreeBuilder - Overkill?

  Clean string (for MySQL and regex):
    CGI::Untaint - Indirect use.

  Break contained URLs:
    URI - There are several ways to achieve this, including a simple
          regex. This functionality is included here for completeness.
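
For the "simple regex" route, one candidate is the pattern given in Appendix B of RFC 3986, shown here with non-capturing groups. This is a sketch under the assumption that splitting into scheme, authority, path, query and fragment is what is wanted; the function name and hash keys are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: split a URL using the RFC 3986 Appendix B pattern.
# Components that are absent from the URL come back as undef.
sub break_url_regex {
    my ($url) = @_;
    my ($scheme, $authority, $path, $query, $fragment) =
        $url =~ m{^(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?\z}s
        or return;
    return {
        protocol => $scheme,
        domain   => $authority,
        path     => $path,
        query    => $query,
        fragment => $fragment,
    };
}

my $parts = break_url_regex('http://www.example.com/dir/page.html?x=1#top');
for my $key (qw(protocol domain path query fragment)) {
    my $value = defined $parts->{$key} ? $parts->{$key} : '(none)';
    print "$key: $value\n";
}
```

The pattern only splits the string; it does not validate or normalise anything, which is part of what the URI module adds on top.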