
[RFC] URI::URL::Detail...

Author Message
moduler...
Posted: Fri Aug 07, 2009 11:12 am
Guest
Hi,

I would like to add a module to CPAN and am looking for
feedback, comments and ideas. The functionality of the module is
detailed below. I am not sure what to call it yet. So far I
have the following options in mind:

WWW::Spider::URI_Detail
URI::Detail
URI::URL::Detail



The module is intended to be used as part of a web crawler although I
have found myself using parts of it elsewhere.




The basic functionality of the proposed module will include:

Given the HTML of a page:
  Find all anchor elements, split into "this domain links" and
  "other domain links".
  Find the title, description and other such metadata.
  Split an anchor tag into the URL, the alt text and the anchor
  text.
Given a potentially relative URL and the current URL, return the
absolute URL.
Given a potentially redirecting URL, return the final destination URL.
Break up a URL into protocol, domain and URI.
"Clean" a URL so it can be used safely as a string (for example in
regular expressions or MySQL insert statements).
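
As a rough illustration of the first and fourth items above (splitting links by domain and resolving relative URLs), here is a minimal sketch using the existing URI module. The helper name partition_links and the example.com URLs are made up for illustration and are not part of the proposed interface. It compares full host names, so www.example.com and example.com would count as different domains.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: resolve each href against the current URL and
# sort the results into same-host and other-host lists.
sub partition_links {
    my ($current_url, @hrefs) = @_;
    my $base      = URI->new($current_url);
    my $base_host = lc $base->host;
    my (@same, @other);

    for my $href (@hrefs) {
        my $abs = URI->new_abs($href, $base);

        # Skip schemes that have no host part (mailto:, javascript:, ...).
        next unless $abs->can('host') && defined $abs->host;

        if (lc $abs->host eq $base_host) {
            push @same, $abs->as_string;
        }
        else {
            push @other, $abs->as_string;
        }
    }
    return (\@same, \@other);
}

my ($same, $other) = partition_links(
    'http://www.example.com/dir/page.html',
    'about.html',
    '/index.html',
    'http://other.example.org/',
    'mailto:me@example.com',
);

print "same:  $_\n" for @$same;
print "other: $_\n" for @$other;
```

Extracting the href values from the HTML in the first place is left out here; that part would need a real HTML parser.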



I intend to make these functions available both independently and
together in an Object Oriented structure.

The OO part would look something like this:

my $b = foo::bar->new({

CURRENT_URL => 'www.site_i_am_crawling.com/page_i_am_crawling.html', ## new() will croak if this is not provided.
FIND_CONTAINED_URLS => 1 , ## Default 1
BREAK_CONTAINED_URLS => 1 , ## Default 1
ABSOLUTE_CONTAINED_URLS => 1 , ## Default 1
CLEAN_URLS => 1 , ## Default 1

CURRENT_URL_HTML => "long string here", ## Optional, will be fetched if this is not provided.

'USER-AGENT' => '' , ## Quoted: a bare key cannot contain a hyphen.
TIMEOUT => 5 ,

DEBUG => 0

});

$b->get_url_info(

## Can reset object parameters here.
## All processing will be performed only when this function is called.

);


my @array_of_urls = $b->get_contained_urls();



The non-OO equivalent:

my @array_of_urls = get_contained_urls( URL => '', HTML => '' );
....



my $all_results = $b->get_all_results();
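
To make the constructor behaviour described in the comments concrete (croak when CURRENT_URL is missing, the four processing flags defaulting to 1), here is one possible sketch. The package name foo::bar is the placeholder from the example above. Treating TIMEOUT => 5 and DEBUG => 0 as defaults is an assumption, since the post only shows them as example values.

```perl
use strict;
use warnings;

package foo::bar;    # placeholder name, as in the example above
use Carp qw(croak);

# Flags default to 1 as stated in the comments; the TIMEOUT and DEBUG
# defaults are assumptions made for this sketch.
my %DEFAULTS = (
    FIND_CONTAINED_URLS     => 1,
    BREAK_CONTAINED_URLS    => 1,
    ABSOLUTE_CONTAINED_URLS => 1,
    CLEAN_URLS              => 1,
    TIMEOUT                 => 5,
    DEBUG                   => 0,
);

sub new {
    my ($class, $args) = @_;
    $args ||= {};
    croak 'CURRENT_URL is required' unless defined $args->{CURRENT_URL};

    # Caller-supplied values override the defaults.
    my $self = { %DEFAULTS, %$args };
    return bless $self, $class;
}

package main;

my $obj = foo::bar->new({ CURRENT_URL => 'http://www.example.com/', DEBUG => 1 });

# A missing CURRENT_URL makes new() die, so the eval returns false.
my $ok  = eval { foo::bar->new({}); 1 };
my $err = $@;

print "CLEAN_URLS default: $obj->{CLEAN_URLS}\n";
print "constructor refused empty args\n" unless $ok;
```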







The following is a list of existing CPAN modules that are similar to
the one proposed here.


WWW::Spider - Far too advanced to be used in this context.


Similar to "Find absolute"
HTML::ResolveLink
URI
URI::URL
URI::ImpliedBase
URI::SmartURI
URI::WithBase


Get the HTML (and find elements)
URI::Title::HTML - No POD, gets titles only.
HTML::HeadParser - Parses only the HEAD.
HTML::TreeBuilder - Overkill?


Clean string (for MySQL and regex)
CGI::Untaint - Indirect use.
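
For the regular expression half of "cleaning", core Perl may already be enough. The sketch below uses the built-in quotemeta and its \Q...\E form; the URL and the text are invented for the example.

```perl
use strict;
use warnings;

my $url  = 'http://www.example.com/page.html?id=1&x=a+b';
my $text = 'See http://www.example.com/page.html?id=1&x=a+b for details';

# Interpolated as-is, the "?" and "+" in the URL act as regex
# quantifiers, so the pattern does not match the literal URL.
my $naive = $text =~ /$url/ ? 1 : 0;

# \Q...\E escapes every metacharacter, so the URL is matched literally.
my $quoted = $text =~ /\Q$url\E/ ? 1 : 0;

# quotemeta() returns the same escaped string for use elsewhere.
my $pattern = quotemeta $url;

print "naive match:  $naive\n";
print "quoted match: $quoted\n";
print "pattern:      $pattern\n";
```

For SQL inserts, DBI placeholders (a ? in the statement, with the URL passed to execute) are the usual alternative to escaping the string by hand.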



Break contained URLs
URI - There are several ways to achieve this including a simple
RegEx. This functionality is included here for completeness.
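
For reference, this is roughly what the URI-based route looks like. It is only a sketch; the hash keys protocol, domain and uri mirror the wording used earlier in this post and are not names that URI itself uses.

```perl
use strict;
use warnings;
use URI;

my $uri = URI->new('http://www.example.com:8080/dir/page.html?id=42#top');

# Map URI's accessors onto the three parts named in this post.
my %parts = (
    protocol => $uri->scheme,        # scheme, e.g. http
    domain   => $uri->host,          # host name without the port
    uri      => $uri->path_query,    # path plus query string, no fragment
);

print "$_ => $parts{$_}\n" for sort keys %parts;
print "port => ", $uri->port, "\n";
```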
 
 