Extracted function comments Sun Jun 16 10:13:56 2002

=item AdminVersion

=cut

=item Append

=cut

=item Assert

Usage:

    #&Assert( conditional expression );

Assert is a useful debugging tool. Its one argument is a conditional that should be true in every possible case, as long as you've written your code correctly. If the argument turns out to be false at runtime, then Assert will print an error message in very large, bold letters.

Often used to audit function input and output values. Possibly these Assert calls should be stripped or disabled in public releases.

=cut

=item Authenticate

=cut

=item BuildIndex

Usage:

    &BuildIndex();

BuildIndex completely rebuilds the index for a local realm. Because the webpages in local realms are readily accessible, this function tends to process huge data sets quickly. It is self-restartable through a meta-refresh; state information is stored in the $start_pos parameter and working data is stored either in the database or the index_file.working_copy file.

For file-based indexes, all new data is written to index_file.working_copy. When the process is finished, possibly after several browser requests, the original index_file is deleted and index_file.working_copy is renamed over the top of it. Thus, users are able to perform searches on the intact index_file while the BuildIndex process is in progress. In addition, it is possible to safely abandon the BuildIndex process.

For SQL-based indexes, we don't have that concept of a temporary storage area. Instead, each record is updated as the webpage is encountered. At the end of the BuildIndex process, if we get there, we delete all records whose lastindex time is older than "start_time". The only records older than "start_time" are those that were not detected by GetFilesByDirEx, or that were excluded for other reasons.

This is an interactive function; errors and other status messages are shown to the user by printing HTML.
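The file-based flow described above can be sketched roughly as follows. This is an illustration only, not the actual BuildIndex code: the helper name rebuild_via_working_copy, the record lines, and the scratch directory are assumptions made for the example. Only the index_file.working_copy naming and the delete-then-rename step come from the description.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper (not part of FDSE): write every record to
# "$index_file.working_copy", then swap it in place of the live index.
# Returns an empty string on success or an error message on failure.
sub rebuild_via_working_copy {
    my ($index_file, @records) = @_;
    my $working = "$index_file.working_copy";

    open(my $out, '>', $working) or return "cannot write $working: $!";
    print {$out} "$_\n" for @records;
    close($out) or return "cannot close $working: $!";

    # Searches keep reading the intact index_file until this point.
    # As described above: delete the original, then rename the copy over it.
    if (-e $index_file) {
        unlink($index_file) or return "cannot delete $index_file: $!";
    }
    rename($working, $index_file) or return "cannot rename $working: $!";
    return '';
}

# Demo in a scratch directory; the record lines are made up.
my $dir        = tempdir(CLEANUP => 1);
my $index_file = "$dir/index_file";

open(my $old, '>', $index_file) or die "setup failed: $!";
print {$old} "old record\n";
close($old);

my $err = rebuild_via_working_copy($index_file, 'new record 1', 'new record 2');
print $err ? "Error: $err\n" : "index rebuilt\n";
```

If the process is abandoned before the swap, index_file is never touched, which is why the rebuild can be cancelled safely.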
=cut

=item Cancel

=cut

=item Capitalize

Usage:

    my $cap_string = &Capitalize($string);

Capitalizes English-language strings.

=cut

=item CheckEmail

Usage:

    my $err = &CheckEmail( $address );
    if ($err) {
        print "Error: $err.\n";
    }

Checks whether the argument is a valid email address or not:

    address not blank
    contains text @ text
    text following @ is a valid hostname (can be resolved)

Based on Ian Dobson's CheckEmail function.

=cut

=item Close

=cut

=item CompressStrip

Process the HTML text and various subfields like Title and Description.

=cut

=item Crawler_new

Usage:

    my %response = $crawler->webrequest(
        'page' => 'http://www.xav.com/scripts/',
        'limit' => 'http://www.xav.com/',
    );
    if ($response{'err'}) {
        print "Error: $response{'err'}\n";
        exit;
    }
    print "The HTML text of this web page is:\n\n";
    print $response{'text'};

=cut

=item DeleteFromPending

Usage:

    my ($err, $delcount) = &DeleteFromPending( $realm, \@urls );

=cut

=item FD_Rules_new

Initializes the object that manages system settings.

=cut

=item FlockEx

Usage:

    if (&FlockEx( $p_filehandle, 8 )) {
        # okay
    }

Abstraction layer to protect non-flock systems.

=cut

=item FormatDateTime

=cut

=item FormatNumber

Usage:

    my $num_str = &FormatNumber( $expression, $decimal_places, $include_leading_digit, $use_parens_for_negative, $group_digits, $euro_style );

Arguments:

$expression: Required. Expression to be formatted.

$decimal_places: Optional. Numeric value indicating how many places to the right of the decimal are displayed. Note: truncates $expression to $decimal_places, does not round.

$include_leading_digit: Optional. Boolean that indicates whether or not a leading zero is displayed for fractional values.

$use_parens_for_negative: Optional. Boolean that indicates whether or not to place negative values within parentheses. Style is used for outbound formatting only; inbound parsing always uses "-" for negative values (Perl's internal format).

$group_digits: Optional. Boolean that indicates whether or not numbers are grouped using the comma.

$euro_style: Optional. If 1, then "." separates thousands and "," separates decimal, i.e. "800.234,24" instead of "800,234.24". Style is used for outbound formatting only; inbound parsing always uses "." for the decimal separator (Perl's internal format).

Prototyped to match Microsoft's FormatNumber function for vbscript/jscript, with the limitation of not knowing about default settings. Microsoft specification at http://msdn.microsoft.com/scripting/vbscript/doc/vsfctFormatNumber.htm or from http://msdn.microsoft.com/scripting/.
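To make the truncation and euro-style rules concrete, here is a small sketch. It is not the FDSE implementation: format_number_sketch is a hypothetical name, the default of 2 decimal places is an assumption, and only plain decimal strings are handled.

```perl
use strict;
use warnings;

# Hypothetical re-creation of the documented behaviour; not the FDSE source.
sub format_number_sketch {
    my ($expr, $places, $leading_digit, $parens, $group, $euro) = @_;

    # Assumption: anything that is not a plain decimal number counts as 0.
    $expr   = 0 unless defined $expr && $expr =~ /^-?(?:\d+\.?\d*|\.\d+)$/;
    $places = 2 unless defined $places;    # assumed default

    my $negative = ($expr < 0) ? 1 : 0;
    (my $digits = $expr) =~ s/^-//;
    my ($int, $frac) = split /\./, $digits, 2;
    $int  = '0' unless defined $int && length $int;
    $frac = ''  unless defined $frac;
    $int =~ s/^0+(?=\d)//;                 # drop redundant leading zeros

    # Truncate, never round: keep the first $places digits, zero-padded.
    $frac = substr($frac . ('0' x $places), 0, $places);

    my ($thousands, $decimal) = $euro ? ('.', ',') : (',', '.');
    if ($group) {
        1 while $int =~ s/^(\d+)(\d{3})/$1$thousands$2/;
    }
    $int = '' if !$leading_digit && $int eq '0' && $places > 0;

    my $out = $int . ($places > 0 ? $decimal . $frac : '');
    return $negative ? ($parens ? "($out)" : "-$out") : $out;
}

print format_number_sketch('800234.249', 2, 1, 0, 1, 1), "\n";   # 800.234,24
print format_number_sketch('800234.249', 2, 1, 0, 1, 0), "\n";   # 800,234.24
print format_number_sketch('-0.5',       1, 0, 1, 0, 0), "\n";   # (.5)
print format_number_sketch('abc',        2, 1, 0, 0, 0), "\n";   # 0.00
```

Working on the digit string rather than calling sprintf is a deliberate choice here, because sprintf rounds and the description says the value is truncated.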
Error handling: if $expression is not numeric, it is treated as 0.

=cut

=item GetAbsoluteAddress

Usage:

    my ($absolute_url) = &GetAbsoluteAddress($link_fragment, $full_url_context);

For example, you spider "http://xav.com/foo/bar/index.html" and find a link to "../nikken.txt". You run:

    print GetAbsoluteAddress("../nikken.txt", "http://xav.com/foo/bar/index.html");
    ^D
    http://xav.com/foo/nikken.txt

=cut

=item GetCrawlList

Usage:

    my @list = ();
    my $count = 0;
    my $age = $FORM{'StartTime'};
    if ($FORM{'DaysPast'}) {
        $age -= (86400 * $FORM{'DaysPast'});
    }
    my $err = &GetCrawlList( $realm, $age, $max_list_size, \@list, \$count );

Retrieves a @list of all web pages in the '$realm' realm that are older than $age. $count is the size that @list would be if no limits were imposed. @list will actually contain between 0 and $max_list_size elements.

The max_list_size option is available to save memory.

=cut

=item GetFiles_new

Used to enumerate all files and folders in a certain directory. Designed to use very little memory. Files are always returned in alphabetic order, which allows certain optimizations to be made.

Usage:

    my $fr = &fdse_filter_rules_new();
    my $gf = &GetFiles_new();
    $err = $gf->create_file_list(
        'base_dir' => $base_dir,
        'base_url' => $base_url,
        'fr' => \$fr,
        'tempfile' => "$file.temp",
        'no_older_than' => $num_seconds,
    );
    my $count = $gf->{'count'};
    $gf->resume_file_position( $start_pos );
    while (1) {
        my ($lastmodt, $size, $fullfile, $basefile, $url) = $gf->get_next_file();
    }
    $gf->quit(); # kills temp file

no_older_than is the number of seconds for the maximum tolerable age of the cache file. If the file exists and is older than this, then a new file will be created.

=cut

=item LoadRules

Usage:

    $err = &LoadRules();

Wrapper around the FD_Rules object and its own loadrules() method. Adds additional processing. Writes directly to the global %Rules hash. Writes some derived data to %const as well.

=cut

=item LockFile_get_read_access

Gets read access to the file.
Handles the "create_if_needed" logic. Tries to restore a stale "working_copy" file if no copy of the original file exists.

=cut

=item LockFile_new

This package provides an object-oriented approach to file I/O, with support for file locking and standardized error handling.

Usage:

    my ($err, $obj, $p_rhandle, $p_whandle) = ();
    Err: {
        $obj = &LockFile_new(
            'create_if_needed' => 1,
        );
        ($err, $p_rhandle) = $obj->Read( $file );
        next Err if ($err);
        while ($_ = readline($$p_rhandle)) {
            print $_;
        }
        $err = $obj->Close();
        next Err if ($err);
        last Err;
    }
    continue {
        print "Error: $err.\n";
    }

=cut

=item Merge

=cut

=item ParseRobotFile

Usage:

    my @forbidden_paths = &ParseRobotFile( $RobotText, $my_user_agent );

Accepts the text of a robots.txt file, and the string name of the current HTTP user-agent. Parses through the file and returns an array of all forbidden paths that apply to the current user-agent.

=cut

=item PrintOrderedHash

Usage:

    my $err = &PrintOrderedHash( \%hash, $by_value, $ascii_sort, $ascending, $date_map );

=cut

=item PrintTemplate

Usage:

    &PrintTemplate( $b_return_as_string, 'tips.html', 'german', \%replace_values, \%visited, \%cache );

See "admin_help.html" for extensive documentation on this function, its limitations, its failure scenarios, etc.

=cut

=item Read

=cut

=item ReadFile

Usage:

    my ($err, $text) = &ReadFile($file);
    if ($err) {
        print "Error: $err\n";
    }
    else {
        print "File '$file' contains:\n";
        print "$text\n";
    }

Easy-to-call file-reading function. Calls the super-robust LockFile object under the hood, which is a relatively expensive call.

This is done for operations which read data from the file system into memory, and then save data back to the file system. For these operations, we cannot afford to have a single failed read operation cause permanent data loss. An example of a read failure would be "file locked for writing by another process".

=cut

=item ReadFileL

Usage:

    ($err, $text) = &ReadFileL( $filename );

Returns the text of the given file, or an error. Uses direct disk I/O rather than the more expensive LockFile package.

=cut

=item ReadInput

Reads CGI form input, or command-line parameters. Initializes %$p_FORM and assigns values.

Usage:

    &ReadInput();

Abstracts the source of the commands (can be query string, standard input, or command-line parameters). Automatically updates the global hash %FORM.

=cut

=item ReadWrite

=cut

=item Resume

=cut

=item SaveLinksToFileEx

Usage:

    my $err = &SaveLinksToFileEx(
        $p_realm_data,
        $ref_crawler_results,
        $ref_spidered_links,
        $ref_links_new,
        $ref_links_visited_fresh,
        $ref_links_visited_old,
        $ref_links_error,
    );
    if ($err) {
        print "Error: $err.\n";
    }

Saves all links from this crawl session to the pending pages file (search.pending.txt). File format is:

    URL &url_encode(realm) number

where number is one of:

    0 => waiting to be indexed
    2 => encountered problems during index
    2+ => epoch time of the index operation

=cut

=item SearchDatabase

Searches the database. Returns the total pages searched and an array of hits by reference.

=cut

=item SearchIndexFile

Usage:

    &SearchIndexFile( $index_file, $search_code, \$pages_searched, \@HITS );

Searches the given index file. Uses by-reference return values for the total pages searched and the array of hits.

=cut

=item SearchRunTime

Usage:

    &SearchRunTime( $realm, $DocSearch, \$pages_searched, \@HITS );

=cut

=item SelectAdEx

Usage:

    my @Ads = &SelectAdEx( \@SearchTerms );

Returns the text for up to 4 ads, based on keyword matches with @SearchTerms.

=cut

=item SendMailEx

Specification

Lightweight, portable Perl library for sending mail in a reliable fashion. Designed for the occasional message, not for being a massive 24x7 mailer.

Requirements:

    absolutely zero dependencies; no external Perl modules, etc.
    clean: use strict, -w, -W, -T, prototypes ok
    callable as a single standalone function, not a package.
    use byref hash to optionally preserve state between calls
    must be able to send mail w/ raw sockets for those hosts without command-line sendmail (NT)
    must be able to send mail w/ command-line sendmail for those hosts without sockets privileges on port 25 (free webhosts)
    allow caller to specify buffered/unbuffered I/O (sysread vs read, syswrite vs print)
    must be very safe with user data - try really hard not to lose messages (retry, option to save to disk on socket failure, etc.)
    able to send mail multiple ways - sockets, |sendmail, or save-to-file
    must comply with "run 4ever" goal - don't overflow file system with saved messages, etc.
    allow verbose/debug mode which traces all socket traffic
    when possible, should auto-detect necessary SMTP servers - currently uses `nslookup`
    use extracted strings array for error messages. allow caller to import a translated set.
    do not write to STDOUT; do your work and return error status; let calling code deal with the user

Internal Structure:

Network Client Cache - %nc_cache - $p_nc_cache

hash (or reference to) with:

values:

    V:loaded = 1 or undef depending on whether these values have been queried:

    $$p_nc_cache{'V:PF_INET'} = PF_INET();
    $$p_nc_cache{'V:SOCK_STREAM'} = SOCK_STREAM();
    $$p_nc_cache{'V:PROTO'} = scalar getprotobyname('tcp');

hostnames: (all hostnames converted to lowercase)

    H:foo.bar.com => 4-byte IP address or undef()

Usage:

    my $message = <<"EOM";
    Hi there Bob!
    How has life been treating you?
    Regards,
    Joe
    EOM

    my ($err, $trace) = &SendMailEx(
        'to' => 'user@host.com',
        'to name' => 'Bob User', # *
        'from' => 'me@host.com',
        'from name' => 'Sally User', # *
        'subject' => 'Hi Sally', # *
        'message' => $message,
        'host' => 'mail.foo.com', # *
        'port' => 25, # *
        'saveto' => 'e:/saved_msgs',
        'max_saved_messages' => 1000,
        'handler_order' => '12345',
        'always_save' => 1,
    );
    # * optional field

    if ($err) {
        print "Error: $err.\n";
    }
    else {
        print "Success: sent mail okay.\n";
    }
    print "Here is the trace:\n\n";
    print "\n$trace\n\n";

SendMailEx knows of 2 ways to handle a message:

1. pipe the message to a process, such as /usr/sbin/sendmail or c:/blat.exe, defined with the 'pipeto' parameter. If using /usr/sbin/sendmail, include the "-t" flag in the pipeto input, i.e.:

    'pipeto' => '/usr/sbin/sendmail -t',

2. deliver to a known SMTP server, defined using the 'host' parameter

The options are listed above in the order of speed and reliability.

Saving the message to a folder is generally just a failover method to prevent the loss of user data - no message will actually be sent.

By default, SendMailEx will attempt those methods in order. You can override this with the 'handler_order' parameter, which is a string like "12345" or "54321" or "23".

If parameters 'pipeto', 'host', or 'saveto' aren't defined, this process will skip the handling methods which depend on them.

=cut

=item SetDefaults

Usage:

    my $text = &SetDefaults( $html, \%params );

Takes $html, which is an HTML fragment including FORM elements, and sets all default attributes to match %params.

Requires strict format: Generally will accept double-quoted attributes, and unquoted attributes which don't contain any embedded space.

In the case of replacing "hidden"-type fields, will only insert new values for hidden form elements that do not already have a value.

This code will insert CHECKED and SELECTED attributes for the appropriate form elements, but will not overwrite existing CHECKED and SELECTED attributes. The recommended way to formulate your input forms is to not use these explicit defaults.

The code will overwrite default VALUE="x" values for INPUT TEXT and INPUT PASSWORD and TEXTAREA.

=cut

=item StandardVersion

The following three functions return the HTML text for printing a single hit.
&StandardVersion() returns the normal text. &AdminVersion() returns the same text as StandardVersion with the addition of "Edit" and "Delete" buttons, and re-routes all links through the redirector.

Usage:

    my $textoutput = &StandardVersion(\@SearchTerms, %pagedata);

=cut

=item Suspend

Used for ReadWrite activity that spans multiple object lives. Two relevant methods, Suspend and Resume.

Suspend saves the read/write depth of the related files to the $filename.exclusive_lock_request file.

Resume opens the files as would ReadWrite (does opposite checks - the .elr and .tmp must exist). It seeks to the appropriate places in the files before handing the handles back.

=cut

=item Trim

Usage:

    my $word = &Trim(" word \t\n");

Strips whitespace and line breaks from the beginning and end of the argument.

=cut

=item UpdateIndex

For local realms. Update procedure used to update all records.

Usage:

    ($err, $is_complete) = &UpdateIndex( $p_realm_data );

First call GetFiles2() to build a file of all the things.

Algorithm: (Must all be done in a single process... not restartable...)

    Use GetFiles() to create a list of all files and their lastmod times
    Build a hash of $lastmod{url} = time
    loop through all records in the existing index
        unless lastmod(url)
            delete record
            next
        file_time = lastmod(url)
        delete lastmod(url)
        if (file_time == lastmod_index)
            preserve record
        else
            (file = url) =~ s!^base_url!base_dir!o;
            record = build_new_record(file)
            update record
    foreach (keys %lastmod)
        (file = url) =~ s!^base_url!base_dir!o;
        record = build_new_record(file)
        insert record

=cut

=item WriteFile

Usage:

    $err = &WriteFile( $file, $text );

This is a wrapper around the LockFile object and its ReadWrite method. Useful for writing small text files where the entire file contents can be stored in memory ($text).

=cut

=item WriteRule

Attempts to save the name-value pair to the Rules hash.
If the $name-$value pair being assigned is already the current setting in $Rules, then this function will short-circuit and return a success result.

Usage:

    $err = &WriteRule( $name, $value );
    if ($err) {
        print "Error: $err.\n";
    }

=cut

=item _fdr_validate

Usage:

    my $FDR = &FD_Rules_new();
    my ($is_valid, $valid_value) = $FDR->_fdr_validate($name, $value);

Returns a Boolean indicating whether the rule is valid, according to the internal %defaults array. Note that $name's which are not defined in %defaults will always return as valid, with $valid_value = $value.

For Boolean data types, a $value which is undefined or a null string will return $is_valid = 1 with $valid_value = 0.

Returns $valid_value as either argument $value, or the onboard default.

=cut

=item _handle_folder

Recursively-called function for gathering all the files in a folder which need to be indexed.

=cut

=item _load_filter_rules

=cut

=item add

This method will check for the existence of index files; if they don't exist, it will attempt to create a zero-byte file. If the creation fails, it will not load the realm.

=cut

=item add_filter_rule

Usage:

    $err = $fr->add_filter_rule();

=cut

=item admin_link

Usage:

    my $link = &admin_link(
        'Action' => 'Foo',
        'Name' => 'Value',
    );

Returns an admin URL with the passed name-value parameters. Will URL-encode the names and values.

=cut

=item admin_main

Usage:

    $err = &admin_main();

=cut

=item anonadd_main

Function controlling visitor submissions of URL's.

=cut

=item basetime

=cut

=item check_db_config

Usage:

    my ($err, $addr_count, $realmcount, $log_exists) = &check_db_config($verbose);
    if ($err) {
        print "Error: your database is not configured properly.\n";
        print $err;
    }

Returns a text error message if the database is not configured properly.

=cut

=item check_filter_rules

TODO: document the p:, p:m:, and _udav namespaces

Note: all regex passed to this subroutine are already guaranteed valid by the &validate() routine called earlier by the object. Thus no error checking is done on regex.

Usage:

    my $url_to_get = 'http://www.xav.com/';
    my $document_text = '';
    my $fr = &fdse_filter_rules_new();
    my ($is_denied, $requires_approval, $promote_val, $filter_err, $no_update_on_redirect, $b_index_nofollow, $b_follow_noindex) = ();

    ($is_denied, $requires_approval, $promote_val, $filter_err, $no_update_on_redirect, $b_index_nofollow, $b_follow_noindex) = $fr->check_filter_rules( $url_to_get, '', 1);
    if ($is_denied) {
        print "URL '$url_to_get' is denied - $filter_err\n";
        exit;
    }

    $document_text = get( $url_to_get );

    ($is_denied, $requires_approval, $promote_val, $filter_err, $no_update_on_redirect, $b_index_nofollow, $b_follow_noindex) = $fr->check_filter_rules( $url_to_get, $document_text, 0);
    if ($is_denied) {
        print "URL '$url_to_get' is denied - $filter_err\n";
        exit;
    }

    if ($requires_approval) {
        # queue
    }
    else {
        # add to index
    }

=cut

=item check_parse_patterns

Usage:

    &check_parse_patterns( $text, \%metadata );

=cut

=item check_regex

Usage:

    $err = &check_regex($pattern);

Checks against ?{} code-executing expressions. Uses an eval wrapper to confirm that the expression is valid.

=cut

=item check_rule

=cut

=item clean_path

Usage:

    $clean_path = &clean_path( $path );

Function for stripping garbage from web page paths. It will collapse "." and ".." paths, collapse stacked /// slashes, and strip pound links.

Examples:

    "/foo/../bar/index.htm" => "/bar/index.htm"
    "/test.htm#top" => "/test.htm"
    "/../foo/bar" => "/foo/bar"
    "////top//level/../no_this/./file" => "/top/no_this/file"

This is used to cleanse links discovered in user input or in web pages that the crawler visits. It is also used to clean forbidden paths in the robots.txt files (by cleaning both the original URL and the exclusion paths with the same function, we minimize the risk of hitting an excluded path.)

Updated 2002-06-05

=cut

=item clear_error_cache

Usage:

    ($err, $error_lines) = &clear_error_cache();

Attempts to remove all cached error pages from file "search.pending.txt". Returns $err on failure, and integer $error_lines on success.

=cut

=item compress_hash

Usage:

    &compress_hash( \%pagedata );

This function is solely responsible for initializing any time fields that haven't been set yet. Time fields are: lastindex, lastmodtime, dd, yyyy, mm

=cut

=item convert_pdf_to_text

Usage:

    ($err, $content_type, $text) = &convert_pdf_to_text( $pdf_body );

Attempts to convert a PDF binary stream into a readable text stream, by shelling out to the xpdf toolkit.

=cut

=item create_conversion_code

Usage:

    my $code = &create_conversion_code( $b_verbose );

Creates a block of Perl code (for later use in eval()) which will: 1. convert HTML entities to the appropriate byte in the Latin-1 character set 2.
converts characters based on the accent sensitivity and case sensitivity settings under Character Conversion 3. strips any remaining non-word characters

When the $b_verbose flag is set, an HTML table will be printed which shows all characters, their word/non-word status, and the values that they will be converted to.

=cut

=item create_db_config

Usage:

    my $err = create_db_config($overwrite, $verbose);
    if ($err) {
        print "Error: unable to create database configuration.\n";
        print $err;
    }

Attempts to create an FDSE database. If $overwrite is true, then it will overwrite existing data.

Returns an HTML multi-error message if the database cannot be created.

=cut

=item create_file_list

=cut

=item create_sql_log

Usage:

    my $err = &create_sql_log();

Creates a SQL table in the defined database, which is used to store the terms searched by visitors.

=cut

=item db_exec

Executes a single one-line SQL statement.

=cut

=item delete_filter_rule

Deletes the filter rule '$name' from the internal array, and then saves the filter rules to disk.

Usage:

    my $err = $FR->delete_filter_rule( $name );
    if ($err) {
        print "Error: $err.\n";
    }

=cut

=item delete_index_file

Usage:

    &delete_index_file( $realm_file );

Attempts to delete the index file and all associated files. Prints error information to output.

=cut

=item fdse_filter_rules_new

Usage:

    my $FR = &fdse_filter_rules_new();

Returns the object for managing Filter Rules. Each filter rule is a hash of name-value pairs, including the p_strings => \@strings pair and the litstrings pair.

Lookup of filter rules is by name on the $FR hash itself, like $p_data = $FR->{'Admin Pages'}. Any hash element in $FR which is a hash reference is treated as a filter rule.

=cut

=item fdse_realms_new

Note that the SQL column "is_runtime" has been overloaded to mean "type". Done so that ppl don't have to rebuild their databases as I add new realm types. This'll be changed when I next break with reverse compat.

=cut

=item format_term_ex

Usage:

    my ($type, $is_attrib_search, $str_pattern, $sql_clause) = &format_term_ex($user_entered_term, $default_type);

Returns:

    $type of 0 == ignored, 1 == forbidden, 2 == optional, 3 == required
    $is_attrib_search is 1 iff the term is like "title:foo" or "link:xav.com".
    $str_pattern is the pattern to put against the Record to test for existence
    $sql_clause is suitable for insertion in "SELECT * FROM $Rules{'sql: table name: addresses'} WHERE ($sql_clause) AND ($sql_clause)"

examples:

    text LIKE '%foo%'

or

    ut LIKE '%my phrase%'

=cut

=item freeh

Free file handle. Unlocks the handle with flock() and then closes. Returns last error.

=cut

=item frwrite

Saves the filter rules to their file.

Usage:

    $err = $FR->frwrite();
    if ($err) {
        print "Error: $err.\n";
    }

=cut

=item get_absolute_url

=cut

=item get_age_str

Usage:

    $age_str = &get_age_str( time() - $lastmodt );

=cut

=item get_dbh

Creates an open database connection using the byref parameter. Returns an error string on failure.

Usage:

    my $err = &get_dbh( \$dbh );
    if ($err) {
        print "Error: $err.\n";
    }

=cut

=item get_default_name

Usage:

    my ($defname, $deffile) = $realms->get_default_name( $base_url );

=cut

=item get_defaults

=cut

=item get_next_file

=cut

=item get_open_realm

Usage:

    my ($err, $p_realm_data) = $realms->get_open_realm();

Returns a realm object for the first open-style realm (type == 1). If no open realms are defined, will create one and return a pointer to it, or an error regarding the failure to create a realm.

=cut

=item get_remote_host

Usage:

    $hostname = &get_remote_host();

This subroutine will attempt to look up a resolved hostname from the REMOTE_HOST environment variable. If none is found, or if it appears to be an IP address, then the $private{'visitor_ip_addr'} will be resolved to a hostname and returned.

Uses global hash key $private{'remote_host'} as a hidden cache.

=cut

=item get_web_folder

Usage:

    my $url = &get_web_folder($url);

Takes a URL and reduces it to the folder descriptor:

    http://www.xav.com => http://www.xav.com/
    http://www.xav.com/~bob => http://www.xav.com/~bob/
    http://www.xav.com/~bob/index.html => http://www.xav.com/~bob/

=cut

=item get_website_realm

Usage:

    my ($err, $p_realm_data) = $realms->get_website_realm( $url );

Returns a realm object for the first website-style realm whose base_url matches $url. If no such website realm exists, it will try to create one. If it fails, an error message will be returned.

=cut

=item get_wname

=cut

=item hashref

Provides quick access to a hash containing all the information about a realm.

Usage:

    my ($err, $p_realm_data) = $realms->hashref( 'foo' );
    if ($err) {
        print "Error: $err.\n";
    }

=cut

=item html_encode

Usage:

    my $html_str = &html_encode($string);

Formats string consistent with embedding in an HTML document. Escapes the ", >, <, and & characters.

=cut

=item html_select_ex

Usage:

    my ($count, $html) = $realms->html_select_ex( $attrib, $default, $class, $width1 );

=cut

=item leadpad

Usage:

    my $buffer = &leadpad( "foo", "0", 10 );

returns "0000000foo"

=cut

=item leansock

Usage:

    $err = &leansock($host,$port,\*GLOBFILE,$p_nc_cache);

Attempts to create and connect an unbuffered socket to $host:$port, referenced by *GLOBFILE. Hash reference to %nc_cache holds socket values and cached DNS lookups.

Does not call getservbyname() because the protocol is not generally known. Expects an explicit port; if you want to be psycho and ask an api for the port number, do so on your own before calling.

During benchmarks on Win2000 2x550MHz, a basic Perl loop w/ 10^4 iterations of simple string assignment executed in about 2.39 seconds. With 1 iteration, it took 1.65 seconds. With a call to "use Socket" followed by 10^4 iterations, it took 2.88 seconds. This suggests a basic Perl interpreter initialization cost of 1.65 seconds, with an additional 0.49 second when "use Socket" is called (+33%). For systems where an initial read from a text data file is a pre-requisite anyway, it may pay off to keep a short-term cache of static return values for Socket functions.

=cut

=item list_filter_rules

    my @rules = $indexrules->list_filter_rules();
    foreach my $p_rule (@rules) {
        my %rule = %$p_rule;
        # fields available on each rule:
        #   $rule{'name'}
        #   $rule{'action'}
        #   $rule{'occurences'}
        #   $rule{'promote_val'}
        my $p_string = $rule{'p_string'};
        foreach (@$p_string) {
        }
    }

=cut

=item list_system_rules

=cut

=item listrealms

Usage:

    my @realms = $realms->listrealms('all');

Returns an array of references to all realms which match the attribute parameter.

=cut

=item load

Usage:

    my $realms = &fdse_realms_new();
    my $err = $realms->load();
    if ($err) {
        print "Error: $err.\n";
    }

=cut

=item load_custom_metadata

Usage:

    $err = &load_custom_metadata( $url, \%metadata );
    next Err if ($err);

=cut

=item load_desc

=cut

=item load_files_ex

Usage:

    my $err = &load_files_ex( $support_dir );

This function attempts to load all the script-specific data from files.

Sequence:

    require's common.pl
    uses common.pl to call &ReadInput to process user commands
    based on user's commands, may require common_parse_page.pl and/or common_admin.pl
    changes directory to data folder
    loads strings
    loads realms
    loads rules

Failures with any of these actions are considered fatal errors, and the return values are set appropriately.

=cut

=item load_pics_descriptions

Usage:

    my (@pics_codes, @pics_names, @pics_values) = ();
    $err = &load_pics_descriptions( 'RASCi', \@pics_codes, \@pics_names, \@pics_values );
    next Err if ($err);

=cut

=item log_search

Usage:

    my $err = &log_search( $realm, $terms, $rank, $documents_found, $documents_searched );

Where:

    $realm == the realm name; 'All' for cases where the realm hasn't been specified
    $terms * == the literal string that the user typed in.
    $rank == the starting number in displaying hits. will be 1 for first search, 11 for "Next", 21 for "Next" after that, etc. used to calculate the depth that visitors go in searching for data
    $documents_found == integer; total documents matching $terms. in theory $rank <= $documents_found

* when writing to the log, any commas or line breaks will be stripped from the Terms. Also, they will be &html_encode'd so "<" => "&lt;" etc.

The function internally looks up the visitor IP/hostname and the current time.

The $err is typically discarded (no reason to frighten visitors)

=cut

=item migrate_log

Usage:

    &migrate_log( 'search.log.txt' );

Migrates a text log from the version before 2.0.0.0029 to the newer version. Handles cases where the text logfile contains a mix of old and new records.

Writes status and error handling text to stdout.
The entire function is wrapped in an eval statement to protect against Time::Local not being available, or Time::Local trying to kill the process.

=cut

=item pagedata_from_file

Usage:

    ($err, $url) = &pagedata_from_file( $file, $URL, \%pagedata, \$fr );

$fr is an initialized filter rules object (passed by reference between calls to pagedata_from_file for efficiency).

=cut

=item parse_meta_header

Usage:

    my $value = &parse_meta_header(\$text, 'meta_name');

Returns the text of the CONTENT attribute for the first META tag whose NAME matches the second parameter.

Example:

    my $text = '<META NAME="robots" CONTENT="none">';
    my $value = &parse_meta_header(\$text, "robots");
    # $value = 'none'

As an optimization:

    only searches the first 4096 bytes of $text
    requires that the NAME= or HTTP-EQUIV= attribute be the very first in the tag with CONTENT= following somewhere, OR that the NAME= attrib be last AND that the CONTENT= attrib be first

Returns an empty string if no matching META tag is found.

=cut

=item parse_pics_label

Usage:

    my ($is_denied, $require_approval, $err) = $self->parse_pics_label( $text );

Determines whether there is a PICS meta tag in the HTML $text supplied. If there is, and if this script is concerned with PICS (as evidenced by the appropriate %Rules), then it parses the tag and compares values to the %Rules maximums.

If it finds that the document will $require_approval, it notes this and continues parsing. If it finds that the document $is_denied, it exits immediately. The $err contains information about the final rule violated.

=cut

=item parse_search_terms

Usage:

    my ($bTermsExist, $Ignored_Terms, $Important_Terms, $DocSearch, $RealmSearch, $where_clause, @SearchTerms) = &parse_search_terms( $FORM{'terms'}, $FORM{'match'} );

This function takes the user's search terms and builds a set of regular expressions that can be used to parse the index files.

Also builds a SQL select statement that will select the proper records.
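As a rough illustration of turning terms into both a regular expression and a SQL clause, consider the sketch below. It is not the parse_search_terms code: build_term_patterns_sketch is a hypothetical name, every term is assumed to be required, and the "text" column is borrowed from the format_term_ex example. LIKE wildcards inside a term are not escaped here.

```perl
use strict;
use warnings;

# Hypothetical helper: one case-insensitive regex and one LIKE clause
# per term. Assumes all terms are required (joined with AND).
sub build_term_patterns_sketch {
    my (@terms) = @_;
    my (@regexes, @clauses);
    foreach my $term (@terms) {
        next unless defined $term && length $term;

        # \Q...\E keeps user input from being read as regex syntax.
        push @regexes, qr/\Q$term\E/i;

        # Double any single quote so the literal stays inside the string.
        (my $sql_term = $term) =~ s/'/''/g;
        push @clauses, "(text LIKE '%$sql_term%')";
    }
    return (\@regexes, join(' AND ', @clauses));
}

my ($regexes, $where) = build_term_patterns_sketch('foo', "bob's page");

# A made-up index record to test the regexes against.
my $record = "Welcome to Bob's Page about foo and bar";
my $misses = grep { $record !~ $_ } @$regexes;

print "WHERE $where\n";
print "record matches all terms: ", ($misses ? 'no' : 'yes'), "\n";
```

Where a database driver supports bound parameters, passing the terms as placeholders is generally safer than building the quoted literal by hand.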
=cut

=item parse_text_record

Usage:

    ($is_valid, %pagedata) = &parse_text_record( $textline );

Converts a line of text from an index file into a pagedata hash.

=cut

=item parse_url_ex

Usage:

    my ($err, $clean_url, $host, $port, $path) = &parse_url_ex($url);

=cut

=item pppstr

Usage:

    &pppstr(100, $!, $^E);

This is the Paragraph-Print Parse String function.

=cut

=item ppstr

Usage:

    &ppstr(100, $!, $^E);

This is the Print Parse String function.

=cut

=item present_queued_pages

Usage:

    &present_queued_pages( $realm );

Displays a list of all pages waiting for approval.

=cut

=item print_realm_table_header

Prints the TH row.

=cut

=item print_realm_table_row

Prints realm information and commands.

=cut

=item process_queued_pages

Handles the user's Approve/Deny/Wait commands against the list of waiting pages.

=cut

=item process_text

Usage:

    my ($err, $allow_follow, $is_redirect, $full_redir_url, $index_as, $lastmodt) = &process_text( \$text, $url, $b_is_binary );

=cut

=item pstr

Usage:

    my $string = &pstr(100, $!, $^E);

This is the Parse String function. The first argument is the line number from strings.txt from which to pull the template string. All remaining strings in the argument list are substituted as $s1, $s2, $s3, etc., in the template string.

=cut

=item query_database

&query_realm implementation for database-based indexes.

=cut

=item query_env

Usage:

    $value = &query_env('SCRIPT_NAME');

=cut

=item query_file

&query_realm implementation for file-based indexes.

=cut

=item query_realm

Usage:

    $err = &query_realm( $realm, $url_pattern, $start_pos, $max_results, \%crawler_results );
    if ($err) {
        print "Error: $err.\n";
    }

=cut

=item query_runtime

&query_realm implementation for runtime realms.

=cut

=item quit

Usage:

    $err = $gf->quit($b_save_file);
    next Err if ($err);

Closes the cache filehandle, and deletes the file (unless $b_save_file is set).

=cut

=item raw_get

Abstraction layer for choosing between &raw_get_raw and &raw_get_alarm.

=cut

=item raw_get_alarm

Same as &raw_get(), but wrapped with a Unix alarm to protect against unresponsive hosts.

=cut

=item raw_get_raw

raw_get_raw makes the actual socket-level request. The higher-level webrequest function handles robots exclusion and redirects.

=cut

=item read_tokens

Returns the hash of auth_tokens from the tokens file.

Usage:

    ($err, %tokens) = &read_tokens();
    if ($err) {
        print "Error: $err.\n";
    }

=cut

=item realm_count

Usage:

    my $int_realms = $realms->realm_count('all');
    my $int_bound_realms = $realms->realm_count('has_base_url');

Returns an integer for the number of realms that match the attribute passed as an argument. If no attribute is passed, returns the total number of realms.

=cut

=item realm_interact

Usage:

    my %code = ();
    &realm_interact( $p_realm_data, $Rules{'sql: enable'}, \%code );

Assumes:

    my ($i_url, $i_lastmodt, $i_record, %pagedata, $write_err) = ()
    use $i_line to seek for a resume operation
    $i_line is also incremented with the record count, during operations, for use in suspend/resume operations
    standard Err block handling

Returns:

    $code{'init'}
    $code{'resume'}
    $code{'suspend'}
    $code{'abort'}
    $code{'finish'}
    $code{'get_next'}    assigns to ($i_url, $i_lastmodt, $i_record)
    $code{'update'}      writes based on $i_url / %pagedata
    $code{'insert'}      writes based on %pagedata
    $code{'preserve'}    ($i_url / $i_record)
    $code{'delete'}      ($i_url)

=cut

=item rebuild_realm

Usage:

    my ($err, $is_complete) = &rebuild_realm( $realm );

Attempts to rebuild the realm. Does The Right Thing based on the type of realm we're dealing with.

=cut

=item regkey_validate

Usage:

    $is_valid = &regkey_validate( $Rules{'regkey'} );

=cut

=item regkey_verify

Usage:

    &regkey_verify();

Returns FDSE version, administrator last-login time, Freeware/Trial/Registered mode, and registration key.

=cut

=item remove

Usage:

    $realms->remove( $name, $permanent );

No error handling: this just modifies the in-memory copy and doesn't persist to disk.

=cut

=item resume_file_position

Usage:

    $gf->resume_file_position($pos);

Treats $pos == 0 as the start position, so an argument of 0 will cause nothing to happen.

=cut

=item rewrite_url

=cut

=item s_AddURL

Usage:

    $err = &s_AddURL($b_IsAnonAdd, $Realm, @AddressesToIndex);

This is the main function for adding web pages to the realms, both for administrators and anonymous visitors. Internally handles the crawling, error handling, HTML parsing, and storage.
If any error occurs, then s_AddURL will handle it by printing to the screen. However, it will also return a copy of the last error experienced, for use by routines which programmatically call s_AddURL, like s_CrawlEntireSite.

=cut

=item s_CrawlEntireSite

Usage:

    my ($err, $is_complete) = &s_CrawlEntireSite( $realm );

=cut

=item s_create_edit_rule

Usage:

    $err = &s_create_edit_rule();

Presents the HTML form for creating or editing a Filter Rule. Handles submission of that form as well.

Error handling: returns a localized text error fragment if there is a problem. Otherwise writes status to the screen.

=cut

=item save_custom_metadata

Usage:

    $err = &save_custom_metadata( $url, %metadata );
    next Err if ($err);

Call with an undefined second parameter to delete the entry.

=cut

=item save_realm_data

Usage:

    my $err = $realms->save_realm_data();

Takes the current $realms object and persists it to the associated file. Returns the error/success of the operation.

Since save_realm_data is typically called whenever state has changed, this method also flushes all caches.

=cut

=item sendmail_build_raw_message

=cut

=item sendmail_datetime

Usage:

    $time_str = &sendmail_datetime($time_int);

=cut

=item sendmail_socket

Attempts to send an email message through the specified SMTP gateway. Returns $err if something goes wrong. Returns $trace of all socket activity regardless.

=cut

=item setpagecount

Usage:

    $name = "My Realm";
    $n_pages = 1000;
    print "Now there are $n_pages pages in realm '$name'!\n";
    $err = $realms->setpagecount($name, $n_pages);
    if ($err) {
        print "Error: $err.\n";
    }

=cut

=item str_jumptext

Usage:

    my ($jump_sum, $jumptext) = &str_jumptext( $current_pos, $units_per_page, $maximum, $url, $b_is_exact_count );

    $jump_sum = "Documents 1-10 of 15 displayed."
    $jumptext = "<- Previous 1 2 3 4 5 Next ->"

Everything is 1-based.

=cut

=item str_search_form

Usage:

    my $html = &str_search_form( $url );

Returns the text of a search form whose FORM ACTION attribute points to $url. Based on the 'searchform.htm' template.

Uses the variable $url because, internally, we use safer relative URLs. For exporting the search form to other sites, though, we need to be able to create the search form with an absolute URL.

=cut

=item text_record_from_hash

Creates a textfile record out of the constituent fields.

Usage:

    my ($err, $text_record) = &text_record_from_hash(\%pagedata);

=cut

=item timegm

Usage:

    my %timecache = ();
    $time = &timelocal($sec,$min,$hours,$mday,$mon,$year,\%timecache);
    $time = &timegm($sec,$min,$hours,$mday,$mon,$year,\%timecache);

Arguments:

    $mday is human time, i.e. 1..31
    $mon is computer time, i.e. 0..11
    $mon can be a text string like "JUN" or "JUL"
    $year should be 4-digit; if less than 999, some sort of algorithm will force a 4-digit year.

These routines were taken from the Time::Local module. They have been extracted into small functions so that they can be safely called from platforms that do not have the Time::Local module installed. Also, the error handling has been changed so that it never croaks (what were they smoking when they designed it that way?). Caching has been cleaned up and made optional.

Error Handling:

    Will return 0 if unable to handle the input values.
    Will return 0 if the year is out of range (less than 1970 or more than 2037).
    All other range checking has been removed.

=cut

=item timelocal

=cut

=item ui_AdminPage

Usage:

    &ui_AdminPage();

Default view into the search engine.

=cut

=item ui_DataStorage

=cut

=item ui_DeleteRecord

Usage:

    &ui_DeleteRecord();

DeleteRecord provides an interactive HTML interface for record deletions. It allows:

    record deletion based on Realm and URL(s)
    querying for multiple records based on URL patterns

It is primarily called from the AdminVersion output. It can also be called by itself, for pattern-deletes.
    if $realm and $query_pattern
        DeleteRecord will search $realm for all records which match $query_pattern. They are shown to the user, who can then choose whether to delete all those records or not.
    else if $realm and @urls_to_delete
        DeleteRecord will try to delete all the records by calling update_realm.
    else
        DeleteRecord will offer a delete interface - browse realm or select realm, type in URL to delete.

In $query_pattern, only ".*" can be used safely, because the @url_patterns may be handed off to SQL, where ".*" is mapped to "%". However, other Perl regular expressions will be passed through, so enhanced Perl expressions (or SQL expressions) can still be leveraged if the user knows about the underlying data storage system.

Code-executing regular expressions using ?{} will be stripped for security.

=cut

=item ui_FilterRules

This function handles the admin user interface for managing filter rules.

Usage:

    &ui_FilterRules();

Error handling is done by printing HTML to the end user.

=cut

=item ui_GeneralRules

Usage:

    &ui_GeneralRules( $action_name, $action_value, @settings );

Displays the settings from the %Rules hash, and the description for each setting. Allows validated edits for each setting based on datatype.

In general, the %Rules architecture should be replaced with an array. Using an English-keyed hash is hard to translate, and also uses more memory.

=cut

=item ui_License

Usage:

    &ui_License();

Allows users to select one of three license modes: Freeware, Trial Shareware, and Registered Shareware. Allows the user to input a registration key.

=cut

=item ui_ManageAds

This prints the admin view HTML for controlling advertisements. It also handles the actions of the forms on this UI, including changing positions, defining new ads, and resetting usage data.

=cut

=item ui_ManageRealms

Usage:

    &ui_ManageRealms();

Presents the HTML form used to define a new realm, or to customize an existing realm.
=cut

=item ui_PersonalSettings

Usage:

    &ui_PersonalSettings();

Controls email settings, password, security, etc.

=cut

=item ui_Rebuild

Usage:

    &ui_Rebuild();

Attempts to rebuild the given realm.

=cut

=item ui_ReviewIndex

Usage:

    &ui_ReviewIndex();

This function prints out the AdminVersion line listings for up to $max_results_to_show records in the given realm, starting at $start_pos. Mainly a wrapper around &query_realm().

Dependencies:

    $realms %FORM %const %str &AdminVersion &query_realm &str_jumptext &url_encode

TODO: standardize the search interface that DeleteRecord ended up with; just have a standard query interface.

=cut

=item ui_Rewrite

Manages the URL-rewriting patterns.

=cut

=item ui_UserInterface

Usage:

    &ui_UserInterface();

Handles the entire process of editing user-interface specific settings.

=cut

=item ui_ViewStats

Usage:

    &ui_ViewStats();

Provides the full user interface for viewing the search log. All error handling is done via HTML presented to the user; no errors are returned.

=cut

=item update_database

Usage:

    ($err, $entry_count, $duplicates) = &update_database( $realm, \%crawler_results );

Applies updates to the database, based on the change requests stored in %crawler_results.

=cut

=item update_file

Usage:

    my ($err, $entry_count, $duplicates) = &update_file( $realm, \%crawler_results );
    if ($err) {
        print "Error: $err.\n";
    }

=cut

=item update_realm

Incorporates the results of a crawl - stored in the %crawler_results hash - into the underlying storage container for $realm. Includes adding new records, updating existing records, and deleting expired records.

Usage:

    my ($err, $total_records, $new_records, $updated_records, $deleted_records) = update_realm( $realm, \%crawler_results );
    if ($err) {
        print "Error: $err.\n";
    }
    else {
        print "There are now $total_records web pages in the '$realm' realm - $new_records records created; $updated_records updated; $deleted_records removed.\n";
    }

=cut

=item url_decode

=cut

=item url_encode

Usage:

    my $str_url = &url_encode($string);

Formats strings consistent with RFC 1945 by rewriting metacharacters in their %HH format.

=cut

=item use_database

Usage:

    $realms->use_database( 1 );

Sets the $self->{'use_db'} scalar. Useful for data migration.

Example:

    # Loads all realm data from file; saves all data to database:
    $realms->use_database(0);
    # now using file
    $realms->load();
    # all realm data is currently in memory
    $realms->use_database(1);
    # now using database
    $realms->save_realm_data();
    # just wrote all the data to the database

=cut

=item validate

This function takes all the parameters that could make up a filter rule, and determines whether they are valid or not. Returns a text error message if the rule would not be valid.

Usage:

    $err = $FR->validate($enabled, $name, $action, $promote_val, $analyze, $mode, $occurrences, $apply_to, $apply_to_str, $p_strings, $p_litstrings);
    if ($err) {
        print "Error: $err.\n";
    }

=cut

=item webrequest

Handles the high-level HTTP request.

=cut

=item write_tokens

Saves the %tokens hash to the auth tokens file.

Usage:

    $err = &write_tokens(%tokens);
    if ($err) {
        print "Error: $err.\n";
    }

=cut