From g.bartol@comune.prato.it Fri Jan 21 08:43:07 2000
Date: Fri, 21 Jan 2000 13:08:41 +0100
From: Gabriele Bartolini
To: htdig3-dev@htdig.org
Subject: [htdig3-dev] Final Patch for Retriever stuff

Ciao amici,

I tested my patch last night (automatically, of course ;-) and again this morning, and I have come up with good results. The retrieving system now works fine, with persistent_connections either activated or not. I also tried the head_before_get attribute, and it seems to me to be OK.

Obviously I am not completely sure, because I haven't worked on that code for a couple of months. That's why I want to wait before I COMMIT the changes. But, as I wrote the HtHTTP and Transport code, I think that 'logically' it works. And it works wonderfully in my environment (11 web servers, with about 10000 documents: the whole process took 2 hours and 10 minutes with persistent_connections activated and head_before_get 'on').

I also modified the Retriever code to show HTTP connection stats at the end of htdig if the '-s' option has been chosen.

Here are some interesting results. I ran htdig 3 times on a restricted area of my environment, indexing the following sites:

htdig: Run complete
htdig: 4 servers seen:
htdig:     balwww.comune.prato.it:80 53 documents
htdig:     search.comune.prato.it:80 1 document
htdig:     sportelloamico.po-net.prato.it:80 3 documents
htdig:     www.po-net.prato.it:80 428 documents

This is the first result, with persistent connections and no HEAD before GET:

HTTP statistics
===============
 Persistent connections: Yes
 HEAD call before GET: No

 Connections opened        : 92
 Connections closed        : 91
 Changes of server         : 3

 HTTP Requests             : 491
 HTTP KBytes requested     : 2018,69
 HTTP Average request time : 0,01222 secs
 HTTP Average speed        : 336,448 KBytes/secs

Here's the second, with both options activated:

 Persistent connections: Yes
 HEAD call before GET: Yes

 Connections opened        : 17
 Connections closed        : 16
 Changes of server         : 3

 HTTP Requests             : 909
 HTTP KBytes requested     : 2113,91
 HTTP Average request time : 0,00660066 secs
 HTTP Average speed        : 352,318 KBytes/secs

And here's the test with the traditional way of retrieving (no persistent connections):

HTTP statistics
===============
 Persistent connections: No

 Connections opened        : 489
 Connections closed        : 489
 Changes of server         : 110

 HTTP Requests             : 489
 HTTP KBytes requested     : 2018,65
 HTTP Average request time : 0,0163599 secs
 HTTP Average speed        : 252,331 KBytes/secs

Obviously I benefit (strongly) from persistent connections in a "close" environment like mine, where there are only a few sites to index and therefore few "server changes", each of which requires a connection to be closed and opened again. But it's all CONFIGURABLE, isn't it? So ... No problem !!!

Let me have your opinion on these results. And above all, try the patch over the week-end, so that at the beginning of the week I can either modify it (if something goes wrong - fingers crossed) or - I HOPE - commit it to the cvs tree.

Then let me know if you want to introduce the new configuration attribute for limiting the number of consecutive requests on the same server when persistent connections are on (and maybe propose a name). Ah ... I vote +1 for this configuration attribute; Geoff, please suggest a name.

Ciao to everyone, because I am stopping work now ... I am getting ready to dive into a new wonderful week-end !!! Have a nice week-end and don't work too much !!!

And ... I hope Juventus is going to win, maybe with a goal by Zidane !!! OK, Gab, stop it !!!

Ciao
-Gabriele

[ Part 2: "Attached Text" ]

Index: Document.h
===================================================================
RCS file: /opt/htdig/cvs/htdig3/htdig/Document.h,v
retrieving revision 1.10.2.5
diff -3 -u -p -r1.10.2.5 Document.h
--- Document.h 2000/01/14 01:23:43 1.10.2.5
+++ Document.h 2000/01/21 11:49:41
@@ -74,6 +74,8 @@ public:
     //
     void setUsernamePassword(const char *credentials)
         { authorization = credentials;}
+
+    HtHTTP *GetHTTPHandler() { return HTTPConnect; }
 
 private:
     enum
Index: Retriever.cc
===================================================================
RCS file: /opt/htdig/cvs/htdig3/htdig/Retriever.cc,v
retrieving revision 1.72.2.15
diff -3 -u -p -r1.72.2.15 Retriever.cc
--- Retriever.cc 2000/01/20 03:55:47 1.72.2.15
+++ Retriever.cc 2000/01/21 11:49:48
@@ -26,6 +26,7 @@
 #include "StringList.h"
 #include "WordType.h"
 #include "Transport.h"
+#include "HtHTTP.h" // For HTTP statistics
 
 #include
 #include
@@ -297,43 +298,83 @@ Retriever::Start()
 
     while (more && noSignal)
     {
-       more = 0;
+        more = 0;
 
        //
-       // Go through all the current servers in sequence. We take only one
-       // URL from each server during this loop. This ensures that the load
-       // on the servers is distributed evenly.
+       // Go through all the current servers in sequence.
+       // If they support persistent connections, we keep on popping
+       // from the server queue until we reach a maximum number of
+       // consecutive requests (so we will probably have to issue a new
+       // attribute, like "server_repeat_connections"). Or the loop may
+       // continue for the infinite, if we set the max to -1 (and maybe
+       // the attribute too).
+       // If the server doesn't support persistent connection, we take
+       // only an URL from it, then we skip to the next server.
        //
+
+       // Let's position at the beginning
        servers.Start_Get();
+
+       int count;
+
+       // Maximum number of repeated requests with the same
+       // socket connection.
+       int max_repeat_requests;
+
        while ( (server = (Server *)servers.Get_NextElement()) && noSignal)
        {
            if (debug > 1)
-               cout << "pick: " << server->host() << ", # servers = " <<
+              cout << "pick: " << server->host() << ", # servers = " <<
                    servers.Count() << endl;
 
-           ref = server->pop();
-           if (!ref)
-               continue; // Nothing on this server
-           // There may be no more documents, or the server
-           // has passed the server_max_docs limit
+           // and if the Server doesn't support persistent connections
+           // turn it down to 1.
 
-           //
-           // We have a URL to index, now. We need to register the
-           // fact that we are not done yet by setting the 'more'
-           // variable.
-           //
-           more = 1;
+           // We already know if a server supports HTTP pers. connections,
+           // because we asked it for the robots.txt file (constructor of
+           // the class).
+
+           if (server->IsPersistentConnectionAllowed())
+               // Once the new attribute is set
+               // max_repeat_requests=config["server_repeat_connections"];
+               max_repeat_requests = -1; // Set to -1 (infinite loop)
+           else
+               max_repeat_requests = 1;
+
+           count = 0;
+
+           while ( ( (max_repeat_requests ==-1) ||
+                     (count < max_repeat_requests) ) &&
+                   (ref = server->pop()) && noSignal)
+           {
+               count ++;
+
+               //
+               // We have a URL to index, now. We need to register the
+               // fact that we are not done yet by setting the 'more'
+               // variable. So, we have to restart scanning the queue.
+               //
+
+               more = 1;
+
+               //
+               // Deal with the actual URL.
+               // We'll check with the server to see if we need to sleep()
+               // before parsing it.
+               //
+
+               parse_url(*ref);
+               delete ref;
+
+               // No HTTP connections available, so we change server and pause
+               if (max_repeat_requests == 1)
+                   server->delay(); // This will pause if needed
+                                    // and reset the time
 
-           //
-           // Deal with the actual URL.
-           // We'll check with the server to see if we need to sleep()
-           // before parsing it.
-           //
-           server->delay(); // This will pause if needed and reset the time
-           parse_url(*ref);
-           delete ref;
-       }
+           }
+       }
     }
+
     // if we exited on signal
     if (Retriever_noLog != log && !noSignal)
     {
@@ -1562,5 +1603,28 @@ Retriever::ReportStatistics(const String
        cout << "\n" << name << ": Errors to take note of:\n";
        cout << notFound;
     }
+
+    cout << endl;
+
+    // Report HTTP connections stats
+    cout << "HTTP statistics" << endl;
+    cout << "===============" << endl;
+
+    if (config.Boolean("persistent_connections"))
+    {
+        cout << " Persistent connections : Yes" << endl;
+
+        if (config.Boolean("head_before_get"))
+            cout << " HEAD call before GET : Yes" << endl;
+        else
+            cout << " HEAD call before GET : No" << endl;
+    }
+    else
+    {
+        cout << "Persistent connections : No" << endl;
+    }
+
+    HtHTTP::ShowStatistics(cout) << endl;
+
 }
Index: Server.cc
===================================================================
RCS file: /opt/htdig/cvs/htdig3/htdig/Server.cc,v
retrieving revision 1.17.2.6
diff -3 -u -p -r1.17.2.6 Server.cc
--- Server.cc 1999/12/11 16:19:47 1.17.2.6
+++ Server.cc 2000/01/21 11:49:49
@@ -21,6 +21,7 @@
 #include "Document.h"
 #include "URLRef.h"
 #include "Transport.h"
+#include "HtHTTP.h" // for checking persistent connections
 
 #include
 
@@ -38,8 +39,10 @@ Server::Server(URL u, String *local_robo
     _port = u.port();
     _bad_server = 0;
     _documents = 0;
-    _persistent_connections = 1; // Allowed by default
 
+    // We take it from the configuration
+    _persistent_connections = config.Boolean("persistent_connections");
+
     _max_documents = config.Value("server",_host,"server_max_docs", -1);
     _connection_space = config.Value("server",_host,"server_wait_time", 0);
     _last_connection.SettoNow(); // For getting robots.txt
@@ -78,7 +81,23 @@ Server::Server(URL u, String *local_robo
            }
        }
        else if (!local_urls_only)
+       {
            status = doc.Retrieve(timeZero);
+
+           // Let's check if persistent connections are both
+           // allowed by the configuration and possible after
+           // having requested the robots.txt file.
+
+           HtHTTP *http;
+           if (IsPersistentConnectionAllowed() &&
+               (http = doc.GetHTTPHandler()))
+           {
+               if (! http->isPersistentConnectionPossible())
+                   _persistent_connections=0; // not possible. Let's disable
+                                              // them on this server.
+           }
+
+       }
        else
            status = Transport::Document_not_found;

[ Part 3: "Attached Text" ]

-------------------------------------------------
Gabriele Bartolini
Computer Programmer (are U sure?)
U.O. Rete Civica - Comune di Prato
Prato - Italia - Europa
e-mail: g.bartol@comune.prato.it
http://www.po-net.prato.it
-------------------------------------------------
Zinedine "Zizou" Zidane. Just for soccer lovers.
-------------------------------------------------
-------------------------------------------------