Hello,

I am thinking of using the GeoIP module with input from the maxmind
database converted with the perl script as described through the link on
the nginx site.

I'm curious if the country-ip pairs are managed efficiently so that the
lookup/conversion is very fast or not? That is, does the module do
something like sort the list and then use a binary tree to quickly
locate the country? Is the whole thing loaded in memory? This country
database is quite huge and if this process happens on every hit or even
on only a selected entry page then it could be very slow. Does anyone
here have experience with this?

For my purposes I only really need to detect continents for deciding if
visitors should pull from one of a few server locations. So presumably
it may be possible to combine many countries into larger blocks so that
there are fewer steps in the lookup. Any input on how speedy or
efficient this has shown to be would be super helpful here.

Thanks,
Chris :)
on 16.08.2008 02:54
on 16.08.2008 05:29
Hello!

On Sat, Aug 16, 2008 at 07:43:45AM +0700, Chris Savery wrote:

> I am thinking of using the GeoIP module with input from the maxmind
> database converted with the perl script as described through the link on
> the nginx site.
>
> I'm curious if the country-ip pairs are managed efficiently so that the
> lookup/conversion is very fast or not? That is, does the module do
> something like sort the list and then use a binary tree to quickly
> locate the country? Is the whole thing loaded in memory?

The geo module builds an in-memory radix tree when loading configs. This
is the same data structure as used in routing, and lookups are really
fast.

> This country
> database is quite huge and if this process happens on every hit or even
> on only a selected entry page then it could be very slow. Does anyone
> here have experience with this?

The only inconvenience of using really large geobases is config
reading time. Mine currently takes about 30 seconds to load - but
that's for more than 30 MB of data, and not only countries.

> For my purposes I only really need to detect continents for deciding if
> visitors should pull from one of a few server locations. So presumably
> it may be possible to combine many countries into larger blocks so that
> there are fewer steps in the lookup. Any input on how speedy or
> efficient this has shown to be would be super helpful here.

Aggregating blocks is a good thing to do if you don't need detailed
information, but you'll hardly notice any difference.

Maxim Dounin
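To make the aggregation point concrete, here is a minimal sketch of a geo block with one pair of networks merged. The variable name $region, the NA and EU labels, and the 192.0.2.0/24 documentation prefix are assumptions chosen for illustration; none of them come from a real database.

```nginx
geo $region {
    default        NA;

    # Unaggregated: two adjacent networks that carry the same value.
    # 192.0.2.0/25   EU;
    # 192.0.2.128/25 EU;

    # Aggregated: one entry covering exactly the same addresses.
    192.0.2.0/24   EU;
}
```

Both forms should map the same addresses to the same value. The aggregated one mainly shrinks the config file and the load time, which fits the remark above that lookup speed will hardly change.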
on 16.08.2008 07:06
On Sat, Aug 16, 2008 at 07:22:20AM +0400, Maxim Dounin wrote:

> >This country
> >database is quite huge and if this process happens on every hit or even
> >on only a selected entry page then it could be very slow. Does anyone
> >here have experience with this?
>
> The only inconvenience of using really large geobases is config
> reading time. Mine currently takes about 30 seconds to load - but
> that's for more than 30 MB of data, and not only countries.

If you have many unique values per network, then this long load time
is caused by searching for duplicates of data in an array. Otherwise, it
may be caused by insertions into the radix tree.
on 16.08.2008 07:56
Hello!

On Sat, Aug 16, 2008 at 08:58:03AM +0400, Igor Sysoev wrote:

>If you have many unique values per network, then this long load time
>is caused by searching for duplicates of data in an array. Otherwise, it
>may be caused by insertions into the radix tree.

Yes, I've read the code, and in my case it looks like the unique values
search. One day I'll probably try to implement an rbtree there, but
currently it doesn't bug me too much. :)

Maxim Dounin
on 16.08.2008 10:10
Thanks Maxim. Sounds cool, fast but perhaps a bit of a memory hog. For
loading time I would think the way to improve that is to compile a
binary representation on disk that can be loaded as a "pre-made tree"
into memory so that no insertion scan need be done. Or pre-sort the data
to insert with minimum searches.

Anyway, I may write a small script to see if I can amalgamate countries
into big blocks as that would help both speed and memory.

I gather that being at the http level config this means it is "always
on". I could see it being useful to be able to put it in a location
specifier so that only certain requests go through the lookup. For
example, I've no need for static images to get country codes but my
index page would be great as I would set a "best choice" value for
serving the user for all further requests in the session. It sounds like
it doesn't use much cpu time but I expect to be serving vast amounts
of small thumbnails so reducing cycles on that is always a good thing.
(25 thumbs/page/user ad nauseam photo app).

Cheers, for excellent info.
Chris :)
on 16.08.2008 11:20
On Sat, Aug 16, 2008 at 03:05:56PM +0700, Chris Savery wrote:

> Thanks Maxim. Sounds cool, fast but perhaps a bit of a memory hog. For
> loading time I would think the way to improve that is to compile a
> binary representation on disk that can be loaded as a "pre-made tree"
> into memory so that no insertion scan need be done. Or pre-sort the data
> to insert with minimum searches.
>
> Anyway, I may write a small script to see if I can amalgamate countries
> into big blocks as that would help both speed and memory.

We at Rambler use a geo base with countries and Russian regions:

>wc geo.conf
  141240  282480 2979471 geo.conf

Your base will probably be even smaller (as Russia will be one country).

> I gather that being at the http level config this means it is "always
> on". I could see it being useful to be able to put it in a location
> specifier so that only certain requests go through the lookup. For
> example, I've no need for static images to get country codes but my
> index page would be great as I would set a "best choice" value for
> serving the user for all further requests in the session. It sounds like
> it doesn't use much cpu time but I expect to be serving vast amounts
> of small thumbnails so reducing cycles on that is always a good thing.
> (25 thumbs/page/user ad nauseam photo app).

All nginx variables are evaluated on demand only, therefore geo variables
are looked up only if they are really used in a request.
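A minimal sketch of what on-demand evaluation means in practice. It assumes a geo variable named $geo, a converted database saved as geo.conf, a /thumbs/ location for static images, and a FastCGI backend on 127.0.0.1:9000. All of these names and paths are placeholders, not details from the thread.

```nginx
http {
    # Defined once at the http level; defining it costs load time
    # and memory, but no per-request work by itself.
    geo $geo {
        default NA;
        include geo.conf;               # placeholder file name
    }

    server {
        listen 80;
        root   /var/www;                # placeholder path

        # Nothing in this location references $geo, so thumbnail
        # requests never trigger a lookup.
        location /thumbs/ {
        }

        # $geo is referenced only here, so the lookup runs only for
        # requests that reach this location.
        location = /index.php {
            include       fastcgi_params;
            fastcgi_param COUNTRY $geo;
            fastcgi_pass  127.0.0.1:9000;   # placeholder backend
        }
    }
}
```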
on 16.08.2008 12:33
>All nginx variables are evaluated on demand only, therefore geo variables
>are looked up only if they are really used in a request.

Ok. Excellent, so if I only include the fastcgi_param line for one
location, say for index.php, then it would only evaluate under that
condition to pass through to PHP, like this:

fastcgi_param COUNTRY $geo;

Which is easy then...

Thanks very much,
Chris :)
on 16.08.2008 13:44
On Sat, Aug 16, 2008 at 05:27:47PM +0700, Chris Savery wrote:

> >All nginx variables are evaluated on demand only, therefore geo variables
> >are looked up only if they are really used in a request.
>
> Ok. Excellent, so if I only include the fastcgi_param line for one
> location, say for index.php, then it would only evaluate under that
> condition to pass through to PHP, like this:
>
> fastcgi_param COUNTRY $geo;
>
> Which is easy then...

Yes. Actually, even if you set fastcgi_param at the http level, it will
be inherited by all locations (unless overridden), but it will be
evaluated only when the fastcgi_pass directive starts to work.
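The http-level variant described above could be sketched roughly as follows. The file name geo.conf, the /thumbs/ location, the root path, and the backend address are assumptions for illustration only.

```nginx
http {
    geo $geo {
        default NA;
        include geo.conf;               # placeholder file name
    }

    # Set once at the http level and inherited by the locations below.
    include       fastcgi_params;
    fastcgi_param COUNTRY $geo;

    server {
        listen 80;
        root   /var/www;                # placeholder path

        # No fastcgi_pass here, so the inherited parameter is never
        # expanded and $geo is not looked up for these requests.
        location /thumbs/ {
        }

        # fastcgi_pass builds the FastCGI request, which is the point
        # where COUNTRY is filled in and the geo lookup happens.
        location ~ \.php$ {
            fastcgi_pass 127.0.0.1:9000;    # placeholder backend
        }
    }
}
```

One caveat from the nginx documentation: a location that declares any fastcgi_param of its own stops inheriting the ones from the outer level, so COUNTRY would have to be repeated there.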
on 16.08.2008 16:18
I wrote a quick php script to amalgamate the ip ranges into larger
regions than countries. It takes large groups of countries and breaks
them into user defined groups (for me NA, EU, AS). Doing this drops the
line count from 104K to about 33K and after run through the perl script
the conf file is 1.5MB instead of over 3MB. So that's not a bad savings.
I checked a lot of the regions manually to be sure it was working so I
think it's ok.
I'll post the code here just in case anyone else can use it. Sorry it's
not perl - I learned it a decade ago but never use it so didn't want to
brush up. This works. I just want to set the correct image server for a
visitor so they get faster photos.
I guess the best thing would be to do a set of GETs from the client to
each server on demand and then choose the image server with the best
times - then it adapts in real time. Didn't think of that until now...
Chris :)
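<?php
// Combine regions in GeoIP Database.
// Reads GeoIPCountryWhois.csv and prints the same CSV layout to stdout,
// with each run of consecutive rows in the same region merged into one
// row. Gaps between rows are not checked, so a merged row also covers
// any unlisted addresses between the ranges it absorbed.

// User-defined groups of country codes; the first group listing a code wins.
$regions = array(
    'NA' => 'US CA MX PR VI BM BO BS DM AR BZ BR CL PN AD AI AG AW AT BB BA BG KY CO '.
            'CR CU DM EC SV GQ GP GT HT HN JM NR NI PY PE PL RU RO TT TC ',
    'EU' => 'EU GB DE FR IT ES SE IR NL BE IE IL CH AL AM BY HR CY CZ DK EE FI GE GI '.
            'GR GL GG HU IS LB LY LI LT LU MC ME MS NO PT RS SK SI TR UA VA '.
            'ZA GA EG NA NG ZW BJ GH CG MW UG SC TZ TM KE RW TZ SO SR SY TM AE UZ AF DZ AO AZ BI '.
            'CV CF TD IQ JE LV MR MQ MU MN SA SL CI NE LS SZ MG SL AO BF MU TG LY SN SD RE CV GQ '.
            'ZM BW CD TN BJ TG BT BW DJ ER ET JO KZ KW KG LB OM QA ',
    'AS' => 'JP IN AU NZ TH CN HK MY PK KR HK SG BD ID TW PH LK VN AP AS AQ TO KH '.
            'CK FJ GN LA MO MM NP NC PG PN WS ST'
);
$other = 'NA'; // region for any country code not listed above

// Build a code => region map once instead of scanning strings per row.
$region_of = array();
foreach ($regions as $name => $codes)
    foreach (preg_split('/\s+/', trim($codes)) as $code)
        if (!isset($region_of[$code]))
            $region_of[$code] = $name;

$geo = fopen('GeoIPCountryWhois.csv', 'r');
if ($geo === false)
    die("cannot open GeoIPCountryWhois.csv\n");

$r = $w = 0;
$last = fgetcsv($geo);
if ($last === false)
    die("GeoIPCountryWhois.csv is empty\n");
$last_region = region($last);
while (($line = fgetcsv($geo)) !== false)
{
    $r++;
    if (($region = region($line)) != $last_region)
    {
        // Region changed: write out the pending row and start a new one.
        emit($last, $last_region);
        $last = $line;
        $last_region = $region;
        $w++;
    }
    else
    {
        // Same region: extend the pending row to this row's end address.
        $last[1] = $line[1];
        $last[3] = $line[3];
    }
}
// Write out the final pending row, which the loop itself never prints.
emit($last, $last_region);
$w++;
fclose($geo);
//print "$r => $w\n";

// Return the region for a CSV row (column 4 holds the country code).
function region($row)
{
    global $region_of, $other;
    return isset($row[4], $region_of[$row[4]]) ? $region_of[$row[4]] : $other;
}

// Print one row in the quoted CSV layout of the input file, with the
// region in the country-code column and '-' as the country name.
function emit($row, $region)
{
    print '"'.join('","', array($row[0], $row[1], $row[2], $row[3], $region, '-'))."\"\n";
}
?>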