Thynix
reviewed
Nov 30, 2022
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipException;
These are needed despite no other changes to this file?
These are needed despite no other changes to this file?
It appears Arne forgot to include the other changes to IPConverter, which the description explains as if they were already done. I figure he intends to come back and add the remaining changes later.
@ArneBab Are these changes incomplete? Is this a draft?
38cd230 to 89fdffe
a) The DB file size is bigger. The current IpToCountry.dat is 1.2 MiB, the Tor DB is 4 MiB, and the optimized Tor DB is 2 MiB. "Optimized" means I used 'sed' to remove an unnecessary column from the DB. If you really want to go for size, you can zip it to ~700 KiB. This increases runtime a bit, but it is still ~15x faster than the old method.
b) Tor uses "??" instead of "ZZ" for unknown codes, and still uses "CS", which stands for Serbia and Montenegro, a country that stopped existing in 2006. Maybe someone could ask them why their DB uses "CS"... This is easily solved by replacing both with 'sed'.
c) The new DB is sorted in ascending order, so the binary search function would have to be changed (right now I simply reverse the array). Changing it would save another ~6 ms, but I don't know how to do this.

human-readable text file in a zip.

Advantages:
- It's human readable
- It's easy to update because we can use Tor geoip
- It's a lot faster than the base85 approach
- It has a smaller file size

==== New zip file ====
The code no longer uses the IpToCountry.dat file and instead uses a zip file called IpToCountry.zip. This zip file is expected to contain exactly one text file in the Tor geoip format according to the spec below. The zip file should be compressed to save space (~2 MiB uncompressed -> 0.7 MiB compressed).

==== New IpToCountry.txt file ====
Format for each line: <fromIP>,<ISO 3166-1 alpha-2 country code>
Example: 16781312,JP
This is like the old format, but not base85 encoded.
Empty lines are allowed.
Comments may start with any symbol other than a number.
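The file format described above could be read roughly as follows. This is a minimal sketch, not the code from the pull request: the class `GeoIpTextReader`, its nested `Entry` type, and the methods `parseLine` and `readZip` are hypothetical names chosen for illustration, and the sketch assumes the text file is the first entry in the zip and is UTF-8 (or plain ASCII) encoded.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical helper for illustration; not the actual IPConverter code. */
final class GeoIpTextReader {

    /** One parsed line: the range start as an unsigned 32-bit value, plus the country code. */
    static final class Entry {
        final long fromIp;
        final String country;

        Entry(long fromIp, String country) {
            this.fromIp = fromIp;
            this.country = country;
        }
    }

    /** Returns null for empty lines, comment lines, and lines that cannot be parsed. */
    static Entry parseLine(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return null; // empty lines are allowed
        }
        char first = trimmed.charAt(0);
        if (first < '0' || first > '9') {
            return null; // anything not starting with a number is a comment
        }
        int comma = trimmed.indexOf(',');
        if (comma < 0) {
            return null;
        }
        try {
            long fromIp = Long.parseLong(trimmed.substring(0, comma));
            return new Entry(fromIp, trimmed.substring(comma + 1).trim());
        } catch (NumberFormatException e) {
            return null;
        }
    }

    /** Reads the first entry of a zip stream and parses it line by line. */
    static List<Entry> readZip(InputStream in) throws IOException {
        List<Entry> entries = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry first = zip.getNextEntry();
            if (first == null) {
                throw new IOException("zip file contains no entry");
            }
            // Reading from the ZipInputStream stops at the end of the current entry.
            BufferedReader reader =
                    new BufferedReader(new InputStreamReader(zip, StandardCharsets.UTF_8));
            String line;
            while ((line = reader.readLine()) != null) {
                Entry entry = parseLine(line);
                if (entry != null) {
                    entries.add(entry);
                }
            }
        }
        return entries;
    }
}
```

Keeping `fromIp` as a `long` at this stage avoids sign problems for addresses above 2^31; narrowing to `int` can then happen in one place when the arrays are built.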
----------------------------------
Get the raw .txt file here: https://github.com/torproject/tor/raw/main/src/config/geoip

The file has to be processed with the following three 'sed' commands:

  sed -E -i 's/([0-9]*),[0-9]*,([A-Z]*)/\1,\2/g' IpToCountry.txt &&
  sed -E -i 's/,\?\?/,ZZ/g' IpToCountry.txt &&
  sed -E -i 's/,CS/,RS/g' IpToCountry.txt

1) Remove the toIP column, because the Tor geoip format is: fromIP,toIP,countryCode. Freenet does not need the toIP value; the binary search algorithm will take care of this.
2) Replace '??' with 'ZZ' for unknown countries, because '??' is not in the ISO 3166 standard.
3) Replace 'CS' with 'RS' because the country 'CS' is not in the ISO 3166 standard.

Zip this text file into IpToCountry.zip and place it in the main Freenet folder.

==== Code changes ====
The base85 code is left in the source, as is the file reader for the old format.

- src/freenet/clients/http/geoip/IPConverter.java
-- Zip reader to save space.
-- The ArrayList is allocated with 180000 slots so that it does not resize as often (this does not matter for speed anyway).
-- Ignore empty lines and lines that start with anything but a number (comments).
-- Cast the Long value to (int), exactly like the old code did.
-- Get country, identical to old code.
-- Reverse the List, because the binary search expects the list to be in descending order. Takes <10 ms.
-- Convert the List<Integer/Short> to int[]/short[] to save lots of memory. See below for explanation. Takes <10 ms.
-- Catch all possible errors.
-- I did not feel confident in messing with the binary search because I might overlook some edge case where indexes would no longer match, so I left it alone. Reversing both arrays takes less than 10 ms combined.

- src/freenet/node/NodeFile.java
-- Changed default location from 'IpToCountry.dat' to 'IpToCountry.zip'.
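For reference, a lookup that works directly on the ascending order (so the arrays would not need to be reversed) could look like the sketch below. It is an illustration under assumptions, not the existing binary search in IPConverter: the class `AscendingRangeLookup` and the methods `toIntArray` and `floorIndex` are hypothetical names. It assumes the range starts are stored as `int` values holding unsigned 32-bit addresses in ascending unsigned order, which is why it compares with `Integer.compareUnsigned`.

```java
import java.util.List;

/** Hypothetical sketch; names are illustrative and not taken from IPConverter. */
final class AscendingRangeLookup {

    /** Unboxes a List<Integer> into an int[] to avoid per-element object overhead. */
    static int[] toIntArray(List<Integer> values) {
        int[] out = new int[values.size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = values.get(i);
        }
        return out;
    }

    /**
     * Returns the index of the last range start that is less than or equal to ip,
     * or -1 if ip lies below the first range start. Values are compared as
     * unsigned 32-bit integers.
     */
    static int floorIndex(int[] starts, int ip) {
        int lo = 0;
        int hi = starts.length - 1;
        int result = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (Integer.compareUnsigned(starts[mid], ip) <= 0) {
                result = mid;   // candidate found, look for a later one
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return result;
    }
}
```

The returned index would then select the country code from a parallel `short[]` built in the same order. Whether this matches every edge case of the current descending search would still have to be checked against the existing tests.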
Memory from heap dump according to VisualVM:
List<Integer> vs int[]: 3.3 MiB vs 660 KiB
List<Short> vs short[]: 2.0 MiB vs 330 KiB

==== Further changes (aka 'more stuff to do for Arne' :) ) ====
https://github.com/freenet/scripts#releasing-stable-freenet-builds
The FAQ link has to be removed as the old IP DB site is no longer used.

/scripts/setup-release-environment
Has to be adjusted. Arne, how did this work over the past few years, given that the website has been offline for a while?

The new zip file has to be added to the insert/release script.
89fdffe to 8465bc5