(Normally I'd just skip threads like this and the last one, but I'm trying to post here more so... yeah ... Worked on it a bit yesterday and today. Put in a few missing thingies and am currently debating whether to use a 3D perspective or orthographic camera for map areas/everything else.. I'm on the fence, but for now it is a 2D camera. Anyway, back on topic...)
As you may or may not know, I pretty much hate the C++ standard template library--I think it's bloated junk. So I just "roll my own" *whatever* whenever I need something. Mine are faster, I can change anything about them whenever I feel like it, and I know exactly what they do under the hood, so I see it as a win-win. It also keeps compile times down and exe sizes smaller, yada yada. Point is, I'm completely the anti-pattern guy and definitely in the minority on this issue. Truth be told, I don't actually even use C++11 or C++14. I think the language is getting too ugly and bloated, and there are no features in it I'd really even use (except maybe typed enums or usings). It's like the hero Gotham wanted, but not the one it needs right now. (shakes fist at clouds)
So I roll my own stuff occasionally, and I finally got around to putting in a hash table. Before, I was kind of just using sorted arrays and faking it with binary searches.
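That fake-out is dead simple; the lookup half is just a lower-bound binary search. A rough sketch (not my actual code, just the idea):

Code:
#include <cstddef>

struct Pair { unsigned key; int value; };

// Lower-bound binary search over an array kept sorted by key.
// Returns a pointer to the value for 'key', or 0 if it's absent.
int* SortedFind(Pair* pairs, std::size_t count, unsigned key)
{
    std::size_t lo = 0, hi = count;
    while (lo < hi)
    {
        std::size_t mid = lo + (hi - lo) / 2;
        if (pairs[mid].key < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo < count && pairs[lo].key == key)
        return &pairs[lo].value;
    return 0;
}

Lookups are O(log n) and the memory stays tightly packed; the catch is that every insert has to shift elements around to keep the array sorted.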
For other projects, when I needed a hash map I used rde::hash_map: http://msinilo.pl/blog/?p=675
Google hash maps are also pretty good: http://code.google.com/p/sparsehash/
Truth is that making your own hash map is kind of stupid, since there are tons of good ones available, and it's easy to screw up a minor detail here or there, giving yourself a real headache to debug down the road. Yes, I am definitely a dumb-ass sometimes. heh. ...But! Just using a google thingy is no fun now, is it?!
The use case is this: I wanted open addressing (i.e., one flat array of slots) with linear probing (so the entries probed on a lookup preferably sit in a single cache line), with inserts and lookups as fast as possible. For removals I envision two things: 1) erases are not speed-critical, and 2) depending on use, erasures may be skipped entirely and the whole table just cleared instead. One catch with open hash tables is that performance can rapidly deteriorate because deleted entries have to be marked 'REMOVED' (tombstoned) to handle collisions properly; probes can't stop at a tombstone, so eventually lookups can degrade into plain linear array scans unless the table is rehashed. Another problem is that when the load factor of a hash map gets too high (usually past 70% or so), performance also degrades because of key clustering, so you have to over-allocate beyond the memory you actually need. This is a real killer, and may be one reason why most hash maps, the STL's included, use a chaining/linked-list approach instead, which gives shitty performance and shitty memory access patterns... And that's why I always just put it off for a rainy day.
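For the unfamiliar, the lookup side of a tombstoning open-addressed table looks roughly like this (a minimal sketch to illustrate the problem, not my actual HashTable source):

Code:
// Open addressing with linear probing. Capacity is a power of two,
// so 'hash & mask' replaces the modulo. Assumes at least one empty
// slot exists so a failed lookup terminates.
enum SlotState { SLOT_EMPTY, SLOT_USED, SLOT_REMOVED };
struct Slot { SlotState state; unsigned key; int value; };

int* Find(Slot* slots, unsigned capacity, unsigned hash, unsigned key)
{
    const unsigned mask = capacity - 1;
    for (unsigned i = hash & mask; ; i = (i + 1) & mask)
    {
        Slot& s = slots[i];
        if (s.state == SLOT_EMPTY)
            return 0; // a truly empty slot ends the probe: not found
        if (s.state == SLOT_USED && s.key == key)
            return &s.value;
        // SLOT_REMOVED: the probe has to walk straight through the
        // tombstone, which is exactly why tombstones pile up and drag
        // lookups toward a full linear scan.
    }
}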
However, this week I was thinking about using a hash map for something, and then it just hit me: you don't need to mark nodes as deleted if you're just storing POD in them; you can just 'uncollide' them! When removing an entry you can look ahead until you find an unused slot and fix hash collisions along the way, eliminating the need to handle special cases when looking up data. So I called it reverse colliding removals. I had just invented a better hashtable! yayzors. ...Or so I thought. Turns out the Rust programming language's library already used that concept; it's called 'backward shift deletion'. Of course I didn't know that until after I had written it and decided to look it up. Hahaha..
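The removal ends up looking roughly like this (again just a sketch, same Slot layout as above; HashKey() is a stand-in for whatever hash function the table actually uses):

Code:
// Backward shift removal: instead of tombstoning, walk forward from
// the freed slot and pull displaced entries back into the hole.
// Entries are POD, so plain assignment is enough to 'uncollide' them,
// and no SLOT_REMOVED state is needed at all. Assumes at least one
// empty slot exists so the scan terminates.
enum SlotState { SLOT_EMPTY, SLOT_USED };
struct Slot { SlotState state; unsigned key; int value; };

unsigned HashKey(unsigned key); // stand-in for the table's hash function

void Remove(Slot* slots, unsigned capacity, unsigned index)
{
    const unsigned mask = capacity - 1;
    unsigned hole = index; // the slot being vacated
    unsigned j = index;
    for (;;)
    {
        j = (j + 1) & mask;
        if (slots[j].state == SLOT_EMPTY)
            break; // end of the probe run: nothing left to shift
        unsigned home = HashKey(slots[j].key) & mask;
        // The entry at j may move back only if the hole lies on its
        // probe path, i.e. its home bucket is NOT cyclically in (hole, j].
        bool homeInBetween = (hole <= j) ? (hole < home && home <= j)
                                         : (home > hole || home <= j);
        if (!homeInBetween)
        {
            slots[hole] = slots[j]; // shift it back into the hole
            hole = j;               // the hole moves forward
        }
    }
    slots[hole].state = SLOT_EMPTY; // clear whichever hole remains
}

The subtle part is that cyclic 'is the home bucket between the hole and j' test; botch it and entries silently become unreachable, which is exactly the kind of minor detail I was talking about screwing up.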
It was still a fun exercise though, so whatever. And on the plus side it has very good performance; so much so that it's probably a good baseline to compare serious implementations against.
I also made a unit test and benchmark. You guys wanna see it? Here it goes:
ea is the eastl implementation: http://www.open-std.org/jtc1/sc22/wg...007/n2271.html
[The array baseline is the cost of reading/writing a simple heap array. Hashes are generated with an FNV-1 hash; all maps use identical keys. Time is measured in cycles; lower is better. All optimizations are enabled.]
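In case you're curious, the FNV-1 in question is about as simple as hashes get. A sketch of the 32-bit variant, using the published FNV offset basis and prime (not necessarily byte-for-byte what my benchmark uses):

Code:
#include <cstddef>

// 32-bit FNV-1: multiply by the prime, then XOR in each byte.
// (FNV-1a is the same thing with the two steps swapped.)
// Assumes 'unsigned' is 32 bits.
unsigned Fnv1(const void* data, std::size_t length)
{
    const unsigned char* bytes = static_cast<const unsigned char*>(data);
    unsigned hash = 2166136261u; // FNV offset basis
    for (std::size_t i = 0; i < length; ++i)
        hash = (hash * 16777619u) ^ bytes[i]; // 16777619 = FNV prime
    return hash;
}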
First thing worth mentioning is that std::map<string> is just pure crap. If you don't actually need a self-balancing red-black tree that holds strings, you should use something else in 100% of all cases. End of discussion. Also, the std::*maps in general are just awful; eastl is simply better.

Code:
--- Speed Test ---
--- Iteration Size = 256 ---
--- HashTable Load Factor: 50 percent ---

raw array baseline:        936

HashTable insert:        11730
ea::hash_map insert:    215982
std::hash_map insert:   240779
std::map insert:        221015
std::map<str> insert:   648960
rde::hash_map insert:    24358

HashTable find:           8054
ea::hash_map find:       21791
std::hash_map find:      59153
std::map find:           32327
std::map<str> find:     310925
rde::hash_map find:      25475

HashTable remove:        25733
ea::hash_map remove:     26042
std::hash_map remove:    46009
std::map remove:        102566
std::map<str> remove:   174009
rde::hash_map remove:     9122

HashTable find:           9262
ea::hash_map find:       25003
std::hash_map find:      62802
std::map find:           48260
std::map<str> find:     238944
rde::hash_map find:      22395

--- Speed Test ---
--- Iteration Size = 384 ---
--- HashTable Load Factor: 75 percent ---

raw array baseline:       5460

HashTable insert:        22976
ea::hash_map insert:    149626
std::hash_map insert:   333290
std::map insert:        373634
std::map<str> insert:  1070055
rde::hash_map insert:    69063

HashTable find:          18781
ea::hash_map find:       40914
std::hash_map find:      81796
std::map find:           73832
std::map<str> find:     450250
rde::hash_map find:      38362

HashTable remove:        17609
ea::hash_map remove:     40812
std::hash_map remove:    64309
std::map remove:         88554
std::map<str> remove:   288500
rde::hash_map remove:    17545

HashTable find:          16240
ea::hash_map find:       41548
std::hash_map find:      55731
std::map find:           52404
std::map<str> find:     413939
rde::hash_map find:      28567

--- Speed Test ---
--- Iteration Size = 512 ---
--- HashTable Load Factor: 100 percent ---

raw array baseline:       7902

HashTable insert:        51780
ea::hash_map insert:    214068
std::hash_map insert:   428346
std::map insert:        389130
std::map<str> insert:  1315013
rde::hash_map insert:    70932

HashTable find:          48481
ea::hash_map find:       43351
std::hash_map find:      98816
std::map find:           97720
std::map<str> find:     587472
rde::hash_map find:      54804

HashTable remove:       108703
ea::hash_map remove:     50633
std::hash_map remove:   104170
std::map remove:        112113
std::map<str> remove:   273662
rde::hash_map remove:     7499

HashTable find:          32895
ea::hash_map find:       43203
std::hash_map find:      97939
std::map find:           96576
std::map<str> find:     523633
rde::hash_map find:     362372
At the end the numbers get a little squirrelly. The order in which you insert keys also affects the clustering with open addressing, and if I vary the map sizes at just under 98-99% load, the results for HashTable and rde::hash_map shift around, though the overall picture stays roughly the same. The main point is that with any decent hash the backward shifting does fix the worst-case problem under heavy load and (generally) nets you fast access times.
So to summarize: I didn't actually accomplish anything useful today.
[edit]
source code: http://code.google.com/p/catastrophe...rs/HashTable.h