[Week 3] The Hidden Challenge

This week was a doozy! Implementing the Search algorithm took more time than I'd expected.

Summary

  • Added tests for distribute
  • Implemented the Search algorithm
  • Added tests for locate and lookup

Details

The first insight I had when I started working this week was that I needed a stronger Location model, one that could adapt to both local and remote locations. Since that meant leaving NamedTuple behind, I had to override the __eq__ and __hash__ methods to keep Location objects hashable, and add a deserializer.
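
For the curious, the shape is roughly this. This is a minimal sketch: the real class has more fields, and from_dict is a stand-in name for the deserializer, not necessarily the real one:

    from typing import Any, Dict

    class Location:
        """A content location, local or remote (fields simplified for illustration)."""

        def __init__(self, server: str, path: str) -> None:
            self.server = server  # hypothetical field, e.g. empty for local locations
            self.path = path

        def __eq__(self, other: object) -> bool:
            if not isinstance(other, Location):
                return NotImplemented
            return (self.server, self.path) == (other.server, other.path)

        def __hash__(self) -> int:
            # hash on the same fields used for equality, so Location
            # objects can live in sets and dict keys
            return hash((self.server, self.path))

        @classmethod
        def from_dict(cls, data: Dict[str, Any]) -> "Location":
            # the deserializer mentioned above; real field names may differ
            return cls(server=data["server"], path=data["path"])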

Once that was done, I planned the structure of the Search code. I knew that I was going to delegate the "fetching the files" portion of the algorithm to the client using the library, so the process had to be broken into three steps:

  1. Lookup: Find relevant locations in all lookup tables in the scope of this search.
  2. Fetch: Get relevant files from content repositories.
  3. Locate: Collect and shortlist event ids within fetched files.

Steps 1 and 3 became methods of the EncryptedSearch class. Additionally, as step 0, the EncryptedSearch object needs to be initialized with all available/necessary lookup tables.
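
In outline, that looks something like this (a sketch only: the signatures and types here are illustrative, not the real API):

    from typing import Iterable, List, Set

    class EncryptedSearch:
        def __init__(self, lookup_tables: Iterable[dict]) -> None:
            # Step 0: receive every lookup table in the scope of this search.
            self.lookup_tables = list(lookup_tables)

        def lookup(self, query: str) -> Set[str]:
            # Step 1: interpret the query against the lookup tables and return
            # the relevant content repository references, which the client
            # then fetches itself (step 2).
            ...

        def locate(self, query: str, fetched_files: List[bytes]) -> Set[str]:
            # Step 3: collect and shortlist event ids from the files the
            # client fetched in step 2.
            ...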

Slight detour: By supporting multiple indices in the same search object, the intention is to give the client the ability to search all the indices in a room (or even all indices in all the rooms) at the same time.

Anyway, because of the optimizations made during week two (Implementation of the Search algorithm), the locate and lookup methods themselves are pretty simple. And to keep things simple for the user, the methods deal in explicit Matrix Content Repositories, exposing no other implementation details, including the interpreted keywords and the locations looked up.

Writing tests for these methods, though, was a pain in the butt. Honestly, I think the temporary code I wrote to produce test cases is longer than the actual test cases and the main library code combined!

Now I must confess, I did lie about the search methods being "pretty simple". They're theoretically simple. They aren't "pretty simple" when dealing with phrase search. The search query needs to be parsed into normalized tokens. (This is not the difficult part: I already have a utility method for this.)

The problem is subtokens.

See, when the indexing methods encounter a word like "matrix.org" in the document content, they store two keywords: "matrix" and "matrix.org". This avoids splitting significant content into subtokens that are meaningless by themselves. For example, if we index "alice.0@example.co.us", we'll store "alice", "example", "co" and "alice.0@example.co.us" (dropping the stopword "us"). If we didn't store the main token, we wouldn't be able to search for the email at all.
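
As a rough illustration (the stopword list and the splitting rule below are stand-ins for whatever the real indexer does):

    import re

    STOPWORDS = {"us"}  # stand-in for the real stopword list

    def keywords(token: str) -> set:
        """Return the main token plus its meaningful subtokens."""
        parts = [p.lower() for p in re.split(r"[^a-zA-Z0-9]+", token) if p]
        # Drop stopwords and single-character fragments (like the "0" here);
        # as the PS below admits, the real parser doesn't filter the latter yet.
        subtokens = [p for p in parts if p not in STOPWORDS and len(p) > 1]
        return {token.lower(), *subtokens}

    print(keywords("alice.0@example.co.us"))
    # {'alice.0@example.co.us', 'alice', 'example', 'co'}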

Anyway, while searching for phrases we need to ensure the following (see the sketch after this list):

  • Find each main token's set of documents.
  • For the subtokens of each main token:
    • Create a temp set, seeded with the first subtoken's documents.
    • For each remaining subtoken:
      • Find the set of documents that contain the subtoken.
      • Intersect it with the temp set.
    • Union the temp set into the main token's results.
  • Intersect all the main token results.
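
Here's roughly what that boils down to. This is a toy sketch: the real implementation works over encrypted lookup tables, not a plain dict:

    def docs_for_phrase(index: dict, main_tokens: list) -> set:
        """index maps a token to the set of documents containing it;
        each main token is paired with the list of its subtokens."""
        result = None
        for main, subtokens in main_tokens:
            docs = set(index.get(main, set()))  # docs with the main token itself
            if subtokens:
                # temp set: docs that contain *all* of the subtokens
                temp = set(index.get(subtokens[0], set()))
                for sub in subtokens[1:]:
                    temp &= index.get(sub, set())
                docs |= temp  # treat those docs as containing the main token
            result = docs if result is None else result & docs
        return result or set()

    # "cheese-pizza bazooka": document 1 contains "cheese", "pizza" and
    # "bazooka", so it matches even though "cheese-pizza" never appears.
    index = {"cheese": {1}, "pizza": {1, 2}, "bazooka": {1, 3}}
    print(docs_for_phrase(index, [("cheese-pizza", ["cheese", "pizza"]),
                                  ("bazooka", [])]))
    # {1}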

Complicated? Yeah.

Put simply, when searching for "cheese-pizza bazooka", documents that contain "cheese and some pizza" need to be treated as containing "cheese-pizza". If the document content is "I shot some cheese and some pizza out of a bazooka", it needs to show up as a result of the search.

So the important takeaway is that we need to distinguish between main and subtokens at each step, including normalization. It doesn't change a lot, but it changes enough. As I type this post, I'm finalizing the patch commit that fixes this. Hopefully it'll be done in an hour.

PS: Weird discovery: the parser is letting empty strings and single-letter strings through. I've fixed it for empty strings, but single-letter strings give me pause. They occur rarely and may carry some importance. But maybe they need to be removed from subtokens, because storing the "e" in "i.e." is wasteful, to say the least.
