Searching, Normally
Before we can build a Symmetric Searchable Encryption scheme, we need to understand how searching works in a case where privacy isn't a concern.
Indexing
An index is, in our case, a mapping from a document id to that document's content. In the case of Matrix message events, the documents are non-redacted messages and their ids are the event ids.
So to create an index, we first obtain a list
from the JSON
file exported from a room.
Next, we iterate over the events and store the valid ones in a
dict
.
And we're done with indexing! If we wanted to, we could search by iterating
over each document in messages
and looking for the query in
that message. The time complexity for such a searching operation would be
O(<number of messages>
× <average length
of message>)
. That's terrible.
We can do much, much better. How? Well, by—
Inverting
A regular index consists of a mapping from document id to document content. An inverted index consists of a mapping from a token to a list of document ids in which it was found.
Tokens
A token is an atomic part of a document's content that can be searched for. For English and similar languages, this is a regular word that carries meaning. To obtain tokens from a documents string, we do the following:
-
Remove all punctuation from the string
- Convert the string to lowercase
- Split the string into a list of words
- Remove any stopwords from the list
Once we've tokenized all the documents' contents, we extract all the unique keywords from them.
Armed with tokenized documents and a set of keywords, we can finally getting to inverting. The algorithm is simple: For each keyword, we find the list of documents containing that keyword and store their ids against the keyword itself.
Now what have we achieved? Well, the procedure to search over an inverted index is to a simple key lookup in the index. For a hash table, this takes O(1)
time!
Conclusion
By doing a few extra operations while setting up our index, we've drastically reduced the time required for a search query to execute. But this system is not great for cases where privacy is valued. More on that in the next post about Leakages.
Further
If, instead of just storing the document ids, we store the document ids and the position where the keyword occurs, we can extend this system to phrase search. By looking for documents where two keywords occur in close proximity, we can let user search for phrases. However, I will not go into details because phrase search via proximity is even less privacy-friendly, and the SSE scheme that will be implemented in this project will not be using that.
Ciao for now!
Comments
Post a Comment