Designing Sando's ranking algorithm is equal parts art and science. While Sando relies on well-known, capable information retrieval algorithms and uses a solid tool (Lucene) as its index, there are a number of places where Sando makes educated guesses.
For instance, how much should matches inside program element names (e.g., class names, method names) be weighted relative to matches in program element bodies? Currently, Sando gives a 4x boost to names, but we have no way of knowing whether this is right for most users.
So, in an effort to make Sando better, we need to collect some user data to get a sense of how Sando is doing and how it could be improved.
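To make the name-vs-body trade-off concrete, here is a minimal sketch of how such a boost could be applied when combining per-field scores. The 4x constant mirrors the boost mentioned above; the function name and the idea that the retrieval engine hands us separate name and body scores are assumptions for illustration, not Sando's actual code.

```python
# NAME_BOOST mirrors the 4x name weight described above; everything else
# (score sources, normalization) is an assumption for illustration only.
NAME_BOOST = 4.0

def combined_score(name_score: float, body_score: float) -> float:
    """Weight a match in a program element's name 4x more heavily
    than a match in its body."""
    return NAME_BOOST * name_score + body_score
```

Whether 4.0 is the right constant is exactly the kind of question the data below is meant to answer.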
Corpus (Indexing) Data
- corpus size
- time to complete indexing
- languages used
Query Data
- user id
- the ordered number of the query
- the type of query (multi word, camel cased single word, plain single word, quoted multi word)
- retrieval timestamp
- the number of Sando results
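The query types listed above can be detected with simple heuristics. The sketch below shows one possible classifier; the function name and the exact rules (e.g., treating any lowercase-then-uppercase transition as camel case) are assumptions, not Sando's implementation.

```python
import re

def classify_query(query: str) -> str:
    """Classify a query into one of the four types listed above
    (hypothetical helper; rules are simplified)."""
    q = query.strip()
    if q.startswith('"') and q.endswith('"') and len(q.split()) > 1:
        return "quoted multi word"
    if len(q.split()) > 1:
        return "multi word"
    # A single word is camel cased if a lowercase letter is
    # immediately followed by an uppercase one (e.g., "openFile").
    if re.search(r"[a-z][A-Z]", q):
        return "camel cased single word"
    return "plain single word"
```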
Implicit Feedback Data (computed on click and when a new query is issued, i.e. when the interaction is complete)
- was anything clicked?
- the rank and type (e.g., method, field, class) of each clicked result
- similarity between the query and the clicked result's name or body (partial match vs. exact match)
- metrics comparing clicked items to higher-ranked items that were not clicked (still an open question)
- # of words in query
- does the query match the code exactly?
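One way to picture an implicit feedback record is as a small per-interaction structure, finalized when the next query arrives. The sketch below is an assumed shape, with hypothetical field and method names; it only covers a few of the fields listed above.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class ImplicitFeedback:
    """One record per completed interaction (illustrative sketch;
    field names are assumptions, not Sando's schema)."""
    query: str
    # Each click records (rank in the result list, program element type).
    clicks: List[Tuple[int, str]] = field(default_factory=list)

    def anything_clicked(self) -> bool:
        """Was anything clicked during this interaction?"""
        return bool(self.clicks)

    def word_count(self) -> int:
        """# of words in the query."""
        return len(self.query.split())
```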
Progression of the query over time, as a set of operations
- add <noun>, add <verb>, delete, modify
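A crude way to derive such operations is to diff the word sets of two consecutive query versions. The sketch below captures only add and delete; detecting "modify" (and tagging added words as nouns or verbs) would need more machinery, so it is deliberately omitted.

```python
from typing import List, Tuple

def query_ops(prev: str, curr: str) -> List[Tuple[str, str]]:
    """Derive add/delete operations between two consecutive query
    versions by comparing word sets (a simplification; 'modify'
    detection and part-of-speech tagging are omitted)."""
    prev_words, curr_words = set(prev.split()), set(curr.split())
    ops: List[Tuple[str, str]] = [("add", w) for w in sorted(curr_words - prev_words)]
    ops += [("delete", w) for w in sorted(prev_words - curr_words)]
    return ops
```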
For results ranked above the one you clicked on:
- co-occurrence of query terms (right next to each other)
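Detecting that two query terms co-occur (appear right next to each other) in a result is straightforward; the sketch below checks adjacent word pairs. The function name and the whitespace tokenization are assumptions for illustration.

```python
def terms_cooccur(text: str, term_a: str, term_b: str) -> bool:
    """True if the two query terms appear right next to each other
    (in either order) in the given text."""
    words = text.lower().split()
    target = {term_a.lower(), term_b.lower()}
    # Compare every adjacent pair of words against the two terms.
    return any({a, b} == target for a, b in zip(words, words[1:]))
```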
None of the data above should make it possible to identify either the user or his or her code base. If you have concerns about this, please contact us and let us know.