Determine what "document" to store for each program element

Jan 21, 2012 at 7:23 PM

For each program element we'll need to store a "document" in Lucene.  What should be in this document?  On the one hand we could include the entire text of the program element.  On the other hand maybe we want to leave out keywords and other noise.  What do you guys think?  Note that SrcML (i.e., the engine behind the Parser) could be used to filter out the keywords pretty easily.

Jan 21, 2012 at 10:12 PM
Edited Jan 22, 2012 at 11:07 AM

Here's what I think. I use "?" for fields that are probably not obligatory but could extend the search capabilities. Let me know which ones we are going to store in the indexed document - maybe I added too many fields here, even among the ones without the question mark:

  1. Class:
    1. Id (Guid)
    2. Name (string)
    3. ? Namespace (string)
    4. ? ExtendedClasses (string - so that - as an option - the user can search across the inheritance tree - i.e. "Animal" search returns also classes "Dog", "Cat" etc.)
    5. ? ImplementedInterfaces (string - like the previous one)
    6. ? AccessLevel (enum)
    7. FileName (string)
    8. DefinitionLineNumber (int)
    9. FullFilePath (string)
  2. Method
    1. Id (Guid)
    2. Name (string)
    3. Arguments (string)
    4. ReturnType (string)
    5. Body (string)
    6. ? AccessLevel (enum)
    7. DefinitionLineNumber (int)
    8. ClassId (Guid - reference so that we can extract any class-level info, like file path)
  3. Property
    1. Id (Guid)
    2. Name (string)
    3. Body (string - for the properties with custom get, set methods)
    4. PropertyType (string)
    5. ? AccessLevel (enum)
    6. DefinitionLineNumber (int)
    7. ClassId (Guid - reference so that we can extract any class-level info, like file path)
  4. Field
    1. Id (Guid)
    2. Name (string)
    3. FieldType (string)
    4. ? AccessLevel (enum)
    5. DefinitionLineNumber (int)
    6. ClassId (Guid - reference so that we can extract any class-level info, like file path)
  5. Enum 
    1. Id (Guid)
    2. Name (string)
    3. Values (string)
    4. ? Namespace (string)
    5. ? AccessLevel (enum)
    6. FileName (string)
    7. DefinitionLineNumber (int)
    8. FullFilePath (string)
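
To make the list above a bit more concrete, here is a minimal sketch of how two of these elements could look as C# classes. The class names (ProgramElement, ClassElement, MethodElement), the constructors and the derived FileName property are assumptions made for illustration only, and the optional "?" fields are left out.

```csharp
using System;
using System.IO;

// Hypothetical base class: fields shared by every element in the list above.
public abstract class ProgramElement
{
    protected ProgramElement(string name, int definitionLineNumber)
    {
        Id = Guid.NewGuid();
        Name = name;
        DefinitionLineNumber = definitionLineNumber;
    }

    public Guid Id { get; private set; }
    public string Name { get; private set; }
    public int DefinitionLineNumber { get; private set; }
}

// Hypothetical class element: owns the file-level information.
public class ClassElement : ProgramElement
{
    public ClassElement(string name, int definitionLineNumber, string fullFilePath)
        : base(name, definitionLineNumber)
    {
        FullFilePath = fullFilePath;
    }

    public string FullFilePath { get; private set; }

    // Assumption: FileName is derived from the path instead of stored twice.
    public string FileName { get { return Path.GetFileName(FullFilePath); } }
}

// Hypothetical method element: reaches file-level info through ClassId.
public class MethodElement : ProgramElement
{
    public MethodElement(string name, int definitionLineNumber, Guid classId,
                         string arguments, string returnType, string body)
        : base(name, definitionLineNumber)
    {
        ClassId = classId;
        Arguments = arguments;
        ReturnType = returnType;
        Body = body;
    }

    public Guid ClassId { get; private set; }
    public string Arguments { get; private set; }
    public string ReturnType { get; private set; }
    public string Body { get; private set; }
}

public static class ElementDemo
{
    public static void Main()
    {
        var dog = new ClassElement("Dog", 3, "src/Animals/Dog.cs");
        var bark = new MethodElement("Bark", 10, dog.Id, "", "void", "");
        Console.WriteLine(bark.ClassId == dog.Id); // True
    }
}
```

Keeping the file path only on the class and pointing to it through ClassId avoids storing the same path on every method, property and field, at the cost of one extra lookup per result.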

Speaking about the keywords and other noise - I think it's a good idea to get rid of all operators, keywords, dots, commas and so on. However, one question comes to my mind: if the method body is a long one, how will we be able to show the appropriate line with the word used in the query? By using the line number of the method definition plus the number of line breaks within the method body before the found element? Or maybe we stop at the method level, highlighting occurrences of the search term but without scrolling to the line within the body? So if we look for the word "eagle" and there is such a variable in the 12th line of the method "FeedBirds", we only show the line where this method starts, and the user has to find the appropriate line on their own (or better - by the highlighted query text).

Waiting for any comments on that

Jan 22, 2012 at 2:13 AM

Thanks for this list!  This is a great long term list and I don't have a problem with any fields you propose.  

I did have one comment.  When searching with Lucene I think you can only search against one "field" in a document.  For instance, in the above list of {Class, Property, Field, etc.} there is not a common field that we can search against.  I think we need to have one field for every type that includes all the relevant words that represent that element.  Let's call this field the "bag of words".  

When you make a query with Lucene like this: 

Query query = new TermQuery(new Term("bagOfWords", "searchTerm"));

You need a field that exists in every program element (i.e., bagOfWords) that you can search against.  

---

Here's a few examples of what I'd consider including in the class and method's bag of words.  Let me know what you think.

For this method:

public void OpenFile(File temporary){
    temporary.Open();
    temporary.Copy();
}

This method would have the "bag of words" field {open, file, file, temporary, temporary, open, temporary, copy}.  We include repeats because that helps the indexer understand which words are more important than others for that element.  

For the class:

public class FileOpener : FileLoader{
    ..
}

I would have the following "bag of words" field {File, Opener, File, Loader + <the bag of words for all methods and fields in the class>}.  
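
To illustrate why the repeats matter, here is a small sketch that counts how often each word occurs in the OpenFile bag from above. The BagOfWords class and its CountTerms method are hypothetical names. Lucene's real scoring uses term frequency together with other factors, so raw counts are only a rough picture of it.

```csharp
using System;
using System.Collections.Generic;

public static class BagOfWords
{
    // Counts the occurrences of each word, ignoring case.
    public static Dictionary<string, int> CountTerms(IEnumerable<string> bag)
    {
        var counts = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
        foreach (var word in bag)
        {
            int current;
            counts.TryGetValue(word, out current); // current stays 0 for a new word
            counts[word] = current + 1;
        }
        return counts;
    }
}

public static class BagOfWordsDemo
{
    public static void Main()
    {
        // The bag of words for OpenFile as listed above.
        var bag = new[] { "open", "file", "file", "temporary",
                          "temporary", "open", "temporary", "copy" };
        var counts = BagOfWords.CountTerms(bag);
        Console.WriteLine(counts["temporary"]); // 3
        Console.WriteLine(counts["copy"]);      // 1
    }
}
```

With the repeats kept, "temporary" occurs three times and "copy" only once, which is the kind of signal an indexer can use to rank this method higher for "temporary" than for "copy".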


Jan 22, 2012 at 2:19 AM

> Speaking about the keywords and other noise - I think it's a good idea to get rid of all operators, keywords, dots, commas and so on. However, one question comes to my mind: if the method body is a long one, how will we be able to show the appropriate line with the word used in the query?

This is a great question.  Here's an idea:

Search for "file" returns a set of documents.  Let's say the first document is the document for the Method Open File.  When a user clicks on that result we can show it in the file editor.  Because you have stored the file path and definition line number we can quickly look up the text of the method and do a local search for the text "file", which we could highlight.  Does this approach address your concern?  I think this is pretty similar (if not the same) to what you proposed...
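
A rough sketch of that local lookup, assuming we already have the method's text and its stored definition line number. HitLocator and FindLines are hypothetical names, and a plain case-insensitive substring match stands in for whatever matching the real search would use.

```csharp
using System;
using System.Collections.Generic;

public static class HitLocator
{
    // Returns the file line numbers of all lines in methodText that contain
    // the term. definitionLineNumber is the file line of the method's first line.
    public static List<int> FindLines(string methodText, int definitionLineNumber, string term)
    {
        var hits = new List<int>();
        string[] lines = methodText.Split('\n');
        for (int i = 0; i < lines.Length; i++)
        {
            if (lines[i].IndexOf(term, StringComparison.OrdinalIgnoreCase) >= 0)
            {
                hits.Add(definitionLineNumber + i); // offset inside the method
            }
        }
        return hits;
    }
}

public static class HitLocatorDemo
{
    public static void Main()
    {
        string method = "public void OpenFile(File temporary){\n" +
                        "    temporary.Open();\n" +
                        "    temporary.Copy();\n" +
                        "}";
        // Suppose the method definition starts on line 40 of the file.
        foreach (int line in HitLocator.FindLines(method, 40, "copy"))
        {
            Console.WriteLine(line); // 42
        }
    }
}
```

Because the line is computed from the original source text at display time, the indexed document can be stripped of keywords and punctuation without losing the ability to scroll to the right line.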

Jan 22, 2012 at 11:07 AM
davidcshepherd wrote:

I did have one comment.  When searching with Lucene I think you can only search against one "field" in a document.  For instance, in the above list of {Class, Property, Field, etc.} there is not a common field that we can search against.  I think we need to have one field for every type that includes all the relevant words that represent that element.  Let's call this field the "bag of words". 

// Open the index and create a searcher over it
IndexReader reader = IndexReader.Open("<lucene dir>");
Searcher searcher = new IndexSearcher(reader);

// One TermQuery per field, combined in a single BooleanQuery
BooleanQuery booleanQuery = new BooleanQuery();
Query query1 = new TermQuery(new Term("filename", "<text>"));
Query query2 = new TermQuery(new Term("filetext", "<text>"));
// SHOULD means a document matches if either field contains the term (OR)
booleanQuery.Add(query1, BooleanClause.Occur.SHOULD);
booleanQuery.Add(query2, BooleanClause.Occur.SHOULD);
// Use BooleanClause.Occur.MUST instead of BooleanClause.Occur.SHOULD
// for AND queries
Hits hits = searcher.Search(booleanQuery);

 

I believe this is the way to create a multiple-field query - I'm not sure if it works, but there are many examples on the web of how to do that, so I think it's worth trying.

Jan 22, 2012 at 11:08 AM
davidcshepherd wrote:

> Speaking about the keywords and other noise - I think it's a good idea to get rid of all operators, keywords, dots, commas and so on. However, one question comes to my mind: if the method body is a long one, how will we be able to show the appropriate line with the word used in the query?

This is a great question.  Here's an idea:

Search for "file" returns a set of documents.  Let's say the first document is the document for the Method Open File.  When a user clicks on that result we can show it in the file editor.  Because you have stored the file path and definition line number we can quickly look up the text of the method and do a local search for the text "file", which we could highlight.  Does this approach address your concern?  I think this is pretty similar (if not the same) to what you proposed...

Yes - this is what I wanted to confirm - I completely agree with your solution

Jan 22, 2012 at 11:15 AM

One more comment on that - looking at the whole discussion, I think it's best for now to implement all the core classes with all the fields I mentioned (which I'm going to do today). After that I will create the Indexer classes, which will convert the core classes into Lucene documents (subclasses of the SandoDocument class). If we find out that the way we store the information about the program elements is not the most efficient one (or is not efficient at all), I will only have to change these conversion classes along with the queries, but I think it's better to start with everything and limit the functionality only in case of performance (or other types of) problems - do you agree with that?

Jan 22, 2012 at 12:03 PM

I also renamed all ProgramElement subclasses, appending "Element" to the class names (Class -> ClassElement) to prevent conflicts with existing .Net classes.

Jan 22, 2012 at 1:13 PM

> but I think it's better to start with everything and limit the functionality only in case of performance (or other types of) problems - do you agree with that?

Yes, this sounds like a great approach to me!

Jan 22, 2012 at 3:11 PM
It seems that program comments can also be easily parsed and provided to the indexer. Perhaps we can also add another class in the Core to represent them.


Jan 22, 2012 at 4:11 PM
kostata wrote:
It seems that program comments can also be easily parsed and provided to the indexer. Perhaps we can also add another class in the Core to represent them.


I think it's a very good idea. However, I suppose this will be used as an option, like "search within comments", turned off by default, as usually you are not interested in the commented lines of code.

If Dave and other people agree to go with your idea, I can add the class you mentioned to the Core project

Jan 22, 2012 at 6:15 PM
I agree - it seems to me that comments may dominate the search results if they were not considered separately.


Jan 23, 2012 at 2:48 PM

You both bring up interesting points.  Two thoughts:

1. I think we should at least have an option to search comments, and we should index them.

2. While I agree that commented code isn't that interesting to search, and thus should be turned off by default, what about normal comments (e.g., //open the file and move it to the local folder)?  It seems like these types of comments would be very interesting to search by default.

Either way, it sounds like a good idea to proceed with including comments in the Core project, etc.

Thanks for bringing this up!

Jan 23, 2012 at 6:29 PM

Kosta - is it possible for the parser to distinguish between normal comments and the ones mentioned by Dave? The first kind usually starts with "/**" and the second with "//", but that's not an official rule. However, we can probably treat all the comments outside method or property bodies as the first group and all the others as the second one - do you agree, or did I forget about something?

I'm going to create two classes for these types of comments, because the second one will have only one field - something like "Body" - while the first one can have some special keywords like "@param", which we may want to extract and store separately in the future (so that, e.g., we will be able to add the method description from the comment to the results of a method name search). For now I will make them the same, because comment parsing seems to be out of scope.

 

Jan 23, 2012 at 9:59 PM
I'm not sure at the moment whether the parser can distinguish between "//" and "/**/", but it certainly can tell if the comment is directly outside of a method or not.

I have another question for you Core people. At present, the MethodElement core class has Body and Arguments as strings. If we just dump those into Lucene, don't we risk polluting our results with C# keywords?

-Kosta





Jan 23, 2012 at 10:27 PM
kostata wrote:
  I have another question for you Core people. At present, the MethodElement core class has Body and Arguments as strings. If we just dump those into Lucene, don't we risk polluting our results with C# keywords?

I'm not sure if I understand the parser behavior but isn't it possible to remove all the keywords from the body and arguments and store what's left as a string?

In pseudo-code it would be something like 
var methodBody = parser.GetMethodBody().RemoveKeywords().RemoveUnnecessaryChars();

So that if you have method:

public int CalculateFactor(int oldFactor){int factor = oldFactor; factor = FactorManager.UpdateFactor(factor); return factor;}

it will be converted into:

 public int CalculateFactor(oldFactor){factor oldFactor factor FactorManager UpdateFactor factor factor}

or something similar.
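
As a rough sketch of the conversion above, the following pulls identifier-like tokens out of a method body with a regular expression and drops the ones found in a keyword list. KeywordFilter and Strip are hypothetical names, the keyword list is deliberately tiny, and this is plain text processing rather than what the parser itself does (for example, it would also pick up words inside string literals).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class KeywordFilter
{
    // Illustrative subset only; a real list would cover all C# keywords.
    private static readonly HashSet<string> Keywords = new HashSet<string>
    {
        "public", "private", "static", "void", "int", "string",
        "var", "return", "new", "if", "else", "for", "while"
    };

    // Keeps identifier-like tokens, dropping keywords, operators and punctuation.
    public static string Strip(string code)
    {
        var kept = new List<string>();
        foreach (Match match in Regex.Matches(code, @"[A-Za-z_][A-Za-z0-9_]*"))
        {
            if (!Keywords.Contains(match.Value))
            {
                kept.Add(match.Value);
            }
        }
        return string.Join(" ", kept);
    }
}

public static class KeywordFilterDemo
{
    public static void Main()
    {
        string body = "int factor = oldFactor; " +
                      "factor = FactorManager.UpdateFactor(factor); return factor;";
        Console.WriteLine(KeywordFilter.Strip(body));
        // factor oldFactor factor FactorManager UpdateFactor factor factor
    }
}
```

Applied to the body of CalculateFactor, this yields the same word sequence as the converted body shown above.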

 

If not - what do you get from the parser, and what type should the method body be?

Jan 23, 2012 at 10:29 PM

In the first post of this discussion Dave mentioned that the parser can filter out keywords, so it seems possible.

Jan 23, 2012 at 11:07 PM
Good point. Yeah, it shouldn't be hard to do.


Jan 23, 2012 at 11:11 PM

In general, Lucene has built-in support for stopwords. At least for Java; I'm not sure what Lucene.net has to offer, but I guess there is something similar to http://lucene.apache.org/java/3_5_0/api/all/org/apache/lucene/analysis/standard/StandardAnalyzer.html#StandardAnalyzer(org.apache.lucene.util.Version, java.io.File)

HTH

Marcel

Jan 24, 2012 at 2:33 AM

> So that if you have method:

> public int CalculateFactor(int oldFactor){int factor = oldFactor; factor = FactorManager.UpdateFactor(factor); return factor;}

> it will be converted into:

> public int CalculateFactor(oldFactor){factor oldFactor factor FactorManager UpdateFactor factor factor}

> or something similar.

The above is exactly right, with one small addition.  The parser should do exactly as mentioned above but it should also *split* identifiers apart.  Meaning, an identifier FactorManager should be split into the words Factor and Manager.  This will lead to better searching, as otherwise (I think) Lucene will think that FactorManager is a single word and will index it like that.  
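
One possible way to do the splitting is a regular expression over case boundaries. This is only a sketch: IdentifierSplitter and Split are hypothetical names, and the pattern assumes ASCII identifiers in camelCase or PascalCase, with underscores treated as separators.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class IdentifierSplitter
{
    // Three alternatives, tried in order:
    //   [A-Z]+(?![a-z])  a run of capitals not followed by lowercase ("XML")
    //   [A-Z]?[a-z]+     an optional capital followed by lowercase ("Factor", "old")
    //   [0-9]+           a run of digits
    private static readonly Regex WordPattern =
        new Regex(@"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+");

    // Splits an identifier into its component words, keeping their order.
    public static List<string> Split(string identifier)
    {
        var words = new List<string>();
        foreach (Match match in WordPattern.Matches(identifier))
        {
            words.Add(match.Value);
        }
        return words;
    }
}

public static class IdentifierSplitterDemo
{
    public static void Main()
    {
        Console.WriteLine(string.Join(" ", IdentifierSplitter.Split("FactorManager"))); // Factor Manager
        Console.WriteLine(string.Join(" ", IdentifierSplitter.Split("oldFactor")));     // old Factor
        Console.WriteLine(string.Join(" ", IdentifierSplitter.Split("XMLParser")));     // XML Parser
    }
}
```

It may be worth indexing the unsplit identifier alongside the split words, so that a search for the exact name FactorManager still matches.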

Jan 24, 2012 at 2:36 AM
mbruch wrote:

In general, Lucene has built-in support for stopwords. At least for Java; I'm not sure what Lucene.net has to offer, but I guess there is something similar to http://lucene.apache.org/java/3_5_0/api/all/org/apache/lucene/analysis/standard/StandardAnalyzer.html#StandardAnalyzer(org.apache.lucene.util.Version, java.io.File)

HTH

Marcel

Note: This is from Marcel Bruch of the Code Recommenders project (http://code-recommenders.blogspot.com/).  He definitely knows a thing or two about code search and he's a good resource for us as we move forward with this project. 

Thanks Marcel!

Jan 25, 2012 at 5:44 AM

Thanks for introducing me, David. I'm following the project by RSS and it's great to see how quickly you guys got started with this project. Hats off! Unfortunately I'm too short on time to participate more (and I'm not a C#/VS guy), but I would be happy if I can be of help by discussing some ideas. Best, Marcel