JARVIS: Helping LinkedIn Navigate its Source Code


Relevance

Relevance is a critical piece of any search system, and our code search is no exception: the files a user is most likely to open should appear at the top of the results. Relevance for us involves assigning a score to every retrieved document, ranking the documents by this score, and then returning the top K results.

Score assignment is the most critical part of our relevance. We have multiple types of features, and the final score is a linear combination of the per-feature scores.
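As a rough illustration, the scoring step could look like the sketch below; the feature names and weights are hypothetical and chosen only to show the shape of the linear combination, not JARVIS's actual configuration.

import java.util.Map;

public class LinearScorer {
    // Hypothetical per-feature weights; the real system tunes these separately.
    private static final Map<String, Double> WEIGHTS = Map.of(
        "matchInfo", 0.50,
        "importance", 0.30,
        "interpretedQuery", 0.15,
        "fileSize", 0.05);

    /** Combines per-feature scores for one document into a single final score. */
    public static double score(Map<String, Double> featureScores) {
        double total = 0.0;
        for (Map.Entry<String, Double> entry : WEIGHTS.entrySet()) {
            total += entry.getValue() * featureScores.getOrDefault(entry.getKey(), 0.0);
        }
        return total;
    }
}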

Some of the high-level features are:

Match info
This is a very important component of our score. It helps us keep the final results as topical as possible. A match in a field earns a boost, and the size of the boost depends on the importance of that field; we assign different weights to different fields.
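A minimal sketch of this kind of field-weighted boosting is shown below; the field names and boost values are assumptions made for illustration, not the actual weights used in JARVIS.

import java.util.Map;
import java.util.Set;

public class MatchInfoScorer {
    // Hypothetical field weights: a hit in the file name counts for more
    // than a hit somewhere in the file body.
    private static final Map<String, Double> FIELD_BOOSTS = Map.of(
        "fileName", 4.0,
        "className", 3.0,
        "methodName", 2.0,
        "body", 1.0);

    /** Sums the boosts of every field in which the query matched. */
    public static double score(Set<String> matchedFields) {
        return matchedFields.stream()
            .mapToDouble(field -> FIELD_BOOSTS.getOrDefault(field, 0.0))
            .sum();
    }
}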

Importance Score
We also have a Hadoop flow that assigns scores based on the number of inward and outward edges in a dependency graph. We compute this score at both the file and the project level to determine importance. For a Java file, we take its imports and construct the dependency graph from the file to all of the imported class files. Similarly, for a project, a dependency graph is constructed from the dependency information in its build files. The Hadoop flow uses the PageRank implementation from JUNG to compute these scores.

The above flow is run on these graphs in two ways. First, it is run with the edges of the dependency graph pointing from a dependent node to its dependency node. The scores computed in this phase are called “authority scores”: nodes with higher scores tend to be source files, libraries, etc. For example, a source file would score higher than its test file and, similarly, a library would score higher than the code that depends on it.

Next, we invert the edges in the dependency graph, so that the edges run from a dependency to its dependent nodes. The scores computed in this phase are called “hub scores.”

The intuition behind computing hub scores is that nodes (files or projects) where integration happens (that is, they reference many files or libraries) will score higher than others.
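The sketch below shows how both runs could be set up with JUNG's PageRank on a toy dependency graph, assuming the JUNG 2.x API; the files, edges, and damping value are illustrative only. Authority scores come from the dependent-to-dependency graph, and hub scores from the same graph with every edge reversed.

import edu.uci.ics.jung.algorithms.scoring.PageRank;
import edu.uci.ics.jung.graph.DirectedSparseGraph;

public class ImportanceScores {
    public static void main(String[] args) {
        // Dependency graph: each edge points from a dependent file to the
        // file it imports. Contents are a made-up example.
        DirectedSparseGraph<String, String> deps = new DirectedSparseGraph<>();
        addDependency(deps, "FooTest.java", "Foo.java");
        addDependency(deps, "Foo.java", "StringUtils.java");
        addDependency(deps, "Bar.java", "StringUtils.java");

        // Reversed graph (dependency -> dependent), used for hub scores.
        DirectedSparseGraph<String, String> reversed = new DirectedSparseGraph<>();
        for (String edge : deps.getEdges()) {
            reversed.addEdge(edge, deps.getDest(edge), deps.getSource(edge));
        }

        // 0.15 is the usual random-jump probability; chosen here for illustration.
        PageRank<String, String> authority = new PageRank<>(deps, 0.15);
        authority.evaluate();
        PageRank<String, String> hub = new PageRank<>(reversed, 0.15);
        hub.evaluate();

        for (String file : deps.getVertices()) {
            System.out.printf("%s authority=%.3f hub=%.3f%n",
                file, authority.getVertexScore(file), hub.getVertexScore(file));
        }
    }

    private static void addDependency(
            DirectedSparseGraph<String, String> graph, String from, String to) {
        // Edge labels just need to be unique; the "from->to" string works here.
        graph.addEdge(from + "->" + to, from, to);
    }
}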

So, we end up with four different scores for each file:

  • Hub score for a file

  • Hub score for the project where this file resides

  • Authority score for a file

  • Authority score for the project where this file resides

We take a weighted sum of these scores, with a higher weight for authority scores, to compute the final importance score for a file.
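As a sketch, the combination could be as simple as the following; the weights are made up, keeping the authority weights higher as described above.

public final class FileImportance {
    // Illustrative weights only; authority scores carry more weight than hub scores.
    private static final double W_FILE_AUTHORITY = 0.4;
    private static final double W_PROJECT_AUTHORITY = 0.3;
    private static final double W_FILE_HUB = 0.2;
    private static final double W_PROJECT_HUB = 0.1;

    /** Weighted sum of the four per-file and per-project scores. */
    public static double importance(double fileAuthority, double projectAuthority,
                                    double fileHub, double projectHub) {
        return W_FILE_AUTHORITY * fileAuthority
             + W_PROJECT_AUTHORITY * projectAuthority
             + W_FILE_HUB * fileHub
             + W_PROJECT_HUB * projectHub;
    }
}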

Interpreted Query Score
We try to interpret the query, and if a retrieved document matches the query interpretation, we boost its rank by increasing its score.
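A minimal sketch of this kind of boost, assuming a hypothetical flag that indicates whether a document matches the query interpretation and an illustrative multiplier:

public class InterpretedQueryScorer {
    // Hypothetical boost factor applied when a document matches the interpretation.
    private static final double INTERPRETATION_BOOST = 2.0;

    /** Multiplies the document's score when it matches the interpreted query. */
    public static double score(double baseScore, boolean matchesInterpretation) {
        return matchesInterpretation ? baseScore * INTERPRETATION_BOOST : baseScore;
    }
}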

File Size Score
We use this to demote files that are either too large or too small.
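One simple way this could be done is to apply a penalty outside an assumed size range; the thresholds and penalty below are hypothetical:

public class FileSizeScorer {
    // Illustrative thresholds: very small files are often stubs, very large
    // ones are often generated code or data dumps.
    private static final long MIN_BYTES = 200;
    private static final long MAX_BYTES = 500_000;

    /** Returns a neutral score for "normal" sizes and a penalty otherwise. */
    public static double score(long sizeInBytes) {
        if (sizeInBytes < MIN_BYTES || sizeInBytes > MAX_BYTES) {
            return 0.5; // demote files that are too small or too large
        }
        return 1.0;
    }
}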

Conclusion

Having a tool like JARVIS gives a big operational boost to an engineering organization. Today, code search is used by developers, SREs, and Ops alike, and it has become the de facto way of finding code at LinkedIn.

JARVIS is able to scale horizontally with little friction, which helped us onboard various types of code repositories with ease. It also supports nearline ingestion, meaning that shortly after code is committed, it becomes available for search.

Building JARVIS at LinkedIn gave us a taste of problems that are not commonly seen in other search systems. For example, tokenization for us is completely different from, and much more sophisticated than, that of many other search systems, because the expectations placed on a query are vastly different. It also gave us the opportunity to showcase how Galene can empower such a search system.

Acknowledgments

JARVIS is the result of collaborative efforts from the Bangalore Search, Bangalore Tools, and Bangalore Search SRE teams. We would like to thank Chandramouli Mahadevan and Abhijit Belapurkar for their guidance and support; Manoj K. Sure, Shubham Agarwal, Manoj A. Bode, Akhil Thatipamula, Sachin Hosmani, Mansi Gupta, and all the interns for their contributions to the search backend and frontend; Prince Valluri and Naman Jain for building the data pipeline; Binish Rathnapalan and Gaurav Gupta for helping us make timely releases; and Sanjay Singh for QA support, which helped us make our code more robust. Lastly, we would like to thank the Galene team as well; this would not have been possible without their platform support.


