
Improving Software Package Search Quality


Presentation Transcript


    Slide 1: Improving Software Package Search Quality

    Dan Fingal and Jamie Nicolson

    Slide 2: The Problem

    Search engines for software packages typically perform poorly
    They tend to search only the project name and blurb
    They don't take quality metrics into account
    The result is a poor user experience

    Slide 3: Improvements in Indexing

    More text relating to the package
    Every package is associated with a website that contains much more detailed information about it
    We spidered those sites and stripped away the HTML for indexing (see the sketch below)
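
    A minimal sketch of the HTML-stripping step, using Python's standard html.parser; the class and function names here are our own illustration, not code from the project:

        # Collect visible text from a package homepage, skipping <script>/<style>.
        from html.parser import HTMLParser

        class TextExtractor(HTMLParser):
            def __init__(self):
                super().__init__()
                self.chunks = []
                self._skip_depth = 0

            def handle_starttag(self, tag, attrs):
                if tag in ("script", "style"):
                    self._skip_depth += 1

            def handle_endtag(self, tag):
                if tag in ("script", "style") and self._skip_depth > 0:
                    self._skip_depth -= 1

            def handle_data(self, data):
                if self._skip_depth == 0 and data.strip():
                    self.chunks.append(data.strip())

        def strip_html(html):
            parser = TextExtractor()
            parser.feed(html)
            return " ".join(parser.chunks)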

    Slide 4: Improvements in Quality Metrics

    Software download sites have additional data in their repositories
    Popularity: frequency of download
    Rating: user-supplied feedback
    Vitality: how active development is
    Version: how stable the package is

    Slide 5: System Architecture

    Pulled data from Freshmeat.net and Gentoo.org into a MySQL database
    Spidered and extracted text from the associated homepages
    Indexed the text with Lucene, using Porter stemming and a stop-words list (see the sketch below)
    Similarity metrics as found in CS276A
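
    The analysis chain itself is Lucene's; conceptually it corresponds to something like this Python sketch, where the NLTK Porter stemmer and the tiny stop-word list are illustrative stand-ins rather than what the system actually used:

        # Lowercase, tokenize, drop stop words, Porter-stem.
        import re
        from nltk.stem import PorterStemmer

        STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "for", "in", "is", "it"}
        stemmer = PorterStemmer()

        def analyze(text):
            tokens = re.findall(r"[a-z0-9]+", text.lower())
            return [stemmer.stem(t) for t in tokens if t not in STOP_WORDS]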

    Slide 6: Ranking Function

    Lucene returns an ordered list of documents (packages)
    Copy this list and reorder it by each of the quality metrics (see the sketch below)
    Scaled footrule aggregation combines the lists into one results list
    The scaled footrule and Lucene field parameters can be varied
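
    A rough Python sketch of how the metric-based lists can be derived from Lucene's result list; the dictionary-style package records and field names are our own illustration:

        # One ranking from Lucene, plus one per quality metric (slide 4).
        def metric_rankings(lucene_ranked, metrics=("popularity", "rating", "vitality", "version")):
            rankings = {"lucene": [p["name"] for p in lucene_ranked]}
            for m in metrics:
                ordered = sorted(lucene_ranked, key=lambda p: p[m], reverse=True)
                rankings[m] = [p["name"] for p in ordered]
            return rankings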

    Slide 7: Scaled Footrule Aggregation

    The scaled footrule distance between two rankings of an item is the difference in its normalized position
    E.g., one list puts an item 30% of the way down and another puts it 70% down, so the scaled footrule distance is |0.3 - 0.7| = 0.4
    The scaled footrule distance between two lists is the sum of these distances over the items they have in common
    The scaled footrule of a candidate aggregate list and the input rankings is the sum of the distances between the candidate and each input ranking
    A minimum-cost maximum matching over a bipartite graph between elements and rank positions finds the optimal aggregate list (see the sketch below)
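
    A sketch of the aggregation in Python, using SciPy's linear_sum_assignment for the minimum-cost matching; the optional per-list weights and the handling of items missing from a ranking are our assumptions:

        # Cost of placing an item at aggregate position p is the summed
        # difference between p and the item's position in each input list,
        # all normalized; a min-cost perfect matching between items and
        # positions then yields the optimal aggregate order.
        import numpy as np
        from scipy.optimize import linear_sum_assignment

        def scaled_footrule_aggregate(rankings, weights=None):
            items = sorted({x for r in rankings for x in r})
            n = len(items)
            if weights is None:
                weights = [1.0] * len(rankings)
            cost = np.zeros((n, n))
            for i, item in enumerate(items):
                for p in range(n):
                    agg_pos = (p + 1) / n
                    for r, w in zip(rankings, weights):
                        if item in r:
                            cost[i, p] += w * abs(agg_pos - (r.index(item) + 1) / len(r))
            rows, cols = linear_sum_assignment(cost)
            by_position = sorted(zip(cols, (items[i] for i in rows)))
            return [item for _, item in by_position]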

    Slide 8: Measuring Success

    Created a gold corpus of 50 queries mapped to relevant packages
    One best answer per query
    Measured recall within the first 5 results, scoring 0 or 1 for each query (see the sketch below)
    Compared results with search on packages.gentoo.org, freshmeat.net, and google.com
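
    A minimal sketch of the recall-within-5 measurement; search() stands in for whichever engine is being evaluated, and the gold dictionary mirrors the query-to-package pairs on the next slide:

        # Each query scores 1 if its gold package is in the top five results.
        def recall_at_5(gold, search):
            hits = 0
            for query, best_package in gold.items():
                results = search(query)[:5]
                hits += int(best_package in results)
            return hits, len(gold)

        # e.g. gold = {"C compiler": "GCC", "wireless sniffer": "Kismet", ...}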

    Slide 9: Sample Gold Queries

    C compiler: GCC
    financial accounting: GNUCash
    wireless sniffer: Kismet
    videoconferencing software: GnomeMeeting
    wiki system: Tiki CMS/Groupware

    Slide 10: Training Ranking Parameters

    Training boils down to choosing optimal weights for each individual ranking metric
    Finding an analytic solution seemed too hairy, so we used local hill climbing with random restarts (see the sketch below)
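
    A sketch of hill climbing with random restarts over the metric weights; evaluate(weights) is assumed to return recall on the gold corpus, and the restart count, iteration budget, and step size are illustrative:

        import random

        def hill_climb(evaluate, n_weights=5, restarts=10, iters=200, step=0.1):
            best_w, best_score = None, -1
            for _ in range(restarts):
                w = [random.random() for _ in range(n_weights)]
                score = evaluate(w)
                for _ in range(iters):
                    # Perturb the weights and keep the move only if recall improves.
                    cand = [max(0.0, wi + random.uniform(-step, step)) for wi in w]
                    cand_score = evaluate(cand)
                    if cand_score > score:
                        w, score = cand, cand_score
                if score > best_score:
                    best_w, best_score = w, score
            return best_w, best_score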

    Slide 11: Training Results

    Uniform parameters gave a recall of 29/50 (58%)
    The best hill-climbing parameters gave a recall of 37/50 (74%)
    We're testing on the training set, but with only five parameters the risk of overfitting seems low
    Popularity was the highest-weighted feature

    Slide 12: Comparing to Google

    Google is a general search engine, not specific to software search
    How to help Google find software packages? Append "Linux software" to the queries
    How to judge the results? Does the package name appear in the page, or in the search results? (see the sketch below)
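
    One way the judging step could be sketched in Python; google_search is a placeholder, since the slides do not say how the Google results were retrieved:

        # A query counts as a hit if the package name shows up in the top results.
        def google_hit(query, package_name, google_search):
            results = google_search(query + " Linux software")[:5]
            return any(package_name.lower() in r.lower() for r in results)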

    Slide 14: Room for Improvement

    Gold corpus: Are these queries really representative? Are our answers really the best? Did we choose samples we knew would work well with our method?
    Training process: We could probably come up with better training algorithms

    Slide 15: Any questions?
