Scalable Cross-Comparison



  1. Scalable Cross-Comparison. Requirements Review

  2. Scalable Cross-Comparison. Requirements in Brief

     1. Product: cross-match candidates from pairs of tables
     2. Inputs: tables from multiple sources and distance
     3. Comparison distances large enough for science use
     4. Grouping / sorting / filtering functionality on result
     5. Minimum set of important catalogs
     6. Leverage pre-generated pair-wise comparisons
     7. Work at scale
     8. User control over N-way comparisons
     9. Program-friendly interface and GUI
     10. Output compatible with other VO services
     11. Use TAP and VOSpace

  3. Scalable Cross-Comparison. Product: cross-match candidates from pairs of tables

     1. The service must produce cross-match candidates for a pair of tables without prejudging which is the "preferred" cross-identification.
     1.1. Given a pair of tables or references to get them, the service must produce a list of all sources matching within a positional tolerance.
     1.2. For all cross-match candidates, the service must report the distance, position angle, and user-specified fields from the two input tables.

     The VAO Cross-Comparison services are not meant to produce cross-identifications but rather to efficiently (and where possible cleverly) determine all the candidates for such an identification, based entirely on positional proximity. The user identifies two tables (in a variety of ways; see later Requirements for details) and a positional tolerance, and the Cross-Comparison service returns a "joined" table combining matched records in the two inputs plus attributes (distance, position angle) of the match; the sketch below illustrates those per-candidate attributes.
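
     A minimal sketch, assuming astropy, of the two match attributes reported for each candidate (distance and position angle); the coordinates and tolerance here are illustrative, not values from the requirements.

         # Illustrative only: compute the distance and position angle that the
         # service would report for one candidate pair.
         from astropy.coordinates import SkyCoord
         import astropy.units as u

         src_a = SkyCoord(ra=10.6847 * u.deg, dec=41.2687 * u.deg)  # record from table A
         src_b = SkyCoord(ra=10.6852 * u.deg, dec=41.2692 * u.deg)  # candidate from table B
         tolerance = 5.0 * u.arcsec

         sep = src_a.separation(src_b)     # on-sky distance between the two positions
         pa = src_a.position_angle(src_b)  # position angle of the match, east of north

         if sep <= tolerance:
             print(f"candidate: distance = {sep.arcsec:.2f} arcsec, PA = {pa.deg:.1f} deg")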

  4. Scalable Cross-Comparison. Inputs: tables from multiple sources and distance

     2. The service must accept input tables from multiple sources.
     2.1. The service must accept on-line catalog references as input tables.
     2.2. The service must accept user-uploaded files (VOTable and other TBD formats) as input tables.

     There are a number of ways a user could get a Cross-Comparison service to use a specific table: upload it directly (e.g. as a VOTable in an HTTP POST); specify a URL to a stored file (e.g. a VOTable in a VOSpace or elsewhere); refer to a "well-known" catalog by name or identifier (e.g. "DENIS" or "ivo://CDS.VizieR/B/denis"); or pass in an ADQL query string that will result in a table when executed. All of these should eventually be supported; the Requirement here is meant to define an initial minimum subset (uploads and well-known catalogs).
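
     A hypothetical sketch of the two minimum input modes (a direct upload and a well-known catalog identifier). The endpoint URL and parameter names are assumptions for illustration, not the actual service interface.

         import requests

         SERVICE = "https://example.org/xcomp"  # placeholder endpoint, not a real service

         # Table 1 is uploaded directly as a VOTable in an HTTP POST;
         # table 2 is named by its "well-known" catalog identifier.
         with open("my_sources.vot", "rb") as f:
             resp = requests.post(
                 SERVICE,
                 files={"table1": f},
                 data={"table2": "ivo://CDS.VizieR/B/denis",  # catalog by identifier
                       "radius": "5"},                        # tolerance in arcsec (assumed)
             )
         resp.raise_for_status()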

  5. Scalable Cross-Comparison. Comparison distances large enough for science use

     3. The service must support comparison distances large enough to account for the proper motions of Galactic (i.e. extra-solar) objects (TBD but several arcmin minimum).

     There are actually a lot of variants here, more than it makes sense to enumerate in a set of Requirements. The smaller the tables being compared, the larger the tolerance that is acceptable, and even a small table compared to a large one can allow pretty large values (e.g. 1000 user sources vs. SDSS with a 2-degree tolerance). For very-large vs. very-large (e.g. 2MASS vs. SDSS) it is more appropriate to keep the tolerances smaller, both for practical reasons (result size and compute time) and for scientific reasons (too much extraneous information).
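
     A back-of-envelope check (not from the requirements themselves) of why "several arcmin" is a sensible floor: the highest proper-motion star, Barnard's Star, moves about 10.3 arcsec/yr, so across a 60-year epoch difference between catalogs it drifts roughly 10 arcmin.

         # Required tolerance to recover a fast mover across an epoch gap.
         pm_arcsec_per_yr = 10.3    # Barnard's Star, the fastest known proper motion
         epoch_baseline_yr = 60.0   # e.g. a modern survey vs. a mid-century plate catalog
         drift_arcmin = pm_arcsec_per_yr * epoch_baseline_yr / 60.0
         print(f"required tolerance ~ {drift_arcmin:.1f} arcmin")  # ~10.3 arcmin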

  6. Scalable Cross-Comparison. Grouping / sorting / filtering functionality on result

     4. The service should provide grouping / sorting and hopefully filtering functionality for combined results. For instance, deciding between a group of potential matches is an evaluation of the distances for each match (preferably sorted by distance) and physical measures such as cross-catalog color.

     The current set of these services is being implemented in conjunction with relational databases (though this is not itself a requirement). It therefore makes sense to allow at least some of this database technology to be factored into the Cross-Comparison service (for efficiency, doing the work where the data is located). In general, though, this is really more of a TAP capability, separate from Cross-Comparison. We are sitting on the fence a little here: willing to provide some of this functionality but not wanting to factor in all the TAP/ADQL capabilities/syntax.
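
     A sketch of the kind of post-processing intended, assuming the joined result is loaded as an astropy Table; the column names (id_a, dist, j_mag_2mass, r_mag_sdss) are hypothetical.

         from astropy.table import Table

         result = Table.read("xcomp_result.vot", format="votable")  # service output

         # Derive a cross-catalog color, group by input source with the nearest
         # candidate first, then apply a simple filter on the derived column.
         result["color"] = result["j_mag_2mass"] - result["r_mag_sdss"]
         result.sort(["id_a", "dist"])
         blue = result[result["color"] < 1.0]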

  7. Scalable Cross-Comparison. Minimum set of important catalogs

     5. The service must have access to a minimum set of important catalogs to be chosen during the design phase (with VAO Project management oversight).

     Comparing two user catalogs requires no prior setup, but if we are going to compare to "well-known" catalogs we need copies of them or at least access to them. This is a non-trivial amount of work, so we need the VAO Project to identify up front which catalogs need to be included.

  8. Scalable Cross-Comparison. Leverage pre-generated pair-wise comparisons

     6. The service may leverage pre-generated pair-wise comparisons when scientifically appropriate (either by user specification or based on requested radius).

     A lot of catalog combinations will be wanted by many users (USNO-B1.0 vs. 2MASS vs. SDSS, etc.). We can speed such requests up considerably by pre-generating full cross-comparisons between these large catalogs and saving the results. Satisfying the user request is then just a matter of querying this pre-built table. Of course, such pre-built products can only be used where the user tolerance is less than that used to construct the product and if we have carried along all the fields in the tables the user may request. Since the result is absolutely identical to what we would have gotten if we had done the cross-comparison dynamically, all of this can be decided at run time by the service; the user need not care which method we use.
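
     A sketch of that run-time decision: a pre-built product is usable only if the requested radius does not exceed the radius it was built with and it carries every field the user asked for. The data layout here is an assumption for illustration.

         def can_use_prebuilt(prebuilt, radius_arcsec, requested_fields):
             # Both conditions from the discussion above: tolerance within the
             # build radius, and all requested fields carried in the product.
             return (radius_arcsec <= prebuilt["built_radius_arcsec"]
                     and set(requested_fields) <= set(prebuilt["fields"]))

         prebuilt = {"pair": ("2MASS", "SDSS"),
                     "built_radius_arcsec": 5.0,
                     "fields": {"ra", "dec", "j_mag", "r_mag", "dist", "pa"}}

         if can_use_prebuilt(prebuilt, 3.0, ["ra", "dec", "dist"]):
             pass  # filter the stored table to dist <= 3.0 arcsec instead of recomputing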

  9. Scalable Cross-Comparison. Work at scale

     7. The services must work at scale.
     7.1. The service must support large (> 10^6 record) input tables and results.
     7.2. The service must support asynchronous execution.

     As with Requirement 3 (tolerance limit), the count scaling is context-sensitive. A request that involves a pre-built comparison between SDSS and 2MASS is intrinsically dealing with several hundred thousand "sources" (and matches). In a lot of cases, the mechanics of dealing with uploaded input tables and results (say through a VOSpace) may be more demanding than the actual cross-comparison step. The million-record figure seems reasonable for such use (i.e. a table size convenient for upload). With this scale of processing, we will frequently require processing times that are too long to reasonably be considered synchronous. Rather than confront the user with multiple modes (and no a priori way to determine which will be used), we will probably make the service intrinsically asynchronous. It is easy enough for users (programmers) to wrap such services to look locally like blocking calls (a sketch follows), and we will certainly do this in our GUI, but this way we keep things simple yet fully flexible.
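
     A sketch of such a blocking wrapper around an asynchronous job (UWS-style submit / poll / fetch); the endpoint layout and phase strings are assumptions for illustration, not the actual protocol.

         import time
         import requests

         def blocking_xcomp(service_url, params, poll_s=10):
             """Submit an async cross-comparison job and block until it finishes."""
             job_id = requests.post(service_url, data=params).json()["id"]  # submit
             while True:
                 phase = requests.get(f"{service_url}/{job_id}/phase").text
                 if phase == "COMPLETED":
                     return requests.get(f"{service_url}/{job_id}/results")  # fetch
                 if phase in ("ERROR", "ABORTED"):
                     raise RuntimeError(f"job {job_id} ended in phase {phase}")
                 time.sleep(poll_s)  # still running; wait and poll again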

  10. Scalable Cross-Comparison. User control over N-way comparisons

     8. The user must be able to control the means of handling N-way comparisons that can affect the comparison results.
     8.1. The user must be able to choose the order in which the comparisons are made.
     8.2. The user must be able to choose what columns are used to represent the position resulting from a preceding comparison.

     The underlying cross-comparison engine operates, as stated, pair-wise. However, a common use-case scenario will be to generate an N-way join as a sequence of pair-wise comparisons. To facilitate this, it must be easy to loop the results of a previous step into the next iteration (e.g. the output of the initial 2-way as one of the two tables in the next-step "3-way"). An important part of this is deciding which coordinates in the input to use (there are two of these coming out of an initial 2-way, and one more for each step if we keep them all).
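
     A sketch of that chaining, with xcomp() as a stand-in for one pair-wise service call; position_cols lets the user pick which coordinate pair from the previous step's output drives the next match.

         def n_way(tables, radius, xcomp, position_cols):
             """Chain pair-wise comparisons into an N-way join, in user order."""
             result = tables[0]
             for step, nxt in enumerate(tables[1:]):
                 # position_cols[step] names the coordinate columns in `result`
                 # (e.g. those carried over from the first input) to match on.
                 result = xcomp(result, nxt, radius, pos=position_cols[step])
             return result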

  11. Scalable Cross-Comparison. Program-friendly interface and GUI

     9. The project must provide interfaces to the service at at least two levels.
     9.1. The project must provide a low-level (program) interface compatible with general user scripting (e.g. perl, python, shell, IRAF, IDL, etc.).
     9.2. The project must provide a simple but usable GUI appropriate for astronomers.

     The core service here has a basic HTTP POST/REST interface and is therefore pretty much directly usable from just about any programming environment (see the upload sketch under Requirement 2). A basic usable interface can be constructed as an HTML form driving this service, but as resources permit we will augment this with catalog selection utilities, job monitoring, result processing and visualization. However, these augmentations can neither be guaranteed nor is it appropriate to include them as requirements on the Cross-Comparison service.

  12. Scalable Cross-Comparison. Output compatible with other VO services

     10. The form of the results from this service must be compatible with the input to other VO services, such as image cutouts and data ordering.

     This is a little open-ended, but we want to ensure that what comes out of the Cross-Comparison service can be used as directly as possible by other VO services. We've said this explicitly with regard to reusing this output iteratively in the Cross-Comparison itself, and since we emit VOTables we will be directly usable by programs like Topcat. Properly labeling columns (e.g. positional UCDs) should be enough to support SIA uploads for image cutouts. This requirement is just to ensure we keep on top of all such use cases.
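
     A sketch, using astropy's votable API on a toy table, of labeling output positions with UCDs so the result can feed SIA uploads and tools like Topcat directly; the column set is illustrative.

         from astropy.table import Table
         from astropy.io.votable import from_table

         t = Table({"ra": [10.68], "dec": [41.27], "dist": [1.2]})
         vot = from_table(t)

         # Attach positional UCDs to the output fields before writing.
         fields = {f.name: f for f in vot.get_first_table().fields}
         fields["ra"].ucd = "pos.eq.ra;meta.main"
         fields["dec"].ucd = "pos.eq.dec;meta.main"

         vot.to_xml("xcomp_result.vot")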

  13. Scalable Cross-Comparison. Use TAP and VOSpace

     11. Where available, the cross-comparison service must be able to utilize TAP and VOSpace services.
     11.1. A TAP query URL can be used as an input catalog source.
     11.2. A VOSpace can be specified as a location to find input tables and a place to put cross-comparison results.

     We left this Requirement to the end not because it is unimportant but because our use of these technologies depends more on VAO getting instances of these services in place than on any action of ours. When there are TAP services to use, it is easy to forward a user's ADQL to them and capture the result table for Cross-Comparison use. When there is a reasonable volume of reliable VOSpace storage somewhere that we and the users have access to, we will retrieve input tables there and place our results there.
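
     A sketch of the TAP side using pyvo; the service URL and ADQL are illustrative, and the captured table would then be handed to the cross-comparison as an input.

         import pyvo

         tap = pyvo.dal.TAPService("https://example.org/tap")  # placeholder TAP endpoint
         adql = "SELECT ra, dec, jmag FROM twomass WHERE dec BETWEEN 40 AND 42"

         # Forward the user's ADQL and capture the result table.
         input_table = tap.search(adql).to_table()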

  14. Scalable Cross-Comparison. Requirements Verbatim

     • 1. The service must produce cross-match candidates for a pair of tables without prejudging which is the "preferred" cross-identification.
     • 1.1. Given a pair of tables or references to get them, the service must produce a list of all sources matching within a positional tolerance.
     • 1.2. For all cross-match candidates, the service must report the distance, position angle, and user-specified fields from the two input tables.
     • 2. The service must accept input tables from multiple sources.
     • 2.1. The service must accept on-line catalog references as input tables.
     • 2.2. The service must accept user-uploaded files (VOTable and other TBD formats) as input tables.
     • 3. The service must support comparison distances large enough to account for the proper motions of Galactic (i.e. extra-solar) objects (TBD but several arcmin minimum).
     • 4. The service should provide grouping / sorting and hopefully filtering functionality for combined results. For instance, deciding between a group of potential matches is an evaluation of the distances for each match (preferably sorted by distance) and physical measures such as cross-catalog color.
     • 5. The service must have access to a minimum set of important catalogs to be chosen during the design phase (with VAO Project management oversight).
     • 6. The service may leverage pre-generated pair-wise comparisons when scientifically appropriate (either by user specification or based on requested radius).
     • 7. The services must work at scale.
     • 7.1. The service must support large (> 10^6 record) input tables and results.
     • 7.2. The service must support asynchronous execution.
     • 8. The user must be able to control the means of handling N-way comparisons that can affect the comparison results.
     • 8.1. The user must be able to choose the order in which the comparisons are made.
     • 8.2. The user must be able to choose what columns are used to represent the position resulting from a preceding comparison.
     • 9. The project must provide interfaces to the service at at least two levels.
     • 9.1. The project must provide a low-level (program) interface compatible with general user scripting (e.g. perl, python, shell, IRAF, IDL, etc.).
     • 9.2. The project must provide a simple but usable GUI appropriate for astronomers.
     • 10. The form of the results from this service must be compatible with the input to other VO services, such as image cutouts and data ordering.
     • 11. Where available, the cross-comparison service must be able to utilize TAP and VOSpace services.
     • 11.1. A TAP query URL can be used as an input catalog source.
     • 11.2. A VOSpace can be specified as a location to find input tables and a place to put cross-comparison results.
