Using SAS and Perl for Large Datasets

Presentation Transcript


  1. Using SAS and Perl for Large Datasets March 21, 2007

  2. The Strong Points of SAS
     • SAS is designed to handle large datasets.
     • SAS language is robust
     • PROC SORT
     • PROC SUMMARY
     • PROC SQL
     • PROC REG

  3. The Strong Points of Perl
     • Free
     • Well-documented
     • CPAN – Comprehensive Perl Archive Network
     • Efficient at file handling.

  4. The Netflix Data
     • Download as a tar.gz file.
     • ReadMe file.
     • movie_titles.txt…
     • RMSE Perl script…
     • Training Set…

  5. Movie_Titles.txt
     1,2003,Dinosaur Planet
     2,2004,Isle of Man TT 2004 Review
     3,1997,Character
     4,1994,Paula Abdul's Get Up & Dance
     5,2004,The Rise and Fall of ECW
     6,1997,Sick

  6. Movies
     • Ziggy Stardust and the Spiders From Mars: The Motion Picture
     • Learning HTML: No Brainers
     • Godzilla vs. The Sea Monster
     • Frank Lloyd Wright
     • Rabbit-Proof Fence

  7. RMSE
     • Root Mean Square Error
     • $$S^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2$$ where:
     • $n$ = number of observations
     • $X_i$ = $i$th observation out of $n$
     • $\bar{X}$ = mean of $X$ (or, in our case, the predicted $X$)
     • $\mathrm{RMSE} = \sqrt{S^2}$
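The formula above can be sketched in a few lines of Perl. This is only an illustration, not the RMSE script that ships with the Netflix data: the `rmse` subroutine name and the sample ratings are made up, and the predicted rating takes the place of the mean, as the slide describes.

```perl
use strict;
use warnings;

# Root mean square error between two equal-length lists of ratings.
# Hypothetical helper: rmse(\@predicted, \@actual)
sub rmse {
    my ($predicted, $actual) = @_;
    my $n = @$actual;
    die "lists must be non-empty and the same length\n"
        unless $n > 0 && $n == @$predicted;

    my $sum_sq = 0;
    for my $i (0 .. $n - 1) {
        my $diff = $actual->[$i] - $predicted->[$i];
        $sum_sq += $diff * $diff;      # accumulate squared errors
    }
    return sqrt($sum_sq / $n);         # square root of the mean squared error
}

# Made-up ratings, for illustration only.
my @actual    = (3, 5, 4, 4);
my @predicted = (3, 4, 4, 2);
printf "RMSE = %.4f\n", rmse(\@predicted, \@actual);   # prints "RMSE = 1.1180"
```

Here the squared errors are 0, 1, 0 and 4, so the mean is 1.25 and its square root is about 1.1180.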

  8. The Training Set
     • 17770 files

  9. The Training Set
     1:
     1488844,3,2005-09-06
     822109,5,2005-05-13
     885013,4,2005-10-19
     30878,4,2005-12-26
     823519,3,2004-05-03
     893988,3,2005-11-17
     124105,4,2004-08-05
     1248029,3,2004-04-22

     Movie:
     Person,Rating,Date
     Person,Rating,Date
     Person,Rating,Date
     . . .

  10. Data Marts
      • Data marts are subsets of a larger data set.
      • Contents are determined by the problem at hand.
      • Contents may change over time or remain static.

  11. Why Use Data Marts
      • Increases query performance.
      • Decreases storage costs.
      • Decreases risk.
      • Serves as a proving ground for new code or equipment.
      • Lets effort go toward problem solving instead of data management.
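As a rough sketch of what "a subset of a larger data set" can look like in practice, the Perl below keeps only the rating records for a chosen set of movies. It assumes the records are already in a flat `movie,person,rating,date` form; the `build_mart` name and the records for movies 2 and 3 are invented for illustration.

```perl
use strict;
use warnings;

# Keep only the records whose first field (the movie ID) is a key of %$wanted.
# Hypothetical helper: build_mart(\@records, \%wanted)
sub build_mart {
    my ($records, $wanted) = @_;
    my @mart;
    for my $line (@$records) {
        my ($movie) = split /,/, $line;    # first comma-separated field
        push @mart, $line if $wanted->{$movie};
    }
    return @mart;
}

# Sample records; the rows for movies 2 and 3 are made up.
my @records = (
    '1,1488844,3,2005-09-06',
    '1,822109,5,2005-05-13',
    '2,1000001,4,2005-01-01',
    '3,1000002,2,2005-02-02',
);

# A data mart holding only the ratings for movies 1 and 3.
my %wanted = (1 => 1, 3 => 1);
print "$_\n" for build_mart(\@records, \%wanted);
```

Because the selection is just a hash lookup per record, the contents of the mart can change by editing `%wanted`, which matches the idea that a data mart is shaped by the problem at hand.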

  12. Perl for Data Mart Assembly
      What we have…
      1:
      1488844,3,2005-09-06

      What we want…
      1,1488844,3,2005-09-06

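  13. Perl2.pl
      # Print the name and contents of every .txt file in the current directory.
      while (<*.txt>) {
          $file = $_;                      # save the file name before $_ is reused
          print $file, "\n";
          open(IN, "< $file") or die "Cannot open $file: $!";
          while (<IN>) {
              print "$_";
          }
          close IN;
      }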

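  14. Perl3.pl
      # Write every rating line from every .txt file to one output file,
      # prefixing each line with the name of the file it came from.
      open(OUT, "> output.tx") or die "Cannot open output.tx: $!";
      while (<*.txt>) {
          $file = $_;                      # save the file name before $_ is reused
          print $file, "\n";
          open(IN, "< $file") or die "Cannot open $file: $!";
          while (<IN>) {
              # Skip the movie header line (for example "1:"); keep rating lines.
              unless ($_ =~ /[0-9]*:/) {
                  print OUT "$file,$_";
              }
          }
          close IN;
      }
      close OUT;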
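Perl3.pl prefixes each rating with the file name, while the "What we want…" slide shows the movie number as the first field. One possible variation, sketched below, captures the number from the header line and uses it as the prefix. The `flatten_ratings` name is hypothetical, and the sketch works on an in-memory list of lines rather than on the real training files.

```perl
use strict;
use warnings;

# Turn training-file lines into "movie,person,rating,date" records.
# Hypothetical helper: takes a list of raw lines, returns a list of records.
sub flatten_ratings {
    my (@lines) = @_;
    my ($movie, @out);
    for my $line (@lines) {
        $line =~ s/\s+\z//;              # strip the newline and trailing whitespace
        next if $line eq '';
        if ($line =~ /^(\d+):$/) {       # header line such as "1:"
            $movie = $1;                 # remember the current movie number
        } elsif (defined $movie) {
            push @out, "$movie,$line";   # rating line: prefix with the movie number
        }
    }
    return @out;
}

# Lines taken from the training-set slide.
my @sample = ("1:\n", "1488844,3,2005-09-06\n", "822109,5,2005-05-13\n");
print "$_\n" for flatten_ratings(@sample);
# prints:
# 1,1488844,3,2005-09-06
# 1,822109,5,2005-05-13
```

Anchoring the pattern with `^` and `$` means only a line consisting of digits followed by a colon is treated as a header, which is a slightly stricter test than the one in Perl3.pl.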

  15. SAS Tips and Tricks
      • Getting started
      • OPTIONS OBS=0;
      • WHERE…
      • Avoid temporary data sets.
      • Make permanent SAS data sets.
      • Access data sets with SQL.
