
Analysis of Circulation Transaction Data for the UCLA Library: 1997 through 2003




  1. Analysis of Circulation Transaction Data for the UCLA Library: 1997 through 2003 Robert M. Hayes

  2. Preface • The purpose of this presentation is to provide an overview of the analysis of data related to circulation transactions for the UCLA Library system during the seven years from 1997 through 2003. • It will briefly discuss the source data. It will then, equally briefly, discuss the means for processing of those data. • It will then discuss some of the results from the analysis.

  3. Summary • The Source Data • The Processing • The Results • The Conclusions

  4. The Source Data • Source Files • 1997-1999 Data Files • 2000-2003 Data Files • Source Formats • 1997-1999 File Formats • 2000-2003 File Format • Source Problems • 1997-1999 Problems • 2000-2003 Problems

  5. Source Files: 1997-1999 Files • Given the sizes of these files, I needed to split each of them into sub-files each containing about 60,000 entries so that I could process using Excel. The result was the following set of files that became the sources for all subsequent processing:

  6. Source Files: 2000-2003 Files • The source files were all txt files, each about 60,000 entries, but after importing each into Excel, they were then stored as .xls files, as follows:

  7. Source Formats: 1997-1999 Files • (1) The transaction codes are: C=Chargeout; D=Paid bill; F=Fine bill; H=Hold; N=States never checked out; P=SRLF page; R=Renew; T=States return. • (2) The date/time format is YYMMDDHHMM. • Note that the actual file sizes imply 53 bytes per record, so there are two bytes that are record separators (i.e., CR, LF).
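As a minimal sketch of how a YYMMDDHHMM stamp could be decoded, the following uses Python's standard `datetime.strptime`. The function name `parse_timestamp` and the sample stamp are illustrative assumptions, not taken from the source files. The two-digit year relies on Python's `%y` convention, which maps 69 through 99 to 1969 through 1999 and 00 through 68 to 2000 through 2068.

```python
from datetime import datetime


def parse_timestamp(stamp: str) -> datetime:
    """Parse a 10-digit YYMMDDHHMM stamp (hypothetical helper name)."""
    if len(stamp) != 10 or not stamp.isdigit():
        raise ValueError(f"expected 10 digits in YYMMDDHHMM form, got {stamp!r}")
    # %y is a two-digit year; Python maps 69-99 to 19xx and 00-68 to 20xx.
    return datetime.strptime(stamp, "%y%m%d%H%M")


# Illustrative stamp only; not a value from the source data.
print(parse_timestamp("9803151430"))
```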

  8. Source Formats: 2000-2003 Files

  9. Source Problems: 1997-1999 Files • There were a few identifiable problems with the source data as such. Specifically, seven entries have the following ID Bar Codes, Transaction Codes, and Years: • L 766656896C98, from Col CR of X-Distribute 1998, in File arch98b-t06.xls • L 766656896T98, from Col CR of X-Distribute 1998, in File arch98b-t06.xls • L 767721954C98, from Col CR of X-Distribute 1998, in File arch98a-t02.xls • L 771471539C98, from Col CR of X-Distribute 1998, in File arch98b-t06.xls • L 776229833C98, from Col CR of X-Distribute 1998, in File arch98b-t06.xls • L 776229833H98, from Col CR of X-Distribute 1998, in File arch98b-t06.xls • L 787433333C99, from Col CY of X-Distribute 1999, in File arch99b-t00.xls • The problem with these seven entries is that I have no real way to reconcile what appear to be erroneous ID Bar Codes. They should each start with L 07 rather than L 7, but I do not know where to delete a character before the transaction code in order to bring things into alignment. • Beyond that, it turned out that significant numbers of transactions are missing: they “fell through the cracks” at the end of the 1999 academic year, when there was a transition from one circulation control system to the next. How many are missing is simply unknown.

  10. Source Problems: 2000-2003 Files • There are 115,208 entries for which the format is incorrect. They are identifiable as entries of lengths 102, 103, 104, and 105. Basically, the Class field is 18 characters too far to the left and the Borrower field is 10 characters too far to the right. • Beyond that, 7720 items lack fields because the record terminates before those fields occur: 46 from Notes and Materials on; 958 from Class on; 4182 from Borrower on; and 2534 from Institution on. • Finally, two of the files – AA0002958619.txt and AA0008630766.txt – appear to have incorrect formats for the ID Bar Code, since the 000 should be just 00. All of the other fields are properly located (except for the other errors), but the ID Bar Code is one character too long. To illustrate from the first entries of files 28 and 29: • A0013689070 005010500C00030808091807-ALS-5304SRLURLBook • AA0002958619 0111010500C0109050102107-ANE-6792SRLURLBook • Note that the transaction code is one character to the right in the second entry. • The crucial point is that in the set of files for 1997-1999, there are NO entries starting AA0002. There are ONLY entries starting AA002.
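A minimal sketch of how entries of the four suspect lengths might be separated from the rest. It assumes one entry per text line and that the quoted lengths exclude the line terminator; neither assumption is confirmed by the slides, and the function names are illustrative.

```python
from collections import Counter

# Lengths the slide identifies as misformatted entries.
BAD_LENGTHS = {102, 103, 104, 105}


def split_by_length(lines):
    """Return (good, bad) records; ASSUMPTION: lengths exclude CR/LF."""
    good, bad = [], []
    for line in lines:
        record = line.rstrip("\r\n")
        (bad if len(record) in BAD_LENGTHS else good).append(record)
    return good, bad


def length_histogram(lines):
    """Count records by length, a quick way to spot unexpected sizes."""
    return Counter(len(line.rstrip("\r\n")) for line in lines)
```

A histogram of record lengths is also a cheap accounting control: any length other than the expected one and the four known bad ones would point to a further format problem, such as the truncated records mentioned on the slide.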

  11. The Processing • Accounting Controls • Reconciliation of Format • Correction of Source Data • Analytical Programs

  12. The Results • Overall Circulation Transactions • Circulation Transactions by Years • Frequencies for Various Other Fields • Circulation by Type of Bar Code • Circulation by Type of User • The J-Shaped Curves • Circulation by Types of Materials • Extremely High Numbers of Transactions • Differences for Types of Users • Frequency by Type of Borrower • Type of Borrower by Type of File • Effects of Electronic Access

  13. Overall Circulation Transactions • There were 10,527,515 transactions recorded during the seven years but they involved only 1,825,384 items from the collection. • That is an average of 5.8 transactions per item. But, of those transactions, only 4,292,798 were circulations; the rest were largely renewals (although there were 326,386 other kinds of transaction, such as holds, fines, etc.). • And the circulations involved only 1,662,551 items. Thus there were 162,833 items that had no circulations during the period but instead involved renewals of items that presumably had circulated in some year prior to the renewal (or other non-circulation transactions). • There were 9,044 transactions recorded for years 1988 through 1996; they involved 8,233 items, but those are not sufficient to account for those zero circulation items. • Given that the size of the UCLA Library collection during this period was between 7.0 and 7.5 million, about 25% of the collection involved circulation systems transactions.
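The arithmetic behind these figures can be checked with a short sketch. All inputs are the numbers quoted on the slide; the renewal count is not stated there and is derived here as the remainder, which is an assumption about how "the rest" breaks down.

```python
total_transactions = 10_527_515
items_with_transactions = 1_825_384
circulations = 4_292_798
other_transactions = 326_386      # holds, fines, etc.
items_with_circulations = 1_662_551

# Average transactions per item (the slide rounds this to 5.8).
avg_per_item = total_transactions / items_with_transactions

# DERIVED, not quoted: renewals as what remains after circulations and "other".
renewals = total_transactions - circulations - other_transactions

# Items that appear in the data but never with a circulation.
items_without_circulation = items_with_transactions - items_with_circulations

# Share of the collection touched, for the 7.0 to 7.5 million range.
share_low = items_with_transactions / 7_500_000
share_high = items_with_transactions / 7_000_000

print(round(avg_per_item, 1), renewals, items_without_circulation)
print(f"{share_low:.1%} to {share_high:.1%}")
```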

  14. Circulation Transactions by Year • The following table shows the total number of circulation transactions by year (as recorded in the transaction itself): • The transactions that reported the years from 1988 through 1996 would seem to be anomalous, and I have no way to account for them. They are included in all of the subsequent analyses, however.

  15. Circulation by Type of Bar Code • There are three types of entries in the source files: • (1) those with bar codes beginning “31158”, • (2) those with bar codes beginning “L 00”, and • (3) those with bar codes beginning with alphabetic characters other than “L”. • The first are bar codes created for the CLSI system and thus presumably represent items acquired before and during the time of operation of the CLSI system (i.e., prior to 1988). The second are for bar codes created during the operations of the subsequent Orion, Taos, and Voyager systems (i.e., subsequent to 1987). The third are for items held by the SRLF, and I have no easy way of identifying the time period for acquisition of those items, either by the UCLA Library or by assignment to SRLF.
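The three-way split can be sketched as a simple prefix test. The function name, the returned labels, and the "unclassified" fallback are illustrative assumptions; the prefixes are the ones given on the slide.

```python
def classify_barcode(barcode: str) -> str:
    """Assign a bar code to one of the three types (hypothetical helper)."""
    if barcode.startswith("31158"):
        return "31158"          # CLSI-era bar codes
    if barcode.startswith("L 00"):
        return "L 00"           # Orion, Taos, and Voyager bar codes
    if barcode[:1].isalpha() and not barcode.startswith("L"):
        return "SRLF"           # alphabetic prefix other than "L"
    return "unclassified"       # ASSUMPTION: fallback for anything else
```

Under this reading, an entry such as `L 766656896`, one of the problem bar codes from the 1997-1999 files, falls into the fallback group rather than any of the three types.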

  16. Aside on SRLF • The Southern Regional Library Facility (SRLF) is one of two “depository facilities” established by the University of California for the purpose of storing “seldom used materials” from the 9 campuses of the University of California and the 23 campuses of the California State University. (The other, of course, is the Northern Regional Library Facility.) The SRLF is located physically on the UCLA campus. • The process for determining what materials are removed from the campus libraries and placed in the SRLF (because they appear to be seldom used materials) is designed to assure that the needs of faculty and students will continue to be effectively met. In particular, there are means for assessments by the users as well as by the library staff of the potential utility of materials that appear to be candidates for assignment to the SRLF. • This is especially important because the primary basis for assessing what are seldom used materials is the data on circulation, since there is at best sparse data on which to assess total use, including in-house use. • However, although in principle the regional library facilities are intended to house seldom used materials, and that is their predominant use, they are also used for other categories of material (such as portions of special collections).

  17. In 1988, the UCLA entry in the ARL statistics showed total holdings of 6.0 million volumes, so I am going to take that as the potential total population of 31158 items. The question is the extent to which those have been assigned to SRLF. The total assignments to SRLF appear to be about 2 million items; a reasonable assumption could be that all of them are 31158 items, and I will so assume. Given that assumption, the following is the distribution of holdings, circulation transactions and related items among the three types of items: It is of some interest to note that the results show a substantially greater frequency (about 1/3 greater in fact) of circulation for materials at SRLF than for those in the general collection with bar codes starting 31158. And any departure from the assumption that all of the SRLF items are 31158 items would simply decrease the ratios for the 31158 holdings. In any event, the crucial point here is that, on the assumption that older items would be less used than newer ones, one would expect the 31158 items to be less circulated than L 00 items, and that indeed is the case. On the assumption that items are assigned to SRLF because they are expected to be less used items, one would expect them to be less used than either 31158 items or L 00 items, and that is not the case.

  18. Conjecture: Circulation Use Plus In-House Use • However, there is a fact and a related conjecture that may not only explain what is happening here but provide some insight into several other issues. • The fact is that SRLF is a closed collection, so that any use of materials from it requires a circulation. In contrast, the materials stored elsewhere (with the exceptions of various special collections) are open stack so that in-house uses of them will not entail a circulation. • The conjecture, therefore, is that the circulation of SRLF materials represents a combination of circulation uses and in-house uses. • In general, my rule of thumb is that in-house uses are twice the number of circulation uses of materials. Of course, there is no reason to expect that such a ratio would apply to the in-house use of SRLF materials, but assuming that it did, the data become quite consistent with the rationale for storing materials at SRLF. • One of the other issues this may provide insight into is represented by the following table:

  19. For the moment, let’s assume that the mix of Books and Journals is the same for all three types, that the average total use is three times circulation use, and that the ratios of Book to Journal circulations were similar between 31158 and SRLF. • Then, the SRLF uses (which total 410,600) would represent 136,867 circulation uses (i.e., 1/3 of 410,600). They would have been divided, given that the ratio of Journal circulation to Book circulation is 0.113 (i.e., 70555/625044), between 122970 Book circulations and 13869 Journal circulations. • And that implies that in-house use of Books would be 1.6 times circulation use of Books, while in-house use of Journals would be 5.5 times circulation use of Journals. • This at least provides a working hypothesis for examination of in-house uses of Books and Journals.
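The split of SRLF uses can be retraced with a few lines. This sketch uses only the figures quoted on the slide and its assumption that total use is three times circulation use. The variable names are illustrative, and the journal share is computed as the remainder after the book share, so the results may differ by a few units from the quoted figures depending on rounding.

```python
srlf_uses = 410_600
total_to_circulation = 3                         # total use = 3 x circulation use
journal_to_book = round(70_555 / 625_044, 3)     # ratio as rounded on the slide

# Circulation-equivalent uses at SRLF under the 3:1 assumption.
srlf_circulation_uses = srlf_uses / total_to_circulation

# Split so that journals / books equals the journal-to-book ratio.
book_circulations = srlf_circulation_uses / (1 + journal_to_book)
journal_circulations = srlf_circulation_uses - book_circulations

print(round(srlf_circulation_uses), round(book_circulations), round(journal_circulations))
```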

  20. Circulation by Type of User • Since I will be focusing on the different patterns of circulation use of the collection by various types of users, let’s look first at the beginning of the distributions of frequencies across the types of users: • The graph of these distributions is a useful means for picturing them:

  21. It is especially interesting to note how rapidly the graph for “Dept” drops. I have been told that the “Dept” data represent transactions by the several branch libraries, primarily to identify items that are, for one reason or another, removed from circulation. It is therefore not surprising to see that effect.

  22. The J-Shaped Curves • Of course, as is to be expected, the frequency of circulation of individual items follows a J-shaped curve. And there are several such frequency distributions that I have examined. I will not here bore you with the description of the array of distributions, but I will simply illustrate with the most overriding of them, the overall total frequency distribution. • I have found, in prior analyses of these kinds of data, that they can be well represented by a “mixture of Poisson distributions”, and I have an algorithm that I use to estimate the nature of such a mixture for a given set of data.

  23. Mixture of Poisson Distributions • A "mixture of Poisson distributions" is characterized by a set of parameters {n(j), m(j)}, where n(j) is the number of items in component j, and m(j) is the "a priori expected frequency" with which an item in component j will circulate during a given time period. This mixture leads to a frequency distribution based on the following formulation:

$$F_j(k) = n(j)\, e^{-m(j)}\, \frac{m(j)^k}{k!}, \qquad k = 0, 1, 2, \ldots$$

$$F(k) = \sum_{j=0}^{P} F_j(k)$$

where P+1 is the number of components, and F(k) is the number of volumes that the model predicts will circulate exactly k times in the given time period.

  24. Mixture of Poisson Distributions • An equivalent formulation is

$$F_j(0) = n(j)\, e^{-m(j)}$$

$$F_j(k) = F_j(k-1)\, \frac{m(j)}{k}, \qquad k = 1, 2, \ldots$$

$$F(k) = \sum_{j=0}^{P} F_j(k)$$

which is useful since it avoids the problems of factorials for large values of k.
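A minimal sketch of the recurrence form, in which each component starts from n·e^(−m) at k = 0 and each later term is the previous one times m/k. The function names are illustrative. The sample parameters (1,491,892 items, mean 1.015) are the group 2 values quoted later in the presentation; because the mean is quoted to three decimals, the result only approximates the tabulated figures.

```python
import math


def poisson_component(n, m, max_k):
    """Expected item counts F_j(0..max_k) for one component, via the recurrence."""
    freqs = [n * math.exp(-m)]                 # F_j(0) = n * e^(-m)
    for k in range(1, max_k + 1):
        freqs.append(freqs[-1] * m / k)        # F_j(k) = F_j(k-1) * m / k
    return freqs


def mixture(components, max_k):
    """Sum the component frequencies; components is a list of (n, m) pairs."""
    total = [0.0] * (max_k + 1)
    for n, m in components:
        for k, f in enumerate(poisson_component(n, m, max_k)):
            total[k] += f
    return total


group2 = poisson_component(1_491_892, 1.015, 5)
print([round(f) for f in group2[:2]])
```

The recurrence never forms k! or m^k explicitly, so it stays within floating-point range even for items with hundreds of circulations.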

  25. Algorithm for Estimating a Mixture • Let D(k) be a distribution, where D(k) is the number of items, out of a total of N, that occur exactly k times, k varying from 0 to L. Let {Nj, Mj} be a mixture of Poisson distributions, j = 0 to P, with Mj < Mj+1. The mixture is intended to provide an initial approximation to the given distribution D(k). • Calculate, summing over all k for which D(k) is not suppressed (D(k) being suppressed when it is unknown or, perhaps, when k = 0):

$$N_j' = N_j + \sum_{k=0}^{L} F_j(k)\, \frac{D(k) - F(k)}{D(k)}$$

$$R_j' = R_j + \sum_{k=0}^{L} F_j(k)\, k\, \frac{D(k) - F(k)}{D(k)}$$

$$M_j' = \frac{R_j'}{N_j'}$$

• Iterate, replacing {Nj, Mj} by {Nj', Mj'} until a desired degree of convergence is reached. • The following is a sample from the beginning of the overall frequency distribution and the mixture of Poisson distributions that I derived by applying that algorithm:
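One pass of this update can be sketched as follows. This is an illustration under stated assumptions, not the author's program: the slide does not define Rj, so the sketch takes Rj = Nj·Mj (the expected total circulations in component j); it treats any k with D(k) = 0 as suppressed to avoid dividing by zero; and it uses a fixed number of passes in place of a convergence test. All function names are hypothetical.

```python
import math


def component_freqs(n, m, max_k):
    """Poisson frequencies F_j(0..max_k) for one component (n items, mean m)."""
    freqs = [n * math.exp(-m)]
    for k in range(1, max_k + 1):
        freqs.append(freqs[-1] * m / k)
    return freqs


def update_step(components, data, suppressed=frozenset()):
    """One update of every (N_j, M_j); data[k] is D(k) for k = 0..L."""
    max_k = len(data) - 1
    per_comp = [component_freqs(n, m, max_k) for n, m in components]
    mix = [sum(fj[k] for fj in per_comp) for k in range(max_k + 1)]
    # ASSUMPTION: k with D(k) == 0 is skipped, like an explicitly suppressed k.
    usable = [k for k in range(max_k + 1) if k not in suppressed and data[k] > 0]
    updated = []
    for (n, m), fj in zip(components, per_comp):
        r = n * m  # ASSUMPTION: R_j = N_j * M_j; the slide leaves R_j undefined
        n_new = n + sum(fj[k] * (data[k] - mix[k]) / data[k] for k in usable)
        r_new = r + sum(fj[k] * k * (data[k] - mix[k]) / data[k] for k in usable)
        updated.append((n_new, r_new / n_new))
    return updated


def fit(components, data, suppressed=frozenset(), iterations=100):
    """Repeat the update a fixed number of times (no convergence test)."""
    for _ in range(iterations):
        components = update_step(components, data, suppressed)
    return components
```

When the mixture already matches the data, every correction term is zero and the parameters are left unchanged, which is the fixed-point behavior the iteration relies on. A production version would also guard against a component size being driven to zero or below.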

  26. This probably requires some explanation. The first column is the number of circulations; the second column is the data for the actual frequency with which items have a given number of circulations; the third column is the sum of the final six columns and is the mixture of those six Poisson distributions. In the final six columns, the second row is the number of items in each group; the third row is the average frequency with which items in the groups will circulate; the remaining rows show the expected frequency of circulation for items in each group. For example, group 2 contains 1,491,892 items with an average circulation of 1.015. Its Poisson distribution estimates that 540,614 of the items in group 2 will not circulate, that 548,775 will circulate once, etc. The frequencies for group 6, of course, do not begin to appear until 14 circulations.

  27. I think you will agree that the match between the actual Data (D) and the Mixture (F) is reasonably close. • It is important to note that the mixture provides an estimate of the number of items that do not circulate, even though there are definite probabilities for them to do so, and thus allows us to project the data from those that do circulate to virtually the full collection. Indeed, the total of items in the six groups is 7,411,064, to be compared with the actual collection of 7.5 million. It is estimated that 5,750,426 will not circulate in a seven-year period. • What does this all say? Given that the UCLA collection during the seven-year period in which these circulation transactions occurred was between 7 million and 7.5 million items (volumes), as reported to the ARL, the mixture essentially accounts for that collection. The average frequency of 0.033 in seven years for group 1 might entail many future seven-year periods before a given item in it is circulated, but still that probability is not zero.

  28. Circulation by Types of Materials • Before pursuing the issues related to the types of users, there are some data concerning the types of material that I would like to comment on. The following table shows the number of transactions for them: • A brief explanation: Books are simply that; Journals are both bound and unbound; x is unknown, for a variety of reasons; Eqpt is equipment, of a variety of kinds; Music is a variety of forms for musical materials; AV is audio-visual materials; Special is materials, such as manuscripts, from special collections; Microform is a variety of microforms; and Computer is computer related materials, such as CDs.

  29. The significant point to me is the ratio of Book circulation transactions to Journal transactions. Books constitute just about 90% of the circulation transactions for the Books and Journals taken together. But within the collection, Books and Journals are probably close to equal in number of volumes, and in the acquisition budget books are probably less than 25% of the budget. • Now, it must be said that circulation data are not “use” data and that Journals may well be more heavily used in-house than in circulation. Indeed, that possibility is potentially supported by the working hypothesis discussed above at the conclusion of the discussion of Circulation by Type of Bar Code. To recall, the working hypothesis is that in-house use of Journals might be 5.5 times circulation use, while in-house use of Books might be 1.6 times their circulation use. • Still, though, the difference between Books and Journals here is quite dramatic.

  30. Extremely High Numbers of Transactions • Another set of data I would like to comment on, before pursuing further issues related to the types of users, concerns the items that have extremely high numbers of transactions. There are 4490 items each of which had more than 100 circulation transactions during the seven year period, for a total of 771,057 transactions, an average of 172 each. But beyond that, there were items with over 2000 transactions in the seven years! Now, the great bulk of those transactions were renewals. Indeed, there were only 819 items with circulations and, for them, there were 143,401 circulations, an average of 175 per item. • I have not yet pursued analysis of the data concerning the items that entail such high numbers of transactions, but eventually I may do so simply as a matter of curiosity.

  31. Differences for Types of Users • I am now going to turn to what, for me, is a most important set of results. I am going to focus on three groups of users: Faculty, Graduate, and Undergraduate. It seems to me that they represent three different mixes of objectives. I assume that, for faculty, the dominant objective in borrowing materials from the library is support to research; that, for Undergraduates, the dominant objective is support to instruction; and that, for Graduates, the objectives of support to instruction and research are each present, though support to research is likely to predominate. • In order to compare the relative use of material by these three types of users, I will normalize by the respective sizes of the populations. Of course, the actual sizes of these three populations varied during the seven years and even within each year. So for simplicity, I will take the number of Faculty as 1,800, the number of Graduate students as 11,000, and the number of Undergraduate students as 22,000. Obviously, these numbers are not exact, but I think they are close enough for the immediate qualitative purposes. In any event, it would be trivially easy to change those estimates for size of population; doing so for any reasonable replacements would not change the qualitative picture I will present. • I will present two pictures. The first is the overall, and the second compares sets of items based on a priori expectations of circulation.

  32. Frequency by Type of Borrower • I will now focus on Circulation transactions. This graph shows the frequency per borrower (i.e., data for each type of borrower is normalized by the number of borrowers of that type) as a ratio to the frequency for undergraduates. • The rate of decline for Faculty and Graduate was such that it seemed to be logarithmic, and indeed it was so, as clearly shown in the second graph. • In any event, the picture is clear, I think. The less frequently materials are circulated, the more likely that they will be circulated to Faculty and, to a lesser degree, to Graduate students. This implies to me that the less frequently used materials are used for research purposes rather than instructional purposes. • On the surface of it, that would seem to be self-evident, and indeed that was my reason for conjecturing such a relationship. And the data clearly support that conjecture.
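The normalization described here can be sketched in a few lines. The population sizes are the round figures the presentation adopts (1,800 Faculty, 11,000 Graduate, 22,000 Undergraduate); the circulation counts in the usage example are invented purely for illustration, and the function name is hypothetical.

```python
# Round population sizes adopted in the presentation.
POPULATION = {"Faculty": 1_800, "Graduate": 11_000, "Undergraduate": 22_000}


def per_capita_ratio(circulations, population=POPULATION, baseline="Undergraduate"):
    """Circulations per borrower for each group, as a ratio to the baseline group."""
    per_capita = {group: circulations[group] / population[group]
                  for group in circulations}
    base = per_capita[baseline]
    return {group: value / base for group, value in per_capita.items()}


# HYPOTHETICAL counts for one frequency class; not data from the study.
example = {"Faculty": 900, "Graduate": 2_200, "Undergraduate": 2_200}
print(per_capita_ratio(example))
```

Because the result is a ratio to the undergraduate rate, changing all three population estimates by the same factor leaves it untouched; only their relative sizes matter.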

  33. Type of Borrower by Type of File • Indeed, the conjecture is even clearer when we examine their respective uses of the three types of file. • The patterns related to the frequencies of circulation are even more dramatic:

  34. These are items that are used on the average once a year or less during the seven year period. Note that the use per Faculty member is consistently substantially greater than that by individual students. Note the dramatic difference in the use of SRLF items. Further note the similarity in the use of 31158 and L 00 items. The implication to me is that the criteria for selection of materials for assignment to SRLF appear to be quite good.

  35. Potential Effect of Electronic Access • What might be the effect of online access to electronic journals by UCLA Library users? The issue at hand is the extent to which data about the online use might be correlated with the data for circulation use. • I conjectured that the circulation data could show an identifiable DECREASE in circulation use for materials that potentially were available online. I therefore ran the following analysis: • (1) The circulation barcodes clearly show a likely year in which the related materials were acquired, at least for the past four years. • (2) Based on that fact, I accumulated the data for frequency of circulation for books, on the one hand, and for journals, on the other. • (3) I then allocated those circulation data to the years in which they occurred. • (4) I then calculated the average circulation (for books and, separately, for journals) per year for the years during which each barcode showed circulation. • (5) I then calculated the ratio each year for the average circulation of journals to that for books. • The results are, I think, quite revealing:
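Steps (3) through (5) can be sketched as below, under one possible reading of step (4): for each year and material type, the average is the number of circulations divided by the number of distinct barcodes that circulated in that year. Step (1), inferring acquisition year from the barcode, is not attempted. The record layout, the function name, and the sample records are all hypothetical.

```python
from collections import defaultdict


def journal_to_book_ratio(circulations):
    """circulations: iterable of (barcode, material, year), one per circulation."""
    counts = defaultdict(int)       # (material, year) -> number of circulations
    barcodes = defaultdict(set)     # (material, year) -> distinct barcodes seen
    for barcode, material, year in circulations:
        counts[(material, year)] += 1
        barcodes[(material, year)].add(barcode)

    ratios = {}
    for year in sorted({year for _, year in counts}):
        averages = {}
        for material in ("Book", "Journal"):
            seen = barcodes.get((material, year))
            if seen:
                # ASSUMPTION: average over barcodes that circulated that year.
                averages[material] = counts[(material, year)] / len(seen)
        if len(averages) == 2:      # a ratio needs both materials in that year
            ratios[year] = averages["Journal"] / averages["Book"]
    return ratios


# HYPOTHETICAL records, for illustration only.
sample = [
    ("B1", "Book", 2000), ("B1", "Book", 2000),
    ("B2", "Book", 2000), ("B2", "Book", 2000),
    ("J1", "Journal", 2000),
    ("B1", "Book", 2001),
]
print(journal_to_book_ratio(sample))
```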

  36. The picture is even more revealing, even startling, in the context of the historical pattern: • The SRLF ratio I think reflects the combination of in-house use with circulation use. • The steady growth in the ratio of Journal Circulations to Book Circulations (from the pre-1988 materials, labeled 31158, through 1999) followed by the dramatic drop in 2000 and the steady decline thereafter certainly suggests that something happened between 1999 and 2000.

  37. A crucial point is that, while the year to year use of journals acquired in years 2000 to 2003 showed a definite pattern of decrease each year after acquisition, the comparable use of books was relatively even. • I think one can reasonably conclude that the online availability of recently acquired journals has resulted in less circulation of the print copies, and measurably so. In fact, the level of use in 2003 has decreased to half of what it had, on the average, been in 1997 - 1999. • There is much that I would like to do in correlating the data for electronic use of materials with the data at hand for circulation use of print materials, but to do so would require access to such electronic use data. Which, alas, I do not have.

  38. Conclusions • My primary conclusion and the only one I will discuss here is that items with a low frequency of circulation are, despite that low frequency and perhaps even because of it, of great importance, especially for research purposes. • In prior studies, the Pittsburgh study of some decades ago being the prime example, the fact that there were “little used materials” in a research library was presented as a serious problem, even a deficiency. • I have long felt that such was an erroneous conclusion, so the opportunity given me to examine the “little used materials” in the UCLA collection has made it possible for me to explore their nature. • I started with the conjecture that little used materials indeed were valuable, and I think the results I have presented here support that conclusion.
