THE RESEARCH COMPONENT OF A MODEL INFORMATION TECHNOLOGY (IT) COLLEGE

William Mitchell (PI), R.H. CyberCollege of Arkansas

Nicholas Karlson (co-PI), Department of Information Science

Ningning Wu (co-PI), Department of Information Science

University of Arkansas at Little Rock

Goals of the Project

The broad goal of the project is to isolate the environmental and behavioral factors that inhibit or increase the success of women and minorities in IT-related majors at UALR (success being defined as graduating and obtaining jobs in IT-related fields). The project outcomes will include a better understanding of which data sources reveal these factors, a model database that organizes the relevant data, and techniques for mining the database to:

· test conclusions derived from demographic and pedagogic research from other studies.

· discover new associations and clusters of experience and behavior patterns that will suggest research in new directions.

· provide data on the IT students of Arkansas and on the efficacy of the various strategies that the CyberCollege has implemented to achieve its mission.

· prototype recruitment and retention databases that could be installed in other IT colleges.

Once built, this dynamic data model will be sustained within the college. The data that it collects and the analysis that it supports will provide insight into the longitudinal problem of maintaining the national supply of college-trained professionals in the IT workforce.

Additional goals of the project (along the above lines) include the identification of factors that indicate an at-risk student. This information would be useful for planning intervention programs to prevent student dropout. The project will require a set or subset of the following activities: database creation, data warehousing, data mining, statistical surveys, and web-based data collection (using a database-backed website).

Main Activities to Date

Decision Tree Algorithm Discussion

A basic goal of this project is to analyze data in the hope of uncovering factors that affect the success of women and minority students in the CyberCollege (and, by extension, in other higher education IT programs). We have therefore been assembling data, data analysis methodologies, and data analysis tools.

A. Data: We have within the college and the university an enterprise database and several special databases.  We are also in the process of using the web to collect data from students.  Institutional Research has shared a multi-year cohort sample that another faculty taskforce has been studying.

B. Data analysis methodologies: We have said that we will use data mining techniques with added statistical techniques as needed.  However, we have also explored building a model using Stella, a package used for dynamic modeling in the Systems Engineering Department.   Dr. Mitchell has prototyped an interactive retention model based on historical enrollment data of the Systems Engineering Department.

C. Data analysis tools: We have proposed to develop our own tools in Python, with modules written in Python, C, and C++, because doing so gives us more control over the tools and a better understanding of them. Our review of some publicly accessible tools (for example, Weka, which is written in Java) suggests that they may be slower than we would like. We will nonetheless follow the general practice of employing multiple tools in data analysis. Another good package that is ready to use is R. R has many useful libraries written in C, but its programming language is not as elegant as Python. In sum, we have focused on the speed of C, the elegance of Python, and verification with additional tools to ensure proper data analysis.

Software and data have been implemented on the Linux platform. The content management system Zwiki (which is developed using Zope) has been installed and configured for group work on the project. A benefit of starting with Linux will be a reasonably easy port to Solaris if needed (the UALR Information Science Department has a number of powerful computers with this OS). A significant data set covering the years 1992 to 2002 and thousands of students enrolled at UALR has been obtained with the help of UALR’s Institutional Research. The data set includes demographic data, test scores, and student withdrawals. It is extensive and rich, and it is in the process of being cleaned for our use. We have been working to integrate the well-known C4.5 decision tree algorithm with the Python programming language and the MySQL database management system. We have also worked on web-based data collection software using the Perl programming language, MySQL, and the Apache web server.
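As an illustration of the kind of computation C4.5 performs when it chooses an attribute to split on, the following is a minimal sketch of the gain ratio criterion for categorical attributes. It is not the project's implementation: the records are in-memory dictionaries rather than MySQL rows, and the field names and values (major, math, outcome) are invented for the example.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def gain_ratio(rows, attribute, target):
    """Gain ratio of splitting rows (a list of dicts) on a categorical attribute."""
    total = len(rows)
    # Group the target labels by the value of the candidate attribute.
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attribute], []).append(row[target])
    # Information gain: entropy before the split minus weighted entropy after it.
    remainder = sum(len(p) / total * entropy(p) for p in partitions.values())
    gain = entropy([row[target] for row in rows]) - remainder
    # Split information penalizes attributes with many distinct values.
    split_info = entropy([row[attribute] for row in rows])
    return gain / split_info if split_info > 0 else 0.0

# Hypothetical records; field names and values are assumptions for illustration.
students = [
    {"major": "undecided", "math": "remedial", "outcome": "left"},
    {"major": "undecided", "math": "college", "outcome": "left"},
    {"major": "declared", "math": "remedial", "outcome": "stayed"},
    {"major": "declared", "math": "college", "outcome": "stayed"},
]

for attribute in ("major", "math"):
    print(attribute, gain_ratio(students, attribute, "outcome"))
```

In this toy data the attribute that perfectly separates the outcomes receives the higher gain ratio, so a tree builder would split on it first. A full C4.5 implementation also handles continuous attributes, missing values, and pruning, none of which are shown here.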

Main Results to Date

So far we have successfully implemented the C4.5 algorithm with Python and MySQL under Linux. Four students have been involved in installing and refining an interactive web-based data collection system. Some of the techniques used in this system are discussed in the paper "E-MAIL PROGRAMMING WITH PERL", The Journal of Computing Sciences in Colleges, vol. 18 (April 2003), pp. 86-93, written by Dr. Nicholas Karlson and Charles Reynolds (an undergraduate research assistant working on the project). We have obtained a significant data set, as mentioned above, which we are loading into MySQL.

Dr. Mitchell has been collecting and analyzing grade data and developing an attrition model for the CyberCollege. He has cross-validated the data extracted from final grade reports with data collected by the Office of Institutional Research. He has focused on longevity patterns of enrollment, the self-described major intention of students in CyberCollege courses, and prerequisite observance. The most notable finding so far is the significantly higher attrition rate of students who declare themselves undecided in freshman courses across all the programs of the CyberCollege. This pattern is documented in eight consecutive semesters of data.

Dr. Mitchell has developed the attrition model in Stella, a dynamic modeling package. Based on the historical data, the model incorporates entering mathematics level and persistence ratios for major courses each year. The model allows department chairs to experiment with the predicted effect on graduation numbers of different mixes of freshman competence, different sizes of freshman classes, and different numbers of transfers. A paper on the model is being prepared for a statewide conference on educational technology.
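The idea behind a persistence-ratio model can be sketched in a few lines of Python. This is not the Stella model: it is a simplified illustration that assumes each entering mathematics level has its own list of year-to-year persistence ratios and that transfers enter partway through the program. All names and numbers below are hypothetical.

```python
def projected_graduates(freshmen, persistence, transfers=0, transfer_persistence=()):
    """Project the number of graduates from one entering class.

    freshmen: dict mapping an entering math level to a head count.
    persistence: dict mapping a math level to a list of yearly persistence
        ratios (the fraction of students who continue after each year).
    transfers: number of transfer students joining partway through.
    transfer_persistence: yearly ratios for the years transfers still face.
    """
    graduates = 0.0
    for level, count in freshmen.items():
        remaining = float(count)
        # Apply each year's persistence ratio in turn.
        for ratio in persistence[level]:
            remaining *= ratio
        graduates += remaining
    remaining = float(transfers)
    for ratio in transfer_persistence:
        remaining *= ratio
    return graduates + remaining

# Hypothetical scenario; the levels, counts, and ratios are invented.
freshmen = {"calculus-ready": 40, "precalculus": 60}
persistence = {
    "calculus-ready": [0.75, 0.5],
    "precalculus": [0.5, 0.5],
}
print(projected_graduates(freshmen, persistence, transfers=10,
                          transfer_persistence=[0.5]))
```

Changing the head counts in freshmen or the value of transfers and rerunning the function mimics the kind of what-if experiment described above, although a dynamic model can also represent feedback and timing effects that this sketch ignores.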

Next Steps Planned

The next step in this project is to start analyzing data with an expanding library of data mining and statistical tools. Besides the C4.5 tool, more statistically based classifiers, such as instance-based classifiers, will be considered. Instance-based classifiers have an intuitive appeal: they categorize new cases by using the past results of similar cases.
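One common instance-based classifier is k-nearest neighbors, which is used here only as a representative example; the project has not committed to it. The minimal sketch below assumes numeric features that have already been put on comparable scales, and the feature meanings and labels are hypothetical.

```python
from collections import Counter
from math import dist  # Euclidean distance, available in Python 3.8+

def knn_classify(training, query, k=3):
    """Label a new case by majority vote among its k nearest past cases.

    training: list of (feature_tuple, label) pairs.
    query: feature tuple of the case to classify.
    """
    # Sort past cases by distance to the query and keep the k closest.
    nearest = sorted(training, key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical past cases: (scaled test score, scaled first-term GPA) -> outcome.
past_cases = [
    ((0.0, 0.0), "left"),
    ((0.0, 1.0), "left"),
    ((5.0, 5.0), "stayed"),
    ((6.0, 5.0), "stayed"),
    ((5.0, 6.0), "stayed"),
]

print(knn_classify(past_cases, (0.2, 0.1)))
print(knn_classify(past_cases, (5.0, 5.5)))
```

Because the method stores past cases rather than building an explicit model, its results depend heavily on how the features are scaled and on the choice of k.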

Key Open Research Questions

Most research questions are still open because data analysis is in its early stages. As we move further into the data analysis phase, answers to the issues posed under Goals of the Project will begin to take shape.