Archive for the ‘Problems’ Category

Correlation 101
Wednesday, November 2nd, 2005

I’m using this at work, and I’m having to explain it a lot, so I’m writing this entry to help myself clarify and simplify my personal understanding of this principle, as well as to have a place to send people who ask.

Pearson’s Correlation Coefficient (r, for samples) is the degree and direction of relation between two variables and ranges from perfect positive correlation (1.0) through no correlation (0.0) to perfect negative correlation (-1.0). If the correlation is positive, the first variable increases or decreases as the second increases or decreases. If the correlation is negative, the first variable increases as the second variable decreases and vice-versa.

Correlation can be understood, to an extent, by rank order of the first and second variables. If Johnny and Jane score 75% and 80% respectively on a pretest, and then 85% and 95% respectively on the post test, r would be a perfect 1.0 because Johnny’s score was lower in both cases. This would hold true even if Johnny scores 94% (still lower than Jane’s) on the post test, but not if his post test score exceeds Jane’s. At that point, r is a perfect -1.0. With only two cases, r will always be 1.0 or -1.0, but of course, such clear cases are rare.
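The two-case examples above can be checked with a few lines of Python. This is only a sketch: pearson_r is my own illustrative helper, not a library function, and it uses the textbook formula (the summed cross-products of deviations divided by the square root of the product of the summed squared deviations).

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations from each mean.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

pretest = [75, 80]  # Johnny, Jane

# Johnny stays below Jane on the post test: positive correlation.
print(round(pearson_r(pretest, [85, 95]), 6))
print(round(pearson_r(pretest, [94, 95]), 6))

# Johnny overtakes Jane (96 is a made-up score for illustration): negative.
print(round(pearson_r(pretest, [96, 95]), 6))
```

With only two cases the size of the gap never matters, only who is ahead, which is why the result flips straight from 1.0 to -1.0.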

Extreme r can be visualized by plotting a line for each case from its first variable value to its second variable value. Here are three graphs for simplistic 1.0, 0.0, and -1.0 correlations. It’s a bit more difficult to spot non-extreme correlations with large numbers of cases:
166 cases, r= .71
166 cases, r= -.70
166 cases, r= -.13

For a real-world example, we looked at two variables (intensity of study and score on admissions test) for 166 Middle East language students. We found there to be basically no correlation between the two variables (r=-.04). Here’s the graph, and a close up of the chaos.

Notes:

It’s cliché, but correlation does not mean causation. The sun coming up may correlate with an increase in sewer pressure, but the sun has little direct effect on this. All r indicates is how two variables relate to each other, not what causes this.

I once heard an M.D. on the news say, “80% of type-2 diabetics are obese, so it doesn’t take a Ph.D. to see the connection.” Well, actually, I guess it does. A Ph.D. (sh|w)ould point out that the two conditions may actually be caused by a third, present in both cases. (I’m not saying obesity doesn’t lead to diabetes, just that this statement was not a good argument for it.) Sadly, we, as a society are conditioned to think in terms of percentages, “grade levels”, etc. And very few of us are not convinced by their face validity.

Correlation cannot be determined if the sample size is 1, because the standard deviation of a single sample is 0 and to compute correlation, one must divide by the standard deviation, and one cannot divide by 0.

r can be tested for significance, but its square is often considered more important because it can be interpreted as the percentage of variance in the second variable that can be accounted for by the first (and vice-versa). Thus r² is called the coefficient of determination.
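To make the squaring concrete, here is a small Python sketch that converts the r values quoted earlier in this entry into shared variance. The function name variance_explained is just an illustrative label of mine.

```python
def variance_explained(r):
    """Coefficient of determination: the square of the correlation."""
    return r * r

# r values taken from the graphs and the real-world example in this entry.
examples = [
    ("166 cases, strong positive", 0.71),
    ("166 cases, strong negative", -0.70),
    ("166 cases, weak negative", -0.13),
    ("study intensity vs. admissions test", -0.04),
]

for label, r in examples:
    share = variance_explained(r)
    print(f"{label}: r = {r:+.2f}, r squared = {share:.4f}")
```

So the r = .71 graph corresponds to roughly half of the variance being shared, while the admissions-test example (r = -.04) shares well under one percent. The sign disappears when squaring, so r² says nothing about direction.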

Bad data is bad data
Friday, October 28th, 2005

Last Summer, the Social Studies Research Council, which is charged with evaluating Title VI-funded programs, decided they could do their job better if they had access to the EELIAS database. EELIAS, started during the 2000-2001 academic year, is a database in which all programs funded under Title VI are to log their enrollments, expenses, etc. It’s a good idea, but poorly executed.

First, they contracted out the creation and maintenance of the system to a private company. Now, whenever the DOE wants some data, they pay that company to produce views and aggregations. This company has it pretty sweet. There doesn’t ever seem to be any assurance of data integrity, and there is no oversight on the accuracy of the data as provided by the schools.

So, the SSRC wanted the data for what it was supposed to contain, but didn’t know what to do with it. As we are also interested in enrollment trends, we said we’d take a look at it for them. So, they handled the FOIA request for the data, and sent me a 115MB Oracle dump file.

After weeks of work, I got what I could out of it. Some schools listed sections of Arabic 101 under the language “Other”, while some reported multiple sections with zero enrollment. My first look at incoming grants listed the average grant amount as $1.70. (Some data entry people had mixed up the fields for number of grants and total amount.) Finally, I thought I had something to show, though it was of no use to us.

I went to New York to present what I had found, and the difficulties we had finding it, to the SSRC. I had a slide that said, “This data was never meant to be looked at.” It seemed that some committee had mandated that Title VI funds be accounted for, so they set up a database to record them, but didn’t build it to ever be used. Even the data I had was not acceptable to the SSRC, and I couldn’t blame them. Most of the issues were called out by an astute sociologist from the University of Chicago, who kept asking what the definitions of the terms were. Terms like “lecturer” mean different things from school to school, as does “enrollment”, and there was no clarification provided on the EELIAS form.

The SSRC was very gracious and thankful for my time. They asked me to write an appendix for their report detailing the shortcomings of this database. Such a chapter would be an ace up their sleeve should any opponent use EELIAS data against them.

Upon returning, we sent out Fall enrollments to some schools where we know people, just to see how accurate the data was. Only one out of the five schools found any errors. Then we realized that we had sent out the data with the wrong years. (The reports are filed after each academic year, so Fall 2000 would be filed during the Summer of 2001. The dates on the data were the dates the reports were filed, so Fall 2000 would have been indicated as Fall 2001.) Sure enough, the numbers from the one school that did find inaccuracies lined up much better once we fixed the years, but the fact that four out of five schools reported that the data was accurate when, in fact, it was not, demonstrated to me the source of much of EELIAS’ error: The schools didn’t care about entering accurate data the first time, and they didn’t care when we asked them to verify the accuracy.
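The year mix-up described above amounts to an off-by-one on Fall terms. A minimal Python sketch of the correction follows. The label format ("Fall 2001") and the function name are assumptions of mine, not the actual EELIAS layout, and I only shift Fall terms because that is the only case described here.

```python
def correct_term(filed_label):
    """Map a term labeled with the report's filing year to the real term.

    Assumes labels look like 'Fall 2001' (a hypothetical format).
    """
    season, year = filed_label.split()
    if season == "Fall":
        # A Fall term is reported the following Summer, so the filing
        # year is one later than the year the term actually started.
        return f"{season} {int(year) - 1}"
    # Other terms are left untouched; this entry only describes Fall.
    return filed_label

print(correct_term("Fall 2001"))
```

Had something like this been applied before the data went out, the schools would at least have been checking the right year.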

I wonder if it’s unethical to submit data for verification, but include some known inaccuracies as a check against the verification.

Why so quiet?
Saturday, November 29th, 2003

In case anyone out there wonders why this site has been lacking in updates, it’s simply because I’ve been very busy this semester. I’m taking a full load and working three part-time jobs, so you’ll understand.

To top it off, one of my new jobs has some really “sensitive” people. Trust me, as soon as I’m done there, my gripes section will be plein à craquer!

Otherwise, things here are good.

Crash Explorer
Monday, May 5th, 2003

Just in case you’re bored, here’s a simple way to crash MS Explorer 4 or later (on PC only):

Just put <input type> before the BODY tag.

Try it: http://clovis.byu.edu/crashIE.html

Day from |-| E |_ |_
Friday, February 28th, 2003

The last 24 hours have proven rather frustrating. It all started around 3PM yesterday when Justin received notice via his WindowsXP laptop that there was an IP address conflict between his and another system on the network. It didn’t last long, and he was back up and running after a minute or so, so we didn’t think much of it. Maybe it was a hiccup in the DHCP server.

The daily DVD for Arabic was especially important yesterday because there were listening activities on the DVD that were to be used on the midterm today. For some reason Marshall had not come in to author it by the time Brooke and I left to pick up Ben. I called from her parents’ house, and we decided we could stop by work for a half hour on the way home to get the DVD done.

Well, when we got there we found Zola, our main server, acting up. Now Zola is a rock-solid Linux server and every problem it has ever had has been a hardware problem. (These include a couple of painful months of a bad network card. BTW, if you swap out a NIC to see if that’s the problem, be sure to swap it out for a different brand.) These symptoms seemed to point to a problem with Apache, the web server software, but I knew better.

I got everything working before leaving late last night, but when I got home, I was not able to get any webpages from Zola. I couldn’t telnet either. So I telneted to Clovis, who sits next to Zola and is on the same switch. I could then telnet from Clovis to Zola and voilà! Even without restarting Apache, everything suddenly worked.

I had to go into work early to get the DVD to the TA before the 8AM class, so I decided I would work on it the next day (today). Well, after hurrying to catch the bus to be here on time, I delivered the DVD to the classroom only to have the TA say that she was giving her midterm tomorrow. To make matters worse, Zola was not talking to the network again.

I was getting really suspicious, so I ran a port scan on Zola. See if you can find what’s wrong with this output:

jeremy@marquez:~> nmap zola

Starting nmap V. 2.54BETA22 ( www.insecure.org/nmap/ )
Interesting ports on zola.byu.edu (128.187.176.26):
(The 1539 ports scanned but not shown below are in state: closed)
Port       State       Service
135/tcp    open        loc-srv
139/tcp    open        netbios-ssn
445/tcp    open        microsoft-ds

Nmap run completed -- 1 IP address (1 host up) scanned in 1 second
jeremy@marquez:~>

Notice the “microsoft-ds”? On a Linux box?
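Spotting this by eye works for three ports, but the same check can be sketched in Python. This is only an illustration: it assumes the table layout of the old nmap output shown above (newer versions may format things differently), and the set of "Windows-looking" ports is my own shortlist. A Linux box running Samba could legitimately have 139 and 445 open, so a hit is a hint, not proof.

```python
# Ports usually associated with Windows RPC and file sharing (my shortlist).
WINDOWS_PORTS = {135, 139, 445}

def open_ports(nmap_output):
    """Return (port, protocol, service) for each open port in the table."""
    ports = []
    for line in nmap_output.splitlines():
        fields = line.split()
        # Table rows look like: 445/tcp  open  microsoft-ds
        if len(fields) >= 3 and "/" in fields[0] and fields[1] == "open":
            number, _, proto = fields[0].partition("/")
            if number.isdigit():
                ports.append((int(number), proto, fields[2]))
    return ports

# The port table copied from the scan above.
SCAN = """\
Port       State       Service
135/tcp    open        loc-srv
139/tcp    open        netbios-ssn
445/tcp    open        microsoft-ds
"""

suspicious = [p for p in open_ports(SCAN) if p[0] in WINDOWS_PORTS]
for port, proto, service in suspicious:
    print(f"{port}/{proto} ({service}) looks like a Windows service")
```

Every open port in the scan lands on the shortlist, and none of the services you would expect on a Linux web server show up at all, which is what made the output so suspicious.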

Armed with this information, I fought my way through IT Service’s front line and got someone who knew what they were talking about. After a couple of minutes the guy asked my jack and room numbers. “Jack 1051 in room 110,” I said. “Oh. Hmmm. We’re showing Zola, with the correct IP address, connected to jack 2048 in room 210.”

Aha!

So after a few phone calls, we had the computer support rep (CSR) for the offending office open the door and disconnect the system. The faculty member said one of his students must have done it and that he would let them know that this was unacceptable. I sure hope he does more than that. Idiots.

In the meantime, I was trying to get a cheap wireless card to run on either my Win98 or Linux partition on Pepin. No luck.

Finally, I got a call that Zola was down again. Now, this is after everything that happened in the morning, so I cut class to see if I could fix it. Following some advice from the Unix Users Group, I ran /sbin/route and saw that somehow my default gateway had changed.

Finally, everything works the way it did before. However, I am now a day older with nothing but the experience to show for it.