In a post titled Data Mining – A Cautionary Tale, I made the case that data mining can be dangerous by telling the story of Cornell’s Brian Wansink, who has had multiple papers retracted due to various data mining methods that aren’t quite ethical (or even correct).
In the article, Gary writes of a time that noted physicist and Nobel Laureate Richard Feynman gave his class an exercise: determine the probability of seeing a specific license plate in the parking lot on the way into class (he gave them a specific example of a license plate). The students worked on the problem and determined that the probability was less than 1 in 17 million that Feynman would see that specific license plate.
According to Smith, what Feynman didn’t tell the students was that he had seen the specific license plate that morning in the parking lot before coming to class, so the probability was actually 1. Smith calls this the ‘Feynman Trap.’
Whether or not this story is true – I don’t recall ever reading it from Feynman directly, although he does have a quote about license plates – it’s a very good description of one of the dangers of data mining: knowing what the answer will be before starting the work. In other words, bias.
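For what it’s worth, the “1 in 17 million” figure is roughly what you’d get from a plate format of three letters followed by three digits – a quick sanity check (the plate format here is my assumption, not something stated in the story):

```python
# Sanity check on the "1 in 17 million" figure, assuming a hypothetical
# plate format of 3 letters followed by 3 digits.
letters, digits = 26, 10
combinations = letters**3 * digits**3

print(combinations)      # 17576000 possible plates
print(1 / combinations)  # probability of seeing one *specific* plate
```

About 17.6 million combinations, so the chance of any one specific plate is indeed less than 1 in 17 million – unless, of course, you’ve already seen it.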
Bias is everywhere in data science. Some say there are 8 types of bias (I’m not sure I completely agree with 8 as the number, but it’s as good a place to start as any). The key is knowing that bias exists, how it manifests and how to manage it. You have to manage your own bias as well as any bias inherent in the data you are analyzing. Bias is hard to overcome, but knowing it exists makes it easier to manage.
The Data Mining Trap
The ‘Feynman Trap’ (i.e., bias) is a really good thing to keep in mind whenever you do any data analysis. Thinking back to the story shared in Data Mining – A Cautionary Tale about Dr. Wansink: he was absolutely biased in just about everything he did in the research that was retracted. He had an answer that he wanted to find and then found the data to support that answer.
There’s the trap. Rather than going into data analysis with questions and looking for data to help you find answers, you go into it with answers and try to find patterns to support your answer.
Don’t fall into the data mining trap. Keep an open mind, manage your bias and look for the answers. There’s nothing wrong with finding other questions (and answers) while data mining, but keep that bias in check and you’ll be on the right path to avoiding the trap.
In my doctoral research, I’ve been researching ways to improve knowledge capture and sharing methods, specifically within project teams, though the ideas can be disseminated around the organization.
One of the biggest issues I’ve found while working as a consultant is the amount of knowledge that I walk away with after a project is complete. Sure, I try to share this knowledge in every way possible but converting tacit (i.e., internal) knowledge to explicit (i.e., external) knowledge is one of the most difficult things to do.
Let’s assume though, that some portion of the knowledge that I hold in my head is converted into some form of writing at various periods throughout a consulting project. Where does that explicit knowledge live? In an email? In some document stored on a server? In a knowledge repository somewhere?
In the past, this problem has been attacked using centralized knowledge repository platforms. These systems require users to log in and ‘enter’ their knowledge into the system. Many of these platforms have been well built and some have been successfully used in organizations, but the success stories are far outweighed by the stories of KM repositories sitting idle and unused.
So…how can we get that tidbit of knowledge from my brain into some form of knowledge repository without me logging in and ‘entering’ it into the system?
Web 2.0 as knowledge repository
The use of Web 2.0 tools (blogs, IM, wikis, etc.) has become ubiquitous. If incorporated into a project environment, these tools might provide an easy and efficient method for capturing and sharing knowledge throughout project teams and project organizations.
The key to retrieving knowledge from these tools is to make the user experience as seamless as possible. For example, an employee creates a blog on an organization’s intranet and then uses this blog to write about different topics, some that pertain to her projects and some that don’t.
Perhaps this employee is participating in two projects within the organization, and she writes about topics that might be of interest to a portion of the organization and her project team members. While she writes about interesting topics and, at times, about her experiences on the projects she’s worked on, perhaps her blog posts aren’t widely read. This employee has attempted to convert a portion of her tacit knowledge to explicit knowledge, but few people on the project team or within the organization find this knowledge because it’s tucked away in the intranet site (which is rarely used anyway).
In the above scenario, knowledge was converted from tacit to explicit but few people are able to absorb this knowledge and make it their own (i.e., perform the conversion from explicit to tacit knowledge). What would happen if this knowledge were indexed, searched and shared with the rest of the project team in something akin to a project knowledge ‘journal’?
Since Web 2.0 platforms are ubiquitous, why can’t we use these tools as our knowledge repository? Employees and project team members are already using them…so can we find a way to ‘mine’ these platforms for knowledge?
Could a system be built that ‘mines’ these web 2.0 platforms along with other unstructured data (documents, email, etc) to ‘build’ a knowledge repository available to the entire organization?
Mining for Knowledge
I’m currently looking at ways to use text mining methods and techniques to mine for knowledge. Text mining looks to be a good approach to solving this problem because it allows for knowledge to be gathered without additional work by project team members.
There are other approaches that could be used for gathering knowledge from project team members, but all require additional work to input information. For example, a project team using a manual approach could ask team members to regularly update their blog and to ‘tag’ their posts with a special project tag or keyword so that a non-intelligent aggregation system (RSS, etc) could simply pull these tagged posts into a central repository. While this is a good approach, it relies on the end-user to tag their content correctly, accurately and in a timely manner. Tagging, and other categorization and taxonomic approaches, require the user to do something to allow their knowledge contribution to be categorized, indexed and found by aggregation systems and other users.
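As a sketch of what that manual, tag-based aggregation boils down to, consider the following (the tag name, post structure and data are all invented for illustration):

```python
# Minimal sketch of tag-based aggregation: pull only the posts that
# carry an agreed-upon project tag into a central repository.
# The tag "project-alpha" and the post records are hypothetical.
posts = [
    {"author": "alice", "tags": ["project-alpha", "lessons-learned"],
     "title": "What went wrong with the vendor handoff"},
    {"author": "bob", "tags": ["personal"],
     "title": "My weekend hike"},
]

def aggregate(posts, project_tag):
    """Collect only the posts explicitly tagged for the project."""
    return [p for p in posts if project_tag in p["tags"]]

repository = aggregate(posts, "project-alpha")
print([p["title"] for p in repository])
```

Note the failure mode: if alice forgets the tag (or bob writes something project-relevant without tagging it), that knowledge simply never reaches the repository – which is exactly the human-fallibility problem described above.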
Using text-mining methods against pre-existing tools and platforms removes the human-fallibility issues found in current knowledge management repository platforms, and in approaches that require a user to ‘tag’ a piece of content correctly, as described above.
Using text-mining and other data mining approaches, I’m looking at ways to build semi-autonomous systems that index and organize both structured and unstructured data pulled from blogs, email, IM, social networks, documents, spreadsheets and other data sources. Such a system could aggregate knowledge found via text mining and social network analysis and build a project knowledge ‘repository’ containing all knowledge for any specific project. This repository would be searchable and would contain both manually curated content (e.g., content uploaded by project team members) and automatically curated / generated content based on text-mining and indexing techniques.
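To make the automatic side concrete, here’s a toy sketch of one common text-mining technique, TF-IDF term scoring, which could generate index terms without any effort from the author. The documents and scoring details here are my own illustration, not the actual system:

```python
import math
from collections import Counter

# Toy TF-IDF index: score each term in each document by how frequent it
# is there (TF) and how rare it is across the collection (IDF).
# The three "documents" below are invented examples.
docs = {
    "blog-post-1": "vendor handoff delayed the migration schedule",
    "email-42":    "migration schedule slipped after vendor delays",
    "wiki-page":   "team offsite agenda and travel notes",
}

def tfidf(docs):
    n = len(docs)
    df = Counter()  # document frequency: how many docs contain each term
    for text in docs.values():
        df.update(set(text.split()))
    scores = {}
    for name, text in docs.items():
        tf = Counter(text.split())
        scores[name] = {t: tf[t] * math.log(n / df[t]) for t in tf}
    return scores

index = tfidf(docs)
# The highest-scoring term in each document becomes a machine-generated "tag".
for name, terms in index.items():
    print(name, max(terms, key=terms.get))
```

A real system would add stop-word removal, stemming and smarter ranking, but the principle is the same: the index is derived from what people already wrote, with no extra tagging work required.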
There are some major privacy issues here, of course. How can you mine a user’s email and find the relevant knowledge without truly invading their privacy? I’m not sure you can, but I’m looking at it.
Which of these two sources of knowledge would you trust to be more accurate?
The same can be said of knowledge captured and shared within an organization. How do you know that the white paper on your new API is accurate? Is it because it was released? Is it because of the author(s) of the paper? What if you had a knowledge base generated by an autonomous agent using text-mining techniques…how would you know whether to trust the information contained in it? Who wrote the content? Where did it come from?
This is where trust comes into play. If you could ‘see’ the qualifications of the author or authors of the knowledge base articles, would you trust the content more? If I knew that the world’s leading authority on organizational behavior wrote the Wikipedia article on the subject, I’d tend to trust that article more.
This is another aspect of my research: building trust into the mined knowledge using social network analysis (SNA) methods and techniques. Using SNA techniques, can the background, profiles, connections and knowledge of the users within an organization be automatically (or semi-automatically) analyzed to provide some form of initial trust metric showing that mined knowledge can be trusted?
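As an illustration of the kind of SNA signal that could feed a trust metric, here’s a toy degree-centrality calculation over an invented interaction network (real work would use richer measures such as PageRank or betweenness over actual email/IM/collaboration data):

```python
# Toy trust signal: score each author by how central they are in the
# organization's interaction network. Names and edges are invented.
edges = [
    ("alice", "bob"), ("alice", "carol"), ("alice", "dave"),
    ("bob", "carol"), ("dave", "eve"),
]

def degree_centrality(edges):
    """Fraction of the other people each person is directly connected to."""
    people = {p for edge in edges for p in edge}
    degree = {p: 0 for p in people}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    n = len(people)
    return {p: d / (n - 1) for p, d in degree.items()}

trust = degree_centrality(edges)
print(max(trust, key=trust.get))  # alice: the most connected author
```

The idea would be that content mined from a well-connected, frequently-consulted author starts with a higher trust score than content from an unknown source – a starting point, not a verdict.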
I don’t know if it can…but I’m looking into it 🙂
So what are the next steps for me and this research?
I’m working on a research paper now that I hope will outline the research in more detail.
The entire book is based on showing the reader how organizations use statistics, data mining and regression analysis to determine how to better run their businesses and/or get more money from you. The book is not too technical nor full of numbers, and the author writes for the non-technical/non-geeks out there.
What I found most interesting about this book was the ‘behind-the-scenes’ details of how companies like Wal-Mart are using data mining and other techniques to model and manage their logistical systems.
Ayres also provides some very interesting (and slightly disturbing) anecdotes about the use of these methods by casinos to ensure that gamblers don’t cross their ‘pain threshold’ while gambling (a threshold calculated from various statistics about the gambler). The casino will nonchalantly ask the gambler if they’d like to receive a free dinner…this isn’t really to ‘comp’ the gambler…it’s just to make them forget about the money they’ve lost.
Another interesting/disturbing example shows credit card companies using data mining and modeling techniques to ‘get the most from’ their customers.
This book is a fun read and one that I think everyone should pick up. It is a purely non-technical book on the subject of data mining, modeling and statistical analysis and is full of interesting nuggets of information. If you read the book Freakonomics by Levitt and Dubner, you’ll like this book.
PS – If you are wondering why two book reviews in one week, it’s because I got caught up on reading during vacation. 🙂