Predictive Analytics - Did You Know

Thoughts on Predictive Analytics, Big Data, IoT and Machine Learning

Jim Adams - Systems Engineer © 2019

Rekognition

Reposted 10/18/2019. Original Posting on 02/15/2019

Did You Know that Amazon has software called Rekognition that does image analysis? At a high level, looking at a picture, it can identify objects such as people, text, scenes, and activities. I mean, this is image recognition on steroids. For example, it can look at an image and tell if the faces are male or female, if the faces are smiling or frowning, or if the eyes are open or shut. It can tell if the ground is gravel or paved or grass, for instance. Very cool analysis. But wait, let’s jam on this for a minute.

This is the most impressive image analysis software I have ever seen, but we are still a LONG way off from real in-depth image content analysis. Here is an example of what can and can’t be done yet with image recognition. I have a picture of my dog sitting under a Cottonwood tree in my back yard in Gilbert, AZ. Image analysis can tell the GPS coordinates, date and time (from the photo’s metadata) and that there is a dog and a tree and the sky is blue and the grass is green. Very cool, but it cannot tell that it is a Cottonwood tree and a mixed-breed German Shepherd with fleas and that the grass needs cutting!! All in all, still very cool that Amazon can do this.

In other words, today, in many cases, a person has to enter the specific metadata about the Cottonwood and the German Shepherd and any other information in the background of the image, such as swing sets, the garage and the lawn mower. Overall, this technology is progressing rapidly and is certainly very cool, but I wanted you to be aware that there is still a long way to go with image recognition.
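
For a sense of how simple the API is to call, here is a minimal sketch using the AWS SDK for Python (boto3). The bucket and file names are hypothetical, and you would need AWS credentials configured locally:

```python
# A minimal sketch of calling Rekognition with boto3. The bucket and file
# names are made up; AWS credentials must be configured for this to run.
import boto3

rekognition = boto3.client("rekognition", region_name="us-west-2")
image = {"S3Object": {"Bucket": "my-photo-bucket", "Name": "backyard-dog.jpg"}}

# Detect objects and scenes (dog, tree, grass, ...) with confidence scores.
labels = rekognition.detect_labels(Image=image, MaxLabels=10, MinConfidence=80)
for label in labels["Labels"]:
    print(f'{label["Name"]}: {label["Confidence"]:.1f}%')

# Detect faces and their attributes (gender, smile, eyes open or shut).
faces = rekognition.detect_faces(Image=image, Attributes=["ALL"])
for face in faces["FaceDetails"]:
    print(face["Gender"]["Value"], face["Smile"]["Value"], face["EyesOpen"]["Value"])
```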


R, Python, Java

Reposted 09/27/2019. Original Posting on 02/22/2019

Did You Know there are four very popular languages used in data science today? And, none of them are COBOL or FORTRAN, for you old timers!!! Here are the top four.

1.) R: The language named “R” is the number one language for data analytics and data mining. It was developed in 1997 as a free substitute for expensive statistical software like MATLAB or SAS. Using R, you can sift through complex data sets and create sleek graphics to represent the numbers, all in just a few lines of code. I have used R a lot; it is excellent for creating spectacular charts and graphs and is fairly easy to learn. Free download of RStudio

2.) Python: Python offers fast data mining and more general-purpose programming capabilities. Python is capable of statistical analysis but has emerged as a good option for general data processing and software programming. Python is VERY popular within our customer base and is slowly taking over the Java market, especially since Oracle began charging for commercial Java support. Free download of the community edition of PyCharm. PyCharm is a nice Python editor and IDE.

3.) Julia: There is still a performance gap, and it is filled by Julia. Julia is high-level, fast and expressive, and it is more scalable than Python and R. It is used for high-performance numerical analysis – much the way FORTRAN was back in the 1960s. Julia is gaining steam and is very promising, though the data science community using Julia is still in its early stages. It is interesting how these languages are developed and adopted.

4.) Java: Java is an old and famous language used in the development of social media sites such as Facebook, LinkedIn, and Twitter. Java doesn’t have the same quality of visualization as R and Python, but it is good for plowing through massive data sets and doing statistical modeling. Today, Java is still the most popular general purpose programming language on planet Earth. Free Java IDE download called NetBeans 8.2

Personally, I had never even heard of Julia until researching this, but I have used R, Python and Java hundreds of times. All three are free and can be downloaded from the web. There are some excellent free Software Development Kits (SDKs) that you can download that, arguably, make programming fun and easy. In my opinion, R and Python are the easiest to learn. I have some downloadable books with links here.


R and Analytics

Reposted 09/14/2019. Original Posting on 05/03/2019

Did You Know that the programming language named “R” is the number one language for data analytics and data mining? Yup. It was developed in 1997 as a free substitute for expensive statistical software like MATLAB or SAS. Using R, you can sift through complex data sets and create cool graphics to represent the numbers, all in just a few lines of code. I have used R a lot; it is excellent for creating spectacular charts and graphs and is fairly easy to learn. I use it for data center assessments to predict future workloads and utilizations.

There is an Integrated Development Environment (IDE) that is easy to download, install and use. It will allow input of Excel sheets and raw data from various sources. It has the source code window and graphs and debugging all on a single pane of glass. Extremely cool and easy to use. Here is a free download of RStudio, an IDE built around the R language.

So, who uses R, you ask? Lots of organizations that you know and love, such as:

  • Facebook – For behavior analysis related to status updates and profile pictures
  • Google – For advertising effectiveness and economic forecasting
  • Twitter – For data visualization and semantic clustering
  • Uber – For statistical analysis
  • Airbnb – Scaling data science
  • Bank of America and American Express.

For an exhaustive list of who uses R, please navigate to this link.

In my humble opinion, R is one of the easiest computer languages to learn.


Natural Language Processing

Reposted 09/06/2019. Original Posting on 04/19/2019

Did You Know that Natural Language Processing (NLP) is still one of the most challenging computing problems we face today? Take for example the sentence, “I drove the car into the garage and it died.” What exactly died and did death really occur? And the grammar implies that the garage died. Weird. As English-speaking adults, we all implicitly know the car stopped running but the sentence can be misconstrued by software. This is a huge problem for software and NLP.

Sure, Siri and Google Voice can understand the vast majority of what we say, but there are many phrases that confuse them. Here is another example that I like: “The committee denied the group a parade permit because they feared violence”. Change the word “feared” to “advocated” and the referent of “they” shifts from the committee to the group, as we have here: “The committee denied the group a parade permit because they advocated violence”. Is this crazy or what?

There are thousands of these convoluted sentences, questions, and phrases. Siri and Google Voice operate mainly on simple, one-answer questions or statements like “Play Doobie Brothers” or “What is the weather in St. Louis?” But give it a Jeopardy! answer and ask it to find the question, and you will need IBM’s Watson. Watson uses Predictive Analytics to statistically determine which answer is the most probable one. It produces many answers ranked by probability as to the one “most likely” to be correct.

Here are two more of my favorites:

  • “Environmental regulators grill business owner over illegal coal fires.”
  • “Women, without her, man is lost” or “Women, without her man, is lost”. (I moved a single comma and the meaning changes)

Predictive Analytics is now a huge part of our lives and will become even more intertwined in our lives going forward. I think it is very exciting.


Pareidolia

Posted – 08/30/2019

Did You Know that our eyes and brains can make sense of images even when the images are constructed out of peculiar materials? Our minds have a way of organizing data points even when they are a confusing mass of points. Similarly, we can also make up patterns from clouds or ink blots by seeing things that are not actually there. Have you ever seen a face on the moon or in a cloud? Weird, huh?


This is called Pareidolia, which is our ability to turn a vague image into something meaningful. And as you probably guessed, it is the premise behind Big Data and Predictive Analytics – looking at ridiculous amounts of data for interesting patterns. Can you see a face in the images below? One is made up of M&M’s and the other of dominoes. Pareidolia is hard at work here.

If you’re into statistics, Pareidolia is considered a Type I error, or simply jumping to conclusions – the false positive. But this is not nearly as dangerous as a Type II error – not seeing the connection when there is one. Finding patterns is better than not finding patterns. Walmart, Target – heck, all of the big retail giants – use some form of pattern recognition to determine what product sells and what triggers higher sales. They don’t really care why. They just want to see the correlation between events to improve sales. If a storm is approaching the Florida coastline, sales of strawberry Pop-Tarts go up by 7X. Walmart doesn’t care why. They stock up on Pop-Tarts!!! They recognized a pattern in the data. Again, much better than not seeing the pattern.
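
To see how easy it is to commit that Type I error when you hunt hard enough, here is a small Python sketch. All of the data is pure random noise, so any “pattern” it finds is pareidolia by construction:

```python
# A sketch of statistical pareidolia: hunt long enough in pure noise and
# you will "find" a pattern (a Type I error). Every number here is random.
import numpy as np

rng = np.random.default_rng(42)
sales = rng.normal(size=365)                  # fake daily sales (pure noise)
candidates = rng.normal(size=(1000, 365))     # 1,000 fake unrelated signals

# Correlate every candidate signal with sales and keep the best match.
correlations = [np.corrcoef(sales, c)[0, 1] for c in candidates]
best = int(np.argmax(np.abs(correlations)))
print(f"Signal #{best} correlates at r = {correlations[best]:.2f} -- by pure chance")
```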


Google Flu Predictive Analytics

Posted – 08/23/2019

Did You Know that Google can predict a flu outbreak in a specific neighborhood in mere minutes, where it takes weeks before the Centers for Disease Control and Prevention (CDC) can detect that same outbreak? Crazy. Google does it on the fly, too, with predictive analytics and a serious amount of Big Data. Yes, Big Data is the key, but so is the ability to process a friggen ton of data very fast. Google uses your search criteria and location data – search criteria such as Urgent Care, thermometers, chicken soup, Theraflu or Robitussin, plus your zip code – to determine that a particular community is probably ill. If thousands of people are searching for flu remedies in a specific part of town, statistically, a portion of those people searching are more than likely sick. Yeah, sure, some of the searches are students doing research for a school paper, but statistically – plus or minus some margin of error – Google can make the prediction based on all the searches they have seen in the past. And as they store more and more data, the analytics become more and more accurate.

Anyway, the notion of Big Data has been around for decades but only in the past few years were we able to process all that data – fast. In the past, we had the data but the computing power was slower and thus we “sampled” the data and even then, it took a while to get the results. Today, we plow through “all the data” in real time and predict the future. Very cool.

Interestingly, we no longer need 100 percent “clean and sanitized” data. The data that Big Data uses includes noise and outliers. But the sheer volume of data will outpace the noise, which is the notion behind the “Wisdom of Crowds”. Check it out. It is very cool how billions of tweets can be used to deduce an answer even with garbage included. The answer will still bubble to the top.

I just finished a great book titled “Big Data” by Viktor Mayer-Schoenberger and Kenneth Cukier (2014), and it has hundreds of examples, good and bad, on how Big Data is used. And “The Wisdom of Crowds” by James Surowiecki is equally as entertaining.


Wisdom of Crowds

Posted - 08/15/2019

Did You Know that the notion of the Wisdom of Crowds is a fairly accurate way to get information? Say What? This started in 2004, when James Surowiecki gave a name to the accuracy of the aggregated many: “the wisdom of crowds.” It’s the idea that the collected knowledge or judgments of a large number of people tends to be remarkably correct. One interesting example is where a British scientist named Francis Galton used a large number of people to guess the weight of an ox. Don’t laugh. He analyzed a weight-guessing contest at a country fair. Galton was inspired to run statistical tests on the responses and discovered that the average of all 787 responses deviated from the ox’s true weight by a single pound. What? This works because some folks guessed high and some guessed low, but as the quantity of guesses gets higher and higher, the mean of the guessed weights falls very close to the actual number.
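
Here is a quick Python simulation of the same effect, using the commonly cited figure of 787 guesses and an ox weight of 1,198 pounds; the spread of the guesses is an assumption:

```python
# A sketch of Galton's ox experiment: many noisy guesses, accurate mean.
import numpy as np

rng = np.random.default_rng(7)
true_weight = 1198                                   # pounds (commonly cited)
# 787 guesses: some high, some low, individually off by quite a lot.
guesses = true_weight + rng.normal(0, 120, size=787)

print(f"Average individual error: {np.mean(np.abs(guesses - true_weight)):.0f} lbs")
print(f"Error of the crowd's mean: {abs(guesses.mean() - true_weight):.1f} lbs")
```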


Another example is Wikipedia. The publicly edited online encyclopedia has built a massive collection of articles contributed by millions of users. While the site has its share of detractors, studies by Nature, the Journal of Clinical Oncology, and others have found the resource to have a level of reliability on par with Encyclopedia Britannica. In other words, vast, anonymous crowds have compiled a thorough and reliable encyclopedia just about as well as a certified group of experts. And when it comes to breadth of topics covered, the free encyclopedia far outstrips its rival.

To be fair, there is a lot of debate and discussion around this topic. Does it really work? It depends. If it is a normally distributed set of independent opinions being tested, then accuracy is good. If it is a topic fueled by media, bias and rhetoric, then no, it does not work. In other words, if some other force is pulling emotions and decision-making ability one way or the other, then the wisdom of crowds notion fails. It has to be used wisely. In some cases, crowds can be remarkably unwise, particularly on complex subjects where the stakes are highest. So, beware on this one. Much research still needs to be done, in my opinion.


Micro-expression and Emotion Creation

Posted - 08/09/2019

Did You Know that every stranger’s face hides a secret, but the smiles in the image below conceal a big one: These people do not exist. They were generated by machine learning algorithms for the purpose of probing whether AI-made faces can pass as real. University of Washington professors Jevin West and Carl Bergstrom generated thousands of virtual facial images to use in an online game that pairs each counterfeit with a photo of a real person and challenges players to pick out the true human. Nearly 6 million rounds have been played by half a million people. These are some of the faces that players found most difficult to identify as the cheery replicants they are.


Artificial Intelligence has advanced from micro-expression and emotion detection to micro-expression and emotion creation. This is a bit unnerving, as now we cannot really be sure of what we see. My dad used to say, “Don’t believe anything you hear and only half of what you see.” I guess we should be leery of everything we see nowadays.

Originally, facial recognition algorithms figured out how to decipher images. That’s why you can unlock an iPhone with your face. More recently, machine learning has become capable of generating and altering images and video. Back in 2018, researchers and artists took AI-made and enhanced visuals to another level. The faces were made using a technique invented in 2018 by researchers at Nvidia, the graphics processor company.

References

  1. Image found in a Wired magazine article.

Analytics Maturity Model

Posted - 08/02/2019

Did You Know that there is an Analytics Maturity Model? Say What? We just learned about predictive analytics, and now there is a formal maturity model!!! The model is a stepwise approach to determine where an organization is in its Business Intelligence journey. There are several of these models from Gartner, Forrester, Jirav, TDWI and others, but they are all roughly the same, and they all illustrate what I have put in my diagram below, which has excerpts from all of them.


The maturity model starts at the lowest level with basic Reporting, Analysis and Monitoring, which the majority of organizations have performed for years. These lower maturity levels are statistical in nature. Then, as organizations mature into the Predictive stages, they find value with a more complex set of tools and algorithms. Many forward-thinking organizations are in this phase today, looking at tons of historical data and predicting some future state. The biggest use cases are on the business side of the tent, but IT and data centers can benefit from predictive models as well.

At the top of the maturity model is the Prescriptive stage, which is where an organization can understand why things will happen and how to make things happen. This is where y’all want to be. As one takes this journey, the complexity certainly increases, but so does the value proposition for the business.

We all have questions related to business and IT. And we are drowning in data. The problem is we don’t always know what data to use; we don’t know where to find it, and we don’t know if the data can be trusted. Many managers and business leaders still use their gut and intuition to make decisions. But, statistically, they tend to be more wrong than right.

References

  1. Image Above Courtesy Jim Adams (July, 2019) Adapted from Gartner and Forrester Maturity Models

Supervised and Unsupervised Machine Learning

Posted - 07/26/2019

Did You Know Machine Learning is a subset of Artificial Intelligence (AI)? Said a bit differently, Machine Learning is a method of data analysis that automates analytical model building. With Machine Learning, computers can learn from the data using algorithms without explicitly being programmed to hunt for anything specific. There are two types of ML: Supervised Learning and Unsupervised Learning.

Supervised Machine Learning

This type of ML is built around algorithms that are engineered to hunt for things such as correlations between various known data points. For example, maybe you have a hunch that there is a correlation between sales and weather. You write your programs and algorithms to search for that type of relationship. Supervised learning algorithms require labeled training data and apply the information learned to future predictions.

A decent example of Supervised Learning would be a student learning a course from an instructor. The student knows what he/she is learning from the course. It is my opinion that this is the most popular use of machine learning, so far.
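 
To make the idea concrete, here is a minimal sketch in Python with scikit-learn; the temperature and sales numbers are fabricated, but they show the pattern: labeled examples in, a trained predictor out.

```python
# A sketch of supervised learning: the algorithm is trained on *labeled*
# examples (here, fabricated temperature/sales pairs) and then predicts.
from sklearn.linear_model import LinearRegression

# Training data: known temperature (F) -> known ice cream sales (labels).
temps = [[60], [65], [70], [75], [80], [85], [90]]
sales = [200, 240, 290, 330, 390, 420, 480]

model = LinearRegression().fit(temps, sales)        # learn from labeled data
print(model.predict([[95]]))                        # predict an unseen case
```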

Unsupervised Machine Learning

This type of ML is built around algorithms that are engineered to hunt for anything and everything, with no particular relationship in mind. In other words, it is hunting for relationships and correlations simply by processing crazy amounts of data. ML of this type is where one can find nuggets of information that were previously unknown. Unsupervised learning is a branch of machine learning that learns from data that has not been labeled, classified or categorized. Instead of responding to feedback, unsupervised learning identifies commonalities in the data and reacts based on the presence or absence of such commonalities in each new piece of data.

Using the student example, again, Unsupervised Learning would be a student taking a completely unnamed course and learning whatever is taught with no expectation of a subject or direction.

The concept of unsupervised learning is not as widespread as supervised learning. In fact, the concept has been put to use in only a limited number of applications.
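
For contrast, here is a minimal unsupervised sketch, again with scikit-learn; the customer data is fabricated, and note that we never hand the algorithm any labels:

```python
# A sketch of unsupervised learning: no labels, no hypothesis -- just let
# the algorithm hunt for structure. KMeans finds the clusters on its own.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Unlabeled data: two hidden groups of customers we never told it about.
group_a = rng.normal(loc=[25, 30000], scale=[3, 4000], size=(50, 2))
group_b = rng.normal(loc=[55, 90000], scale=[5, 8000], size=(50, 2))
customers = np.vstack([group_a, group_b])           # age, income -- no labels

kmeans = KMeans(n_clusters=2, n_init=10).fit(customers)
print(kmeans.cluster_centers_)                      # the discovered segments
```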

History

In the 1950s, simple algorithms were created to pioneer the notion of machine learning. Alan Turing created the “Turing Test” to determine if a computer has real intelligence. If the computer can convince a user that it is human, it passes the test. In that same decade, the perceptron, the first artificial neural network for computers, was designed by Frank Rosenblatt.

Recent Developments

Big technology companies like Amazon, Google, and Facebook offer machine learning tools to help developers apply machine learning techniques to business. Today, there are millions of dollars being invested in machine learning and artificial intelligence. Gartner predicts the business value created by AI will reach $3.9T in 2022. And, IDC predicts worldwide spending on cognitive and Artificial Intelligence systems will reach $77.6B in 2022. Yikes.

It is already transforming the automotive industry. One of the most valuable areas of machine learning and AI technology is self-driving. Combining computer vision and machine learning systems enables autonomous driving.

Applying machine learning to healthcare is also a big step towards anomaly detection and early diagnosis. Machine learning can learn the symptoms and causes of diseases and make risk assessments.

Applications:

  • Image Recognition
  • Speech Recognition
  • Data Mining
  • Natural Language Processing
  • Prediction Models
  • Classification
  • Medical diagnosis

Nearly every piece of personalized content that you encounter on the web is a product of machine learning that analyzed your behavior and made predictions about what to show you.

  • Email filtering
  • Recommendation Personalization
  • Sales forecasting
  • Lead generation
  • Robotic process automation
  • Risk Management


Deep Learning

Posted - 07/19/2019

Did You Know that Deep Learning is a machine learning technique that “teaches” computer systems to do what comes naturally to humans? That is, to learn by example and experience. Deep learning is a key technology behind driverless cars. It enables them to recognize a stop sign or to distinguish a pedestrian from other objects. It is the key ingredient to voice control in consumer devices like phones, tablets, TVs, and hands-free speakers. Deep learning is getting a great deal of attention lately, and it’s achieving results that were not possible in years past.

Another example is image recognition, which can be used by a computer system to recognize and understand every form of chair on planet Earth. So, an image recognition system can come across a recliner, dining chair, step stool, rocking chair, you name it, and the system will determine that it can sit on a chair if it sees one in its database of chair images. But what if it sees a large rock, log or stump in the woods? Can the algorithms deduce that it can sit on a large rock based upon images of chairs? That type of decision making and inductive reasoning – being able to infer one thing from another set of data points – is where deep learning comes into play. A human knows immediately that he or she can sit on a stump as well as a kitchen chair.

In deep learning, a computer model learns to perform classification tasks directly from images, text, or sound. Deep learning models can achieve state-of-the-art accuracy, sometimes exceeding human-level performance. Models are trained by using a large set of labeled data and neural network architectures that contain many layers.
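
As a flavor of what “many layers” looks like in practice, here is a minimal sketch using the Keras API in TensorFlow; the data is synthetic and the architecture is purely illustrative:

```python
# A minimal sketch of a deep network in Keras: stacked layers learning a
# classification task. The data and the label rule are both fabricated.
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(1)
x = rng.normal(size=(1000, 20))                    # 1,000 samples, 20 features
y = (x.sum(axis=1) > 0).astype(int)                # a label the net must learn

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),  # "deep" = stacked layers
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x, y, epochs=5, verbose=0)
print(model.evaluate(x, y, verbose=0))             # [loss, accuracy]
```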

References

  1. Image to the above-right found at Datamation

Micro-expression and Emotion Detection

Posted - 07/12/2019

Did You Know that Maya Angelou once said, “People will forget what you said, people will forget what you did, but people will never forget how you made them feel”? I totally agree.

It is the same in business. It is important to know how customers feel about our services, products and our team. Sales figures, surveys, social media posts and ratings will give a general idea about customer sentiment, but they do not provide the finer, granular insights into what goes unsaid. This is where emotion analysis and micro-expression detection come into play. Facial expressions and voice modulation can be analyzed to interpret the emotional state of an individual or group. This technology goes a step beyond facial recognition and sentiment analysis to provide deeper insights about what people are truly feeling at a given point in time.

There is an excellent TV show on Fox called “Lie to Me” which goes deep into this subject in a dramatic and funny way. It is exciting and scary at the same time. There are people who actually study micro-expressions and can tell your state of mind, or whether you are fibbing, just by watching your body language!! And there are a few “naturals” that can detect the finest micro-expressions. That is crazy.


Also, I just finished an excellent book on this same subject titled “Reading People” by Jo-Ellan Dimitrius, Ph.D., that goes into excruciating detail on how this is done – and how you, too, can be painfully aware of your body language and how others interpret how you interact. It is used for jury selection, interviews, depositions, TSA, event access, etc. One crazy example is that Disney uses emotion analysis to capture the response of audiences for their movies. The practice involves capturing viewers’ facial expressions during a movie using infrared cameras. This provides them with numerous data points that are fed to an artificial intelligence algorithm. They can tell how well a movie is enjoyed frame by frame. Wow.

And there is now software that does this type of expression and emotion analysis with results like what is shown in the image below. Yes, it is all based on machine learning, AI and predictive analytics. Nothing is certain in this technology space but with probabilistic analysis you can get a very good idea of what someone is feeling.

References

  1. Images Found at Frontiers in Psychology

City of Boston Big Data

Posted – 06/28/2019

Did You Know that the City of Boston uses Big Data to identify potholes in the streets? Wait, what? The solution they came up with uses magnitude-of-acceleration spikes along a cell phone’s z-axis to spot impacts. Said differently, the phone uses your velocity and quick changes of height to tell that you are driving and hitting potholes. How crazy is that?


They developed an app for the iPhone and Android and allowed people in Boston to load the app and report this data automatically. The data points and subsequent analysis can save significant road survey costs. Navigation systems can also use the cell phone data to avoid traffic congestion and offer alternate routes. The City of Boston got the idea from a crowdsourcing contest.
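
The core trick can be sketched in a few lines of Python; the sampling rate, threshold and acceleration numbers below are invented for illustration:

```python
# A sketch of the pothole idea: flag magnitude-of-acceleration spikes on
# the z-axis. Real apps would also use GPS and speed; thresholds are made up.
import numpy as np

rng = np.random.default_rng(3)
z_accel = rng.normal(9.8, 0.3, size=600)           # smooth road (~1 g), 60 s at 10 Hz
z_accel[250] += 6.0                                 # a pothole impact
z_accel[480] += 5.2                                 # another one

THRESHOLD = 3.0                                     # m/s^2 above the mean
spikes = np.where(np.abs(z_accel - z_accel.mean()) > THRESHOLD)[0]
print(f"Possible potholes at samples: {spikes}")    # -> [250 480]
```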


Descriptive, Predictive and Prescriptive Analytics

Posted – 06/21/2019

Did You Know there are three types of analytics: Descriptive, Predictive and Prescriptive?

Descriptive Analytics

Descriptive Analytics, commonly known as Statistics, offers insight into the past. The vast majority of the statistics that we use falls into this category. It generally involves basic arithmetic like sums, averages, medians, percentages, etc. Usually, the underlying data is a count or aggregate of a filtered column of data to which basic math is applied. For all practical purposes, there are an infinite number of these statistics. Descriptive statistics are useful to show things like total stock in inventory, average dollars spent per customer year over year, storage array IO, CPU utilization, and market trends.
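
Descriptive analytics is simple enough to sketch in a few lines of Python with pandas; the sales figures below are made up:

```python
# A sketch of descriptive analytics: plain counts and averages over the past.
import pandas as pd

sales = pd.DataFrame({
    "customer": ["A", "B", "A", "C", "B", "A"],
    "dollars":  [120,  80, 200,  50,  90, 160],
})
print("Total revenue:", sales["dollars"].sum())
print("Average sale:", sales["dollars"].mean())
print(sales.groupby("customer")["dollars"].sum())   # insight into the past
```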


Predictive Analytics

The goal of Predictive Analytics is to look at the Descriptive Analytics output and predict some future state – within limits, of course. With the right approach and the proper algorithm, data can be analyzed to find things such as potential component failures, to forecast events, and to flag other critical issues. Predictive Analytics is all about understanding the future. It can provide you with actionable insights based on past data and can give you estimates about the likelihood of a future outcome. Companies use Predictive Analytics to forecast what might happen in the future. This works because the foundation of predictive analytics is probability.

Prescriptive Analytics

Prescriptive analytics takes Predictive Analytics one step further by offering specific and actionable next steps for items discovered in the predictive data analysis. While predictive analytics can tell you what will happen and when it will happen, prescriptive analytics applies machine learning to suggest options that one can take for capitalizing on the analysis.

Forrester defines Prescriptive Analytics as “Any combination of analytics, math, experiments, simulation, and/or artificial intelligence used to improve the effectiveness of decisions made by humans or by decision logic embedded in applications”. Here is a great online article titled “What Exactly the Heck are Prescriptive Analytics”. It goes deep into what makes up Prescriptive Analytics. An excellent read if you like analytics. Seriously, who doesn’t?

Summary

  • Descriptive Analytics - Insight into the Past
  • Predictive Analytics - Understanding the Future
  • Prescriptive Analytics - Advise on Possible Outcomes


Machine Learning

Posted – 06/07/2019

Did You Know that there are two forms of Machine Learning? I know, ML is confusing, and here is why. It is like Theoretical Physics and Applied Physics: one is used to look deeper to understand the physical world, and the other takes what has been discovered and applies it to real-world problems. Machine Learning has two parts as well. Mathematicians and scientists use one level of ML to discover algorithms in a theoretical manner. And the rest of us use what has already been discovered and developed to understand the real world of business and engineering. Machine Learning is a subset of Artificial Intelligence and sits above Deep Learning, as can be seen in the graphic below.

So, who uses ML? Well, Amazon is one of the biggest known users of ML and this extends to the Kindle and the Echo. IBM Watson uses ML. So does Apple, Facebook, Google and Qualcomm to mention six out of hundreds.

You are no doubt wondering what algorithms people use for ML. Well, inside each of those two forms of ML described above, there are plenty of algorithms, such as:

  • Attention Mechanisms & Memory Networks
  • Bayes Theorem & Naive Bayes
  • Decision Trees
  • Eigenvectors, eigenvalues
  • Evolutionary & Genetic Algorithms
  • Expert Systems/Rules Engines/Symbolic Reasoning
  • Linear Regression
  • Generative Adversarial Networks (GANs)
  • Graph Analytics
  • Logistic Regression
  • LSTMs and RNNs
  • Markov Chain Monte Carlo Methods (MCMC)
  • Neural Networks
  • Random Forests
  • Reinforcement Learning
  • Natural Language Processing (NLP).

Yeah, that is a lot!!! Each of these can be a chapter of a book. So, let’s end here and go into some of these later.
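
To make just one of these concrete, here is a minimal Naive Bayes sketch with scikit-learn; the “email” features and labels are fabricated:

```python
# A sketch of Naive Bayes: classify fabricated emails as spam or ham from
# two simple features, using Bayes-style probabilities under the hood.
from sklearn.naive_bayes import GaussianNB

# Features: [number of exclamation marks, count of the word "free"]
emails = [[0, 0], [1, 0], [8, 3], [6, 2], [0, 1], [7, 4]]
labels = [0, 0, 1, 1, 0, 1]                        # 0 = ham, 1 = spam

clf = GaussianNB().fit(emails, labels)
print(clf.predict([[5, 3]]))                       # -> [1], likely spam
print(clf.predict_proba([[5, 3]]))                 # class probabilities
```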

References

- Multiple Regression Analysis from Explorable.com (Jun 18, 2009)
- Seven Types of Regression Techniques by Sunil Ray, August 14, 2015
- Machine Learning and Applied AI by Andy Patrizio, March 21, 2019
- Top Predictive Analytics Examples
- Polynomial Regression by Animesh Agarwal, October 8, 2018


Eight Types of Regression

Posted – 05/31/2019

Did You Know that there are at least 8 types of Regression Analysis techniques used in Machine Learning (ML) and Predictive Analytics (PA)? Wait, what? Yes, eight. Each form has its own importance and specific conditions where it is best suited to apply. Now, to be fair, the most common technique for regression is definitely Linear Regression, mainly because it is reasonably simple to understand and implement. Non-linear Regression (also called Polynomial Regression) is more difficult, and Multiple Regression is even more rigorous. Where would one use Multiple Regression? Say, for example, the yield of rice per acre depends upon quality of seed, fertility of soil, fertilizer used, temperature, and rainfall. That is five variables that can be used to converge on a solution. Yeah, the math is ugly, but it can be done with Excel or R or Python, as sketched below.
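
Here is what that five-variable Multiple Regression might look like in Python with scikit-learn; every number below is invented for illustration:

```python
# A sketch of multiple regression on the rice example: five input
# variables, one predicted yield. All figures are fabricated.
from sklearn.linear_model import LinearRegression

# Columns: seed quality, soil fertility, fertilizer, temperature, rainfall
fields = [
    [7, 6, 120, 78, 32],
    [5, 5,  90, 75, 28],
    [8, 7, 150, 80, 35],
    [6, 4, 100, 77, 30],
    [9, 8, 160, 82, 38],
]
yields = [61, 48, 72, 53, 80]                       # bushels per acre (made up)

model = LinearRegression().fit(fields, yields)
print(model.coef_)                                  # weight of each variable
print(model.predict([[7, 7, 130, 79, 33]]))         # predicted yield
```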

Bringing this back to our beloved business of IT and Data Centers, most of the day-to-day analytics uses Linear Regression with just two variables such as CPU Utilization and Time, or IOPS and Days, or BTU and Months, etc. Usually some function of the data center and time like hours, days, months, etc.

So, not to be pedantic, but to show the list of major regression techniques, here they are:

1. Linear Regression
2. Non-Linear (Polynomial) Regression
3. Multiple Regression
4. Logistic Regression
5. Stepwise Regression
6. Ridge Regression
7. Lasso Regression
8. ElasticNet Regression

The first four in the list above are worth learning and are used predominantly in the world of PA and ML. But the others are used in all sorts of abstract ways, usually in academic settings or in deep learning and research. Predictive Analytics and Machine Learning are a huge part of our lives and will become even more intertwined in our lives going forward. I’m excited, are you?

References

- Types of Regression Techniques by i2 Tutorials
- 15 Types of Regression You Should Know, found at Listen Data
- Regression Basics by Mohammad Mahbobi


Big Data is Big Business

Posted – 05/17/2019

Did You Know that Big Data is Big Business? And, the market for Big Data solutions is clearly growing every day. Based on data from Statista (see chart below), revenue from the Big Data market is forecast to grow from an estimated $49 billion in 2019 to $103 billion by 2027.

The reason that revenue is expected to double in less than a decade is that data analytics is no longer optional for companies. For enterprises, managing and analyzing big data has become a critical part of how business gets done. According to a 2018 Big Data Executive Survey by New Vantage Partners, 97 percent of executives say they are investing in data projects (this is Big Data and AI combined).

I attached a short paper I downloaded from the web, titled Big Data 2019: Mining Data for Revenue. It contains a lot of information on the Big Data industry.


Now, the reason for this message is to shed light on Big Data and how it relates to Machine Learning and Predictive Analytics. Big Data is the combination of all sorts of noisy data from an enterprise, and even from data sources outside an enterprise. It contains unstructured data such as spreadsheets, PDFs, text files, machine data, log files, GPS and sensor data, etc., as well as structured data from databases. What makes it even BIGGER is that it often includes telemetry data, Tweets, Facebook data, weather data, traffic data, call logs, etc. from outside sources. All this is blended together into a big data lake, if you will. That is the Big Data part of the discussion.

Then, data scientists, or data analysts, write software or use specialized tools to skinny down the big data into small data. Yup, small data. It is still a lot of data but smaller than all the data combined. In other words, they filter out the noise, bad data, outliers and other arguably meaningless data to get to a clean data set that is worthy of being mined. It could still be millions of lines of data!!

Once the data set is cleaned and filtered, it is ready for Machine Learning algorithms. These ML algorithms are programs, processes and tools that read the cleaned data and look for patterns, clusters, classes and correlations. Within this data, there may be clusters and various classifications. At some point, an equation of a line is found within this data. Sometimes not. Sometimes, there are many lines or relationships that are found. This is the Machine Learning part of the debate.

These equations or relationships that were found are used for Trend Analysis, Time Series Analysis and Predictive Analytics. So, we take a known set of data, find the line that represents the majority of the data points and use that line to predict future values or relationships. Very cool, indeed. That is the Predictive Analytics part of the story.
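
Here is the whole pipeline in miniature, as a hedged Python sketch: fabricated “big” data with noise and an outlier, a filter to get “small” data, a fitted line, and a prediction:

```python
# A sketch of the pipeline: filter noisy "big" data down to clean "small"
# data, let the algorithm find the line, then use the line to predict.
import numpy as np

rng = np.random.default_rng(5)
months = np.arange(36)
utilization = 40 + 0.9 * months + rng.normal(0, 2, 36)   # real trend + noise
utilization[10] = 250                                     # a bogus outlier

# Step 1 (Big Data -> small data): filter out the garbage.
clean = np.abs(utilization - np.median(utilization)) < 50

# Step 2 (Machine Learning): find the equation of the line.
slope, intercept = np.polyfit(months[clean], utilization[clean], 1)

# Step 3 (Predictive Analytics): use the line to predict month 48.
print(f"Predicted utilization at month 48: {slope * 48 + intercept:.1f}%")
```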

Obviously, these three topics, Big Data, Machine Learning and Predictive Analytics are extensive subjects in and of themselves. But hopefully, I tied them together so you can differentiate them to some degree. And, of course, we’ll dig deep into each of them on future Friday – Did You Know messages.

References

- Big Data by Viktor Mayer-Schoenberger and Kenneth Cukier, 2014
- Why is Big Data Big Business by University of Southern Indiana, March 20, 2018


The Stethoscope and Analytics

Posted – 05/10/2019

Did You Know that the stethoscope was invented in France in 1816 by René Laennec and is now being replaced by IoT devices that do the same job? Sensor technology today is so advanced that an Apple Watch or Fitbit can get a better pulse rate than any doctor on the planet can get with an acoustic stethoscope. I read in Machine Design that sensors today can detect heart and lung defects better than a doctor, in less time and with far better accuracy (Machine Design, April 2019, p11). Plus, being digital, whatever the sensor detects can be ported right into a database or medical system with no translation required by medical staff. Faster, better, probably not cheaper!!! Extremely cool.

There are modern electronic stethoscopes on the market. They work by converting acoustic waves into electric signals that are then processed in a device to amplify sounds. The device from Johns Hopkins mitigates noise by improving the coupling between the patient’s body and the sensor. It swaps the rubber hose for an electric cable and employs digital noise-control techniques to capture a stronger signal. Heck, the Fitbit and Apple Watch can both detect atrial fibrillation (AFib). Can you say “Awesome?”

References

- Digital Stethoscope; a Major Game Changer in Healthcare by Dr. Hafsa Akbar Ali, December 14, 2018
- The Stethoscope Gets a “Smart” IoT Upgrade by Carlos Gonzalez, Machine Design, Feb 20, 2019


R and Analytics

Posted – 05/03/2019

Did You Know that the programming language named “R” is the number one language for data analytics and data mining? Yup. It was developed in 1997 as a free substitute for expensive statistical software like MATLAB or SAS. Using R, you can sift through complex data sets and create cool graphics to represent the numbers, all in just a few lines of code. I have used R a lot; it is excellent for creating spectacular charts and graphs and is fairly easy to learn. I use it for data center assessments to predict future workloads and utilizations.

There is an Integrated Development Environment (IDE) that is easy to download, install and use. It will allow input of Excel sheets and raw data from various sources. It has the source code window and graphs and debugging all on a single pane of glass. Extremely cool and easy to use. Here is a free download of RStudio, an IDE built around the R language.

So, who uses R, you ask? Lots of organizations that you know and love, such as:

- Facebook – For behavior analysis related to status updates and profile pictures
- Google – For advertising effectiveness and economic forecasting
- Twitter – For data visualization and semantic clustering
- Uber – For statistical analysis
- Airbnb – Scaling data science
- Bank of America and American Express.

For an exhaustive list of who uses R, please navigate to this link.

In my humble opinion, R is one of the easiest computer languages to learn.

References

- R for Dummies by Andrie de Vries and Joris Meys (435 page PDF)
- R in a Nutshell by Joseph Adler (732 Page PDF)
- Learning R by Richard Cotton (400 page PDF)


Natural Language Processing

Posted – 04/19/2019

Did You Know that Natural Language Processing (NLP) is still one of the most challenging computing problems we face today? Take for example the sentence, “I drove the car into the garage and it died.” What exactly died and did death really occur? And the grammar implies that the garage died. Weird. As English-speaking adults, we all implicitly know the car stopped running but the sentence can be misconstrued by software. This is a huge problem for software and NLP.

Sure, Siri and Google Voice can understand the vast majority of what we say, but there are many phrases that confuse them. Here is another example that I like: “The committee denied the group a parade permit because they feared violence”. Change the word “feared” to “advocated” and the referent of “they” shifts from the committee to the group, as we have here: “The committee denied the group a parade permit because they advocated violence”. Is this crazy or what?

There are thousands of these convoluted sentences, questions, and phrases. Siri and Google Voice operate mainly on simple, one-answer questions or statements like “Play Doobie Brothers” or “What is the weather in St. Louis?” But give it a Jeopardy! answer and ask it to find the question, and you will need IBM’s Watson. Watson uses Predictive Analytics to statistically determine which answer is the most probable one. It produces many answers ranked by probability as to the one “most likely” to be correct.

Here are two more of my favorites:

  • “Environmental regulators grill business owner over illegal coal fires.”
  • “Women, without her, man is lost” or “Women, without her man, is lost”. (I moved a single comma and the meaning changes)

Predictive Analytics is now a huge part of our lives and will become even more intertwined in our lives going forward. I think it is very exciting.


kNN and Predictive Analytics

Posted – 04/12/2019

Did You Know that there is a Predictive Analytics algorithm that can look at a bunch of numbers and determine some classification such as gender, race, color, car model, country, state, region, shirt size, etc.? The algorithm is called k-Nearest Neighbors, or k-NN for short. And, because I am a bit on the geeky side, I recently wrote a white paper on how to do this. It is attached for your reading enjoyment. Read it. Enjoy it. Pass it on.

k-NN is used in machine learning. It puts all the data points into a huge n-dimensional array and then uses Euclidean geometry to get the distance from the test subject to every point in the data set. The points with the shortest distances are probably the most similar to the test subject. How friggin cool is that?

For instance, let’s say we have 5 chairs, 10 beds and 15 tables, and for each we know the length, width, and height. Now, if someone gives us a new object and asks us to predict which category that new object belongs to, given we only have the length, width, and height, we can predict whether the object is a chair, table or bed using the k-Nearest Neighbor algorithm.
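
Here is that furniture example as a minimal scikit-learn sketch; the measurements are invented and the training set is much smaller than the 5/10/15 in the example:

```python
# A sketch of the furniture example with scikit-learn's k-NN classifier.
# The new object is classified by majority vote of its nearest neighbors.
from sklearn.neighbors import KNeighborsClassifier

# [length, width, height] in inches -- all fabricated
furniture = [
    [20, 20, 36], [22, 21, 38], [24, 22, 35], [23, 20, 37], [21, 22, 36],  # chairs
    [80, 60, 24], [76, 60, 25], [84, 72, 26],                              # beds
    [60, 36, 30], [72, 40, 29], [48, 30, 31],                              # tables
]
labels = ["chair"] * 5 + ["bed"] * 3 + ["table"] * 3

knn = KNeighborsClassifier(n_neighbors=3).fit(furniture, labels)
print(knn.predict([[70, 42, 30]]))                  # -> ['table']
```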

Another example: say we have the weight, height, and gender of a lot of people. If we are given a new weight and height combo, we can predict the gender using k-NN. Stretching this even further, an example of where this is used would be using age, height, weight, and gender to predict heart size. This is an excellent medical example where we can’t see the heart physically, but we can predict with some certainty the size of a heart using other attributes and k-NN.

And, arguably, the biggest use of k-NN would be for Recommender Systems. If we know that a user likes a specific item, then we can recommend similar items for them using k-NN. Recommender Systems are used in the recommendation of news or videos for media, product recommendations or personalization in travel, retail, video on demand, and music streaming.

The k-NN algorithm is one of the most popular algorithms for text categorization or text mining. Another area where k-NN is used is in agriculture, for simulating daily precipitation and other weather variables. Other interesting applications of k-NN are forecasting the stock market, predicting the price of a stock based on company performance measures and economic data, predicting currency exchange rates and bankruptcies, understanding and managing financial risk, trading futures, credit ratings and money laundering analyses. All very cool and popular areas of prediction.

Predictive Analytics and Machine Learning are now a huge part of our lives and will become even more intertwined in our lives going forward. I think it is very exciting.

References

- Using the k-Nearest Neighbor Algorithm by Jim Adams, April 8, 2019


Correlation Between FICO Scores and Grammar

Posted – 03/29/2019

Did You Know that some forward-thinking companies in the medical industry use your FICO score to determine the probability that you will take your medication? There is a strong correlation between a FICO score and personal responsibility. Similarly, insurance companies and lenders use your application data to see if you use proper grammar and capitalization when determining your interest rates and risk assessments. They deduced that people who use proper word capitalization and grammar are more responsible. Hence, more responsible applicants get better insurance rates and lower interest rates for loans. All this is done because mining Big Data revealed a correlation between these attributes. Can you say “interesting?”

That is what Big Data is all about. The hunt for correlation, not causation. Using software to bang through literally petabytes of sales data looking for correlations to weather, traffic, political events, and other completely unrelated data points.

References

- Machine Learning and FICO® Scores, September 2019
- Why Hospitals Want Your Credit Report by Sarah Rubenstein, March 18, 2018


Strawberry Pop Tarts and Hurricanes

Posted – 03/08/2019

Did You Know that the demand for strawberry Pop-Tarts goes up by 7X when a big storm is approaching the Florida coastline? What a crazy correlation. This is exactly what Walmart found when plowing through big data sets and looking for correlations. They do not know why, or even care why. All they know is that it happens. Walmart is on the statistical hunt for the “What”, not the “Why”.

That is what Big Data is all about: the hunt for correlation, not causation, to put a scientific spin on the words. Using software to rip and tear at petabytes of sales data looking for correlations between sales and weather, traffic, political events, world events and other completely unrelated data points. These unrelated data points can then be transformed into higher profits and larger markets. Target, Walmart, Coca-Cola, McDonald’s, Netflix, Chase, the NFL – all use Big Data relentlessly to see who buys their products, what they buy, and where they buy. They’re not concerned with the why. Interesting, is it not?

References

- What the Weather Has in Stores IBM


Predictive Analytics – Click, Buy, Lie or Die

Posted – 03/15/2019

Did You Know that businesses, political groups, and government agencies use Predictive Analytics and Social Currency to tell if you will Click, Buy, Lie or Die? It’s true. And, it’s astonishing how this is determined. They primarily use Twitter Feeds, Facebook, Instagram, and LinkedIn to determine what’s hot and what’s not. They use sentiment analysis and social currency to determine trends and patterns. This data is not used to determine what you or I will do specifically, but what we as a group will do.

Tell me more, you ask. They can use the 500 million Tweets per day to tell if we like a movie, product, or political candidate. It’s the same with Facebook, which is even bigger, with 2.32 billion monthly active users. Hundreds of millions of people post their thoughts daily as Social Currency. We have this built-in desire to be heard and to share our thoughts, feelings, and ideas – especially if we are angry or aroused about something. It is difficult not to share when we are upset or excited about something. People post about new babies, marriages, breakups, company closings, new products, failed products, politicians that are hated and loved. The list is long. Each is an emotional tug, and most people love to share. The younger generation tends to use online sharing, where older folks tend to share face-to-face over a drink at the local tavern or ball game. It’s the same thing – different medium. We all have a need to share information. It is our social currency.

This online sharing of social currency is where businesses, political groups, and government agencies use Predictive Analytics to tell what we are doing today and what we may do tomorrow. It is very exciting (to me anyway) how this is all done with a ridiculous amount of data and complex statistical analysis.
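
One small piece of this, sentiment scoring, can be sketched with NLTK’s VADER analyzer; the tweets below are made up, and the vader_lexicon data must be downloaded once:

```python
# A sketch of the sentiment-analysis piece using NLTK's VADER scorer.
# Requires: pip install nltk (the lexicon download happens on first run).
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

tweets = [
    "This new phone is amazing, best purchase ever!!!",
    "Worst customer service I have ever experienced.",
]
for tweet in tweets:
    scores = sia.polarity_scores(tweet)            # neg/neu/pos + compound
    print(f"{scores['compound']:+.2f}  {tweet}")
```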

And if you think you are not being sized up since you don’t use Twitter or Facebook, think again. These data feeds come from all vectors. Data points stream endlessly from our daily lives – from phones, credit cards, televisions and computers. They come from the infrastructure of cities and factories and satellites in the form of camera feeds, thermostats, and license plate readers. They come from sensor-equipped buildings, trains, buses, planes, bridges, and ships – even your car, with GPS and a thousand sensors that monitor your driving habits.

Predictive Analytics is now a huge part of our lives and will become even more intertwined in our lives going forward. I think it is very exciting.

References

- Predictive Analytics: Click, Buy, Lie or Die, February 19, 2013


R, Python, Java

Posted – 02/22/2019

Did You Know there are four very popular languages used in data science today? And, none of them are COBOL or FORTRAN, for you old timers!!! Here are the top four.

1.) R: The language named “R” is the number one language for data analytics and data mining. It was developed in 1997 as a free substitute for expensive statistical software like MATLAB or SAS. Using R, you can sift through complex data sets and create sleek graphics to represent the numbers, all in just a few lines of code. I have used R a lot; it is excellent for creating spectacular charts and graphs and is fairly easy to learn. Free download of RStudio

2.) Python: Python offers fast data mining and more general-purpose programming capabilities. Python is capable of statistical analysis but has emerged as a good option for general data processing and software programming. Python is VERY popular within our customer base and is slowly taking over the Java market, especially since Oracle began charging for commercial Java support. Free download of the community edition of PyCharm. PyCharm is a nice Python editor and IDE.

3.) Julia: There is still a performance gap, and it is filled by Julia. Julia is high-level, fast and expressive, and it is more scalable than Python and R. It is used for high-performance numerical analysis – much the way FORTRAN was back in the 1960s. Julia is gaining steam and is very promising, though the data science community using Julia is still in its early stages. It is interesting how these languages are developed and adopted.

4.) Java: Java is an old and famous language used in the development of social media sites such as Facebook, LinkedIn, and Twitter. Java doesn’t have the same quality of visualization as R and Python, but it is good for plowing through massive data sets and doing statistical modeling. Today, Java is still the most popular general purpose programming language on planet Earth. Free Java IDE download called NetBeans 8.2

Personally, I had never even heard of Julia until researching this, but I have used R, Python and Java hundreds of times. All three are free and can be downloaded from the web. There are some excellent free Software Development Kits (SDKs) that you can download that, arguably, make programming fun and easy. In my opinion, R and Python are the easiest to learn. I have some downloadable books with links here.

References

- R for Dummies by Andrie de Vries and Joris Meys (435 page PDF)
- R in a Nutshell by Joseph Adler (732 Page PDF)
- Learning R by Richard Cotton (400 page PDF)


Rekognition

Posted – 02/15/2019

Did You Know that Amazon has software called Rekognition that does image analysis? At a high level, looking at a picture, it can identify objects such as people, text, scenes, and activities. I mean, this is image recognition on steroids. For example, it can look at an image and tell if the faces are male or female, if the faces are smiling or frowning, or if the eyes are open or shut. It can tell if the ground is gravel or paved or grass, for instance. Very cool analysis. But wait, let’s jam on this for a minute.

This is the most impressive image analysis software I have ever seen, but we are still a LONG way off from real in-depth image content analysis. Here is an example of what can and can’t be done yet with image recognition. I have a picture of my dog sitting under a Cottonwood tree in my back yard in Gilbert, AZ. Image analysis can tell the GPS coordinates, date and time (from the photo’s metadata) and that there is a dog and a tree and the sky is blue and the grass is green. Very cool, but it cannot tell that it is a Cottonwood tree and a mixed-breed German Shepherd with fleas and that the grass needs cutting!! All in all, still very cool that Amazon can do this.

In other words, today, in many cases, a person has to enter the specific metadata about the Cottonwood and the German Shepherd and any other information in the background of the image, such as swing sets, the garage and the lawn mower. Overall, this technology is progressing rapidly and is certainly very cool, but I wanted you to be aware that there is still a long way to go with image recognition.

References

- What is Amazon Rekognition


Regression Analysis and Predictive Analytics

Posted - 12/07/2018

Did You Know the most popular statistical analysis methodology used in Predictive Analytics is Linear Regression? Linear regression is used more often than k-Nearest Neighbors (k-NN), Decision Trees, and Naïve Bayes. These methods are used mostly to classify and make sense of data and to find associations within that data. Believe it or not, y’all have seen predictive analytics in action every summer in hurricane season. When the weather channel shows the predicted hurricane path, it always widens out as the path gets longer. That widened path is the prediction interval spread about the non-linear predicted path. Gotta luv it.

Going a bit further, Linear Regression is a statistical method that uses a least squares approach to fit a line through all the data points. Once the line has been established, one can use the equation of the line to predict future values along that line. There is some uncertainty here, given there is variability in the data set. We can extend linear regression to nonlinear regression and multiple regression, where we use curves rather than straight lines and more than two variables in the training data set. The regression analysis algorithms are used to predict a future numerical value from a set of past numerical data values.
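
Here is a minimal Python sketch of the idea, including the widening band from the hurricane example; the CPU numbers are made up, and the interval is a rough two-sigma approximation, not a formal derivation:

```python
# A sketch of least squares plus the widening prediction interval that the
# hurricane-path analogy describes. All CPU figures are fabricated.
import numpy as np

days = np.array([1, 2, 3, 4, 5, 6, 7])
cpu = np.array([41.0, 44, 46, 49, 53, 55, 58])      # percent utilization

slope, intercept = np.polyfit(days, cpu, 1)         # least squares fit
resid_sd = np.std(cpu - (slope * days + intercept))
sxx = ((days - days.mean()) ** 2).sum()

print(f"CPU% = {slope:.2f} * day + {intercept:.2f}")
for future in (10, 20, 30):
    pred = slope * future + intercept
    # The band grows the farther we extrapolate -- like the hurricane cone.
    band = 2 * resid_sd * np.sqrt(1 + 1 / len(days) + (future - days.mean()) ** 2 / sxx)
    print(f"Day {future}: {pred:.1f}% +/- {band:.1f}")
```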

Another example of a regression model is market research. It could include understanding how the likelihood to purchase online is affected by the ease of product search and the delivery cost. The regression output could show that the ease of product search has a stronger association with the likelihood to purchase and, as a result, more focus should be placed on improving that variable over delivery cost.

References

- Predictive Analytics and Regression Models Explained by Cassandra McNeill, December 5, 2017
- What is Regression Analysis and Why Should I Use It? by Ben Foley, February 14, 2018


Site Last Updated on 07/23/2019

This site created and maintained by Jim Adams, Systems Engineer, Gilbert, AZ. The site is 100% W3C compliant. Version 1.0 - July 2019
Engineered By Jim Adams