# You should write down your answer in your notes!
Introduction to Data Science
What is Data Science?
Important Information
Email: joanna_bieri@redlands.edu
Office Hours take place in Duke 209 – Office Hours Schedule
Class Website:
Introducing Data Science:
Reading from: Data Science for Beginners
Author: Dmitry Soshnikov
What is Data?
In our everyday life, we are constantly surrounded by data. The text you are reading now is data. The list of phone numbers of your friends in your smartphone is data, as well as the current time displayed on your watch. As human beings, we naturally operate with data by counting the money we have or by writing letters to our friends.
However, data became much more critical with the creation of computers. The primary role of computers is to perform computations, but they need data to operate on. Thus, we need to understand how computers store and process data.
With the emergence of the Internet, the role of computers as data handling devices increased. If you think about it, we now use computers more and more for data processing and communication, rather than actual computations. When we write an e-mail to a friend or search for some information on the Internet, we are essentially creating, storing, transmitting, and manipulating data.
You Try: List some examples of data that you might interact with. This could be data about yourself or data about a project that you are interested in.
What is Data Science?
On Wikipedia, Data Science is defined as a scientific field that uses scientific methods to extract knowledge and insights from structured and unstructured data, and apply knowledge and actionable insights from data across a broad range of application domains.
This definition highlights the following important aspects of data science:
- The main goal of data science is to extract knowledge from data, in other words, to understand data, find hidden relationships, and build a model.
- Data science uses scientific methods, such as probability and statistics. In fact, when the term data science was first introduced, some people argued that data science was just a new fancy name for statistics. Nowadays it has become evident that the field is much broader.
- Obtained knowledge should be applied to produce some actionable insights, i.e. practical insights that you can apply to real business situations.
- We should be able to operate on both structured and unstructured data. We will come back to discuss different types of data later in the course.
- The application domain is an important concept: data scientists often need at least some degree of expertise in the problem domain, for example finance, medicine, or marketing.
What are the types of data?
As we have already mentioned, data is everywhere. We just need to capture it in the right way! It is useful to distinguish between structured and unstructured data. The former is typically represented in some well-structured form, often as a table or a number of tables, while the latter is just a collection of files. Sometimes we can also talk about semi-structured data, which has some sort of structure, although that structure may vary greatly.
Structured | Semi-structured | Unstructured |
---|---|---|
List of people with their phone numbers | Wikipedia pages with links | Text of Encyclopedia Britannica |
Temperature in all rooms of a building at every minute for the last 20 years | Collection of scientific papers in JSON format with authors, date of publication, and abstract | File share with corporate documents |
Data for age and gender of all people entering the building | Internet pages | Raw video feed from surveillance camera |
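
To make the three columns of the table more concrete, here is a minimal Python sketch that stores similar information in each of the three forms. All of the names and values in it (`phone_book`, `papers`, `note`, and the people and papers they mention) are invented for illustration and do not come from the reading.

```python
import csv
import io
import json

# Structured: every record has the same fields, so it fits in a table.
# (Hypothetical phone list, invented for illustration.)
phone_csv = "name,phone\nAda,555-0100\nGrace,555-0101\n"
phone_book = list(csv.DictReader(io.StringIO(phone_csv)))

# Semi-structured: JSON records have named fields, but the fields can
# differ from one record to the next.
papers_json = """
[
  {"title": "Paper A", "authors": ["Ada"], "abstract": "A short abstract."},
  {"title": "Paper B", "authors": ["Grace", "Alan"], "year": 2020}
]
"""
papers = json.loads(papers_json)

# Unstructured: plain text has no fields at all, so we have to decide
# ourselves how to pull information out of it.
note = "Call Ada tomorrow about the temperature sensors in room 12."

print(phone_book[0]["phone"])           # look up a column by name
print([p.get("year") for p in papers])  # a field may be missing in some records
print(len(note.split()))                # a simple word count is about all we get for free
```

Notice that the structured data can be queried by column name right away, the semi-structured data needs a check for missing fields, and the unstructured text needs extra work before it tells us anything.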
Where can you get data?
There are many possible sources of data, and it would be impossible to list all of them! However, let’s mention some of the typical places where you can get data:
- Structured
  - Internet of Things (IoT), including data from different sensors, such as temperature or pressure sensors, provides a lot of useful data. For example, if an office building is equipped with IoT sensors, we can automatically control heating and lighting in order to minimize costs.
  - Surveys that we ask users to complete after a purchase, or after visiting a web site.
  - Analysis of behavior can, for example, help us understand how deeply a user goes into a site, and what the typical reason for leaving the site is.
- Unstructured
  - Texts can be a rich source of insights, such as an overall sentiment score, keywords, and semantic meaning.
  - Images or Video. A video from a surveillance camera can be used to estimate traffic on the road, and inform people about potential traffic jams.
  - Web server Logs can be used to understand which pages of our site are most often visited, and for how long.
- Semi-structured
  - Social Network graphs can be great sources of data about user personalities and potential effectiveness in spreading information around.
  - When we have a bunch of photographs from a party, we can try to extract Group Dynamics data by building a graph of people taking pictures with each other.
By knowing different possible sources of data, you can try to think about different scenarios where data science techniques can be applied to understand a situation better, and to improve business processes.
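As one small example of turning a data source into an insight, the sketch below counts which pages are visited most often from web server logs. The log lines and their simplified "time method path status" layout are assumptions made up for this example (real server logs contain more fields), and `count_page_visits` is a hypothetical helper, not part of any library.

```python
from collections import Counter

# Hypothetical log lines in a simplified "time method path status" layout.
# Real server logs have more fields; this layout is an assumption.
log_lines = [
    "2024-01-05T10:00:01 GET /index.html 200",
    "2024-01-05T10:00:07 GET /courses.html 200",
    "2024-01-05T10:01:12 GET /index.html 200",
    "2024-01-05T10:02:30 GET /missing.html 404",
    "2024-01-05T10:03:44 GET /index.html 200",
]


def count_page_visits(lines):
    """Count successful (status 200) requests per page path."""
    visits = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip lines that do not match the assumed layout
        _time, _method, path, status = parts
        if status == "200":
            visits[path] += 1
    return visits


visits = count_page_visits(log_lines)

# Print pages from most visited to least visited.
for path, count in visits.most_common():
    print(path, count)
```

Even this tiny script follows the pattern described above: raw text lines go in, and a short, readable summary of user behavior comes out.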
You Try: For the data examples you gave above, say whether each one is Structured, Unstructured, or Semi-structured (or none), and comment on why. Can you come up with an example for each category?
# You should write down your answer in your notes!
Why is Data Science so AWESOME?
There are lots of reasons:
- You can tell an effective story and make a point about something you care about.
- You can make really beautiful visualizations.
- You can understand the world and answer questions about the world.
- You can get a job!
Here are some great examples of impressive Data Science Visualizations and Projects.