Datascience 161212042051

DEMYSTIFYING DATA SCIENCE
NAUBER GOIS
Doctoral student in Applied Informatics at Unifor
Research area: reinforcement learning and metaheuristics applied to performance testing.
Development Analyst at Serpro
AGENDA
DEFINITION OF DATA SCIENCE
TYPES OF MODELS
APPLICATION EXAMPLES
REPORTS
A/B TESTS
WHAT IS DATA SCIENCE?
DATA SCIENCE IS YET ANOTHER TERM USED TO DESCRIBE THE PROCESS OF TRANSFORMING DATA INTO KNOWLEDGE. (LOUKIDES, 2016)
DATA SCIENTIST
Mathematics and statistics
Databases and programming
Business knowledge
Communication
Using (collecting, storing, publishing) data is not data science. You have to add value to the data and enable new ways of using it.
Data scientist: Person who is better at statistics than any software engineer and better at software engineering than any statistician - Josh Wills
DATA SCIENCE VENN DIAGRAM
Feature Selection and Dimensionality Reduction
U. M. Fayyad, G. Piatetsky-Shapiro and P. Smyth (1995)
PROCESS?
COLLECT AND PROCESS THE DATA
Conduct a research experiment.
Collect samples from a population.
Transform, filter and summarize the data.
Prepare the data for the chosen model.
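The filter, transform and summarize steps can be sketched in a few lines of plain Python. The sample records, the field meanings and the choice of min-max scaling are assumptions made for illustration, not part of the original slides.

```python
from statistics import mean, median

# Hypothetical raw samples: (age, monthly_income); None marks a missing value.
raw_samples = [(25, 3000.0), (41, None), (33, 5200.0), (58, 4100.0), (19, 1500.0)]

# Filter: drop samples with a missing income.
clean = [(age, income) for age, income in raw_samples if income is not None]

# Transform: rescale income to the [0, 1] range (min-max scaling).
incomes = [income for _, income in clean]
low, high = min(incomes), max(incomes)
scaled = [(age, (income - low) / (high - low)) for age, income in clean]

# Summarize: basic descriptive statistics of the cleaned data.
summary = {
    "n": len(clean),
    "mean_income": mean(incomes),
    "median_income": median(incomes),
}
print(summary)
```

Which transformation is appropriate depends on the model chosen later; scaling is only one common option.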
FORMULATING A PROBLEM
Identify an area of interest and the type of model
Clustering
CHOOSING THE TYPE OF MODEL TO APPLY
Common Data Mining tasks
[Figure: three scatter plots on axes X1 and X2, one per task: Clustering, Classification, Regression]
Clustering: k-th Nearest Neighbour; Parzen Window; Unfolding, Conjoint Analysis, Cat-PCA
Classification: Linear Discriminant Analysis, QDA; Logistic Regression (Logit); Decision Trees, LSSVM, NN, VS
Regression: Classical Linear Regression; Ridge Regression; NN, CART
TYPES OF MODELS
REGRESSION
CLASSIFICATION
CLUSTERING
CLASSIFICATION VS REGRESSION
WHAT IS REGRESSION?
REGRESSION
[Figure: scatter plot with a fitted regression line; source: https://plot.ly/pandas/line-and-scatter/]
Coefficients: [0.00118801]
Mean squared error: 9.71
Variance score: -2.38
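A regression line of this kind can be fitted with ordinary least squares. The following is a minimal sketch in plain Python, assuming a single input variable; the data points are invented so that the expected coefficients are easy to check by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    return slope * x + intercept


# Hypothetical data lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_line(xs, ys)
print(slope, intercept, predict(6.0, slope, intercept))
```

The slope plays the role of the "Coefficients" value printed by a library such as scikit-learn.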
WHICH IS THE BEST PREDICTION?
REGRESSION ERROR
BIAS AND OVERFITTING
OCCAM'S RAZOR
• If the results are similar, choose the simplest solution.
• In data science, always prefer the simplest model.
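One way to turn this rule into a procedure is to pick the least complex model whose validation error is within a small tolerance of the best one. The sketch below assumes that approach; the candidate models, their complexity values, their error values and the 5% tolerance are all hypothetical.

```python
def simplest_good_model(candidates, tolerance=0.05):
    """Pick the least complex candidate whose error is close to the best.

    candidates: list of (name, complexity, validation_error) tuples.
    tolerance:  relative slack allowed over the best validation error.
    """
    best_error = min(error for _, _, error in candidates)
    good_enough = [c for c in candidates if c[2] <= best_error * (1 + tolerance)]
    return min(good_enough, key=lambda c: c[1])


# Hypothetical validation errors for decision trees of increasing depth.
candidates = [
    ("depth-1 tree", 1, 0.31),
    ("depth-3 tree", 3, 0.205),
    ("depth-10 tree", 10, 0.20),
]

chosen = simplest_good_model(candidates)
print(chosen[0])
```

Here the depth-10 tree has the lowest error, but the depth-3 tree is nearly as good and much simpler, so it is the one selected.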
DATA CORRELATION
WHAT IS CLASSIFICATION?
CLASSIFICATION
[Figure: points labeled A and B, with an unlabeled point "?" to be classified]
CLASSIFIERS
WHICH TAG TO APPLY?
[Figure: items tagged A, B and C, with an untagged item "?"]
TRAINING DATA VS TEST DATA
• Training data
  • Used to train a model
  • Examples
• Test data
  • Used to test the model's performance
• Validation data.
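The three sets can be produced by shuffling the examples once and slicing them. This is a minimal sketch; the 60/20/20 proportions and the fixed seed are assumptions chosen for illustration.

```python
import random


def split_dataset(examples, train_frac=0.6, validation_frac=0.2, seed=42):
    """Shuffle the examples and split them into train, validation and test."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * validation_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains
    return train, validation, test


train, validation, test = split_dataset(range(10))
print(len(train), len(validation), len(test))
```

The key property is that the three sets never overlap, so the test score is measured on examples the model has not seen.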
Classification
Name     Age      Income        Occupation   Class
Daniel   ≤ 30     Medium        Student      Yes
João     31..50   Medium-High   Teacher      Yes
Carlos   31..50   Medium-High   Engineer     Yes
Maria    31..50   Low           Saleswoman   No
Paulo    ≤ 30     Low           Doorman      No
Otavio   > 60     Medium-High   Retiree      No
IF Age ≤ 30 AND Income is Medium THEN Buys-Electronic-Product = YES.
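The rule above can be written as a small function and applied to the rows of the table. This is an illustrative sketch: the dictionary keys and the string encoding of the age and income bands are assumptions.

```python
# Rows of the table, with age and income kept as categorical bands.
customers = [
    {"name": "Daniel", "age": "<=30", "income": "medium", "buys": True},
    {"name": "João", "age": "31..50", "income": "medium-high", "buys": True},
    {"name": "Carlos", "age": "31..50", "income": "medium-high", "buys": True},
    {"name": "Maria", "age": "31..50", "income": "low", "buys": False},
    {"name": "Paulo", "age": "<=30", "income": "low", "buys": False},
    {"name": "Otavio", "age": ">60", "income": "medium-high", "buys": False},
]


def buys_electronics(customer):
    """IF age <= 30 AND income is medium THEN buys = yes."""
    return customer["age"] == "<=30" and customer["income"] == "medium"


predictions = [buys_electronics(c) for c in customers]
for customer, predicted in zip(customers, predictions):
    print(customer["name"], "predicted:", predicted, "actual:", customer["buys"])
```

On this table the single rule only covers Daniel; João and Carlos also buy but fall outside it, which is why a classifier usually learns several rules.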

SUPERVISED VS UNSUPERVISED LEARNING
• Supervised
  • The inputs and outputs of the data are known
  • The data has a label
  • The goal is to predict the class or label of a data point
• Unsupervised
  • No prior knowledge of the data
  • The goal is to find patterns
WHAT IS CLUSTERING?
CLUSTERING
The process of grouping objects with similar characteristics
CLUSTER
HOW TO GROUP?
DISTANCE
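Grouping by distance can be illustrated by assigning each point to its closest cluster center. The sketch below assumes Euclidean distance and two hand-picked centers; both the points and the centers are hypothetical.

```python
import math


def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def nearest_centroid(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: euclidean(point, centroids[i]))


# Hypothetical cluster centers and points.
centroids = [(0.0, 0.0), (10.0, 10.0)]
points = [(1.0, 2.0), (9.0, 8.0), (3.0, 4.0)]

labels = [nearest_centroid(p, centroids) for p in points]
print(labels)
```

This assignment step is the core of algorithms such as k-means, which additionally move the centers until the assignments stop changing.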
CLUSTERING MODELS
TEXT ANALYSIS
ANALYZING TEXT
Almost three months after the end of the Games, the Rio 2016 Committee still owes refunds to 8,000 fans who used its online platform to resell tickets. The organization has reduced the number of consumers it owed payments to, which reached 140,000 on October 19, the date by which it had promised to settle the debts. But it has not yet put an end to the problem. The organization said it is having difficulty reimbursing the rest. It cites problems locating the creditors and inconsistencies in the bank details provided; many deposits were not completed. According to the committee, 3,500 people were contacted but did not respond to the emails, 2,500 did reply but the information they provided contained some error, and 2,000 should receive their refund by this Monday (12th), after having their data checked.

A seemingly insignificant mutation in the DNA of humanity's ancestors may have helped our brain reach the enormous size it has today (three times larger than that of the great apes). Simply inserting the gene that carries this mutation into mouse fetuses was enough to double the number of cells that give rise to the neurons of the cortex, the "noblest" area of the brain. The research, conducted by scientists at the Max Planck Institute (Germany), is one of the first results of the effort to use the genome to understand how human evolution unfolded. So far this has not been easy: the gene analyzed by the researchers in the new study, designated by the unwieldy acronym ARHGAP11B, is the only one specific to the human lineage to be associated with the proliferation of these cells of the cerebral cortex.
SENTIMENT ANALYSIS
>>> docs_new = ['God is love', 'OpenGL on the GPU is fast']
>>> X_new_counts = count_vect.transform(docs_new)
>>> X_new_tfidf = tfidf_transformer.transform(X_new_counts)
>>> predicted = clf.predict(X_new_tfidf)
>>> for doc, category in zip(docs_new, predicted):
...     print('%r => %s' % (doc, twenty_train.target_names[category]))
...
'God is love' => soc.religion.christian
'OpenGL on the GPU is fast' => comp.graphics
TEXT REPRESENTATION
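A common way to represent text for a model is a bag of words: each document becomes a vector of word counts over a shared vocabulary. The sketch below is a simplified stand-in for what a count vectorizer does, assuming whitespace tokenization and lowercasing; the two documents are the ones used in the example above.

```python
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


docs = ["God is love", "OpenGL on the GPU is fast"]

# Shared vocabulary: every distinct word, in alphabetical order.
vocabulary = sorted({word for doc in docs for word in tokenize(doc)})

# One count vector per document, aligned with the vocabulary.
vectors = [[Counter(tokenize(doc))[word] for word in vocabulary] for doc in docs]

print(vocabulary)
for vector in vectors:
    print(vector)
```

TF-IDF weighting builds on these counts by down-weighting words such as "is" that appear in many documents.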
ANALYZING THE PERFORMANCE OF A MODEL
ACCURACY
PERFORMANCE OF A CLASSIFIER
• Accuracy = correctly classified examples / total number of examples
• Error = 1 - Accuracy
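Both quantities follow directly from comparing the true labels with the predicted ones. A minimal sketch, with hypothetical labels:

```python
def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical true and predicted labels for six examples.
y_true = ["yes", "yes", "yes", "no", "no", "no"]
y_pred = ["yes", "no", "no", "no", "no", "no"]

acc = accuracy(y_true, y_pred)
error = 1 - acc
print(acc, error)
```

Four of the six predictions match, so the accuracy is 4/6 and the error is 2/6.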
PERFORMANCE OF A REGRESSION
One way to assess the quality of a model's fit is the coefficient of determination (R²). Essentially, this coefficient indicates how much of the variation in the collected data the model was able to explain. It is given by R² = SSR / SST, the ratio between the regression sum of squares (SSR) and the total sum of squares (SST).
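The ratio can be computed directly from the observed and predicted values. In the sketch below the data are hypothetical and the predictions are assumed to come from the least-squares line y = 1.9x fitted to them. For a least-squares fit with an intercept, the ratio form equals the residual form 1 - SSE/SST, which is the form many libraries report.

```python
def r_squared(y_true, y_pred):
    """Regression sum of squares divided by total sum of squares."""
    mean_y = sum(y_true) / len(y_true)
    ssr = sum((p - mean_y) ** 2 for p in y_pred)
    sst = sum((t - mean_y) ** 2 for t in y_true)
    return ssr / sst


def r_squared_residual(y_true, y_pred):
    """Equivalent form for least-squares fits: 1 - SSE / SST."""
    mean_y = sum(y_true) / len(y_true)
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    sst = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - sse / sst


# Hypothetical observations at x = 1, 2, 3, 4 and their least-squares predictions.
y_true = [2.0, 4.0, 5.0, 8.0]
y_pred = [1.9, 3.8, 5.7, 7.6]

r2_ratio = r_squared(y_true, y_pred)
r2_residual = r_squared_residual(y_true, y_pred)
print(r2_ratio, r2_residual)
```

The residual form can become negative when predictions are worse than always predicting the mean, which is one way a negative score such as the -2.38 reported in the regression example can arise.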
CONFUSION MATRIX
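A confusion matrix counts, for each actual class, how many examples were predicted as each class. The sketch below assumes the convention of rows for actual labels and columns for predicted labels; the spam/ham labels are hypothetical.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Rows are actual labels, columns are predicted labels."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(actual, predicted)] for predicted in labels] for actual in labels]


# Hypothetical true and predicted labels.
y_true = ["spam", "spam", "ham", "ham", "ham", "spam"]
y_pred = ["spam", "ham", "ham", "ham", "spam", "spam"]

matrix = confusion_matrix(y_true, y_pred, labels=["spam", "ham"])
for row in matrix:
    print(row)
```

The diagonal holds the correct predictions; the off-diagonal cells show which classes are being confused with each other.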
MEAN SQUARED ERROR
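The mean squared error is the average of the squared differences between observed and predicted values. A minimal sketch with hypothetical numbers:

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared prediction errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical observed and predicted values.
y_true = [2.0, 4.0, 5.0, 8.0]
y_pred = [1.9, 3.8, 5.7, 7.6]

mse = mean_squared_error(y_true, y_pred)
rmse = mse ** 0.5  # root mean squared error, in the same unit as the target
print(mse, rmse)
```

Because errors are squared, a few large mistakes weigh more than many small ones.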

TOOLS AND LANGUAGES
NUMPY
PANDAS
SKLEARN
R STUDIO
NUMPY
Python library for manipulating arrays and matrices
PANDAS
Python library for data manipulation and analysis
NUMPY AND PANDAS - READING A CSV FILE
import numpy as np
import pandas as pd
import visuals as vs  # Supplementary code
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import ShuffleSplit

# Load the Boston housing dataset
data = pd.read_csv('housing.csv')
prices = data['MEDV']
features = data.drop('MEDV', axis=1)
NUMPY AND PANDAS - MEAN, MEDIAN AND STANDARD DEVIATION
# Mean price of the data
mean_price = np.mean(prices)
# Median price of the data
median_price = np.median(prices)
# Standard deviation of prices of the data
std_price = np.std(prices)
# Show the calculated statistics
print("Statistics for Boston housing dataset:\n")
print("Mean price: ${:,.2f}".format(mean_price))
print("Median price ${:,.2f}".format(median_price))
print("Standard deviation of prices: ${:,.2f}".format(std_price))
SKLEARN
• Simple and efficient tools for data mining and data analysis
• Built on NumPy, SciPy, and matplotlib
• Open source, commercially usable - BSD license
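CLUSTERING ALGORITHMS IN SKLEARN
CLASSIFICATION IN SKLEARN
from sklearn import svm
# digits, data and n_samples are assumed to be defined earlier (not shown on the slide)
# Create a classifier: a support vector classifier
classifier = svm.SVC(gamma=0.001)
# We learn the digits on the first half of the digits
# (integer division: slice indices must be integers in Python 3)
classifier.fit(data[:n_samples // 2], digits.target[:n_samples // 2])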
CLASSIFYING EMAILS
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
…
classifier.fit(counts, targets)
examples = ['Free Viagra call today!', "I'm going to attend the Linux users group tomorrow."]
….
predictions = classifier.predict(example_counts)
predictions # [1, 0]
HTTP://SCIKIT-LEARN.ORG/STABLE/
THE R LANGUAGE
R is a language and also an integrated environment for statistical computing and graphics.
It was originally created by Ross Ihaka and Robert Gentleman at the Department of Statistics of the University of Auckland, New Zealand.
APPLICATION EXAMPLES
EXAMPLE PROBLEM - AREA: BIOLOGY - TYPE: REGRESSION
library("MASS")
data(cats)
ID   SEX   BODY WEIGHT (kg)   HEART WEIGHT (g)
1    F     2.0                7.0
2    F     2.0                7.4
3    F     2.0                9.5
4    F     2.1                7.2
5    F     2.1                7.3
6    F     2.1                7.6
7    F     2.1                8.1
8    F     2.1                8.2
R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS. To download R, please choose your preferred CRAN mirror.
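HEART WEIGHT VS BODY WEIGHT
[Figure: scatter plot of heart weight (y axis) against body weight (x axis)]
ggplot(a, aes(Bwt, Hwt)) + geom_point()
[Figure: the same scatter plot with a smoothed trend line]
ggplot(a, aes(Bwt, Hwt)) + geom_point() + geom_smooth()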
EXAMPLE: PREDICTING HOUSE PRICES IN BOSTON
EXAMPLE PROBLEM - REAL ESTATE PRICES IN BOSTON
• 'RM': average number of rooms
• 'LSTAT': percentage of homeowners considered "lower class" (working poor)
• 'PTRATIO': ratio of students to teachers in the neighborhood
RM      LSTAT   PTRATIO   PRICE
6.575   4.98    15.3      504000
6.421   9.14    17.8      453600
7.185   4.03    17.8      728700
6.998   2.94    18.7      701400
7.147   5.33    18.7      760200
6.43    5.21    18.7      602700
6.012   12.43   15.2      480900
6.172   19.15   15.2      569100
5.631   29.93   15.2      346500
6.004   17.1    15.2      396900
6.377   20.45   15.2      315000
6.009   13.27   15.2      396900
5.889   15.71   15.2      455700
5.949   8.26    21        428400
6.096   10.26   21        382200
DECISION TREES
BOSTON HOUSE PRICES
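REGRESSION CODE IN SKLEARN
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV
# scoring_fnc, cv_sets, X and y are assumed to be defined earlier (not shown on the slide)
regressor = DecisionTreeRegressor(random_state=42)
params = {"max_depth": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}
# Create the grid search object
grid = GridSearchCV(regressor, param_grid=params, scoring=scoring_fnc, cv=cv_sets)
# Fit the grid search object to the data to compute the optimal model
grid = grid.fit(X, y)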
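PREDICTING
# Produce a matrix for client data
client_data = [[5, 17, 15],  # Client 1
               [4, 32, 22],  # Client 2
               [8, 3, 12]]   # Client 3
# Show predictions (the print statement is reconstructed from the output line below)
for i, price in enumerate(reg.predict(client_data)):
    print("Predicted selling price for Client {}'s home: ${:,.2f}".format(i + 1, price))
Predicted selling price for Client 1's home: $414,473.68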

EXAMPLE: STUDENT INTERVENTION
Feature values:
school sex age address famsize Pstatus Medu Fedu Mjob Fjob \
0 GP F 18 U GT3 A 4 4 at_home teacher
1 GP F 17 U GT3 T 1 1 at_home other
2 GP F 15 U LE3 T 1 1 at_home other
3 GP F 15 U GT3 T 4 2 health services
4 GP F 16 U GT3 T 3 3 other other
... higher internet romantic famrel freetime goout Dalc Walc health \
0 ... yes no no 4 3 4 1 1 3
1 ... yes yes no 5 3 3 1 1 3
2 ... yes yes no 4 3 2 2 3 3
3 ... yes yes yes 3 2 2 1 1 5
4 ... yes no no 4 3 2 1 2 5
absences
0 6
1 4
2 10
3 2
4 4
PREPARING THE DATA
# If data type is non-numeric, replace all yes/no values with 1/0
if col_data.dtype == object:
    col_data = col_data.replace(['yes', 'no'], [1, 0])
SPLITTING TRAINING AND TEST DATA
X_train, X_test, y_train, y_test = train_test_split(
    X_all, y_all, stratify=y_all, train_size=train_size, test_size=0.24)
INITIALIZING MODELS
clf_A = svm.SVC(random_state=42)
clf_B = tree.DecisionTreeClassifier(random_state=42)
clf_C = AdaBoostClassifier(tree.DecisionTreeClassifier(max_depth=1),
                           algorithm="SAMME",
                           n_estimators=300, random_state=42)
clf_D = KNeighborsClassifier(n_neighbors=3)
clf_E = GaussianNB()
clf_F = RandomForestClassifier(n_estimators=100, random_state=42)
clf_G = LogisticRegression(random_state=42)
MODEL COMPARISON
Logistic Regression Pros:
• Efficient implementation
Logistic Regression Cons:
• Does not perform well with many features
Decision Trees Pros:
• Intuitive decision rules
• Can handle nonlinear features
Decision Trees Cons:
• Tends to overfit the training set [Random Forests can be the solution]
• No ranking score
SVM Pros:
• Can handle a large number of features
SVM Cons:
• Does not perform well on datasets with a large number of rows
CLASSIFIER PERFORMANCE
SEGMENTING SUPPLIERS
PRODUCT DATABASE
CHECKING CORRELATION
PCA - REDUCING DIMENSIONS
GENERATING THE CLUSTERS
clusterer = KMeans(n_clusters=i, random_state=29).fit(reduced_data)
preds = clusterer.predict(reduced_data)
SITES WHERE TO FIND INFORMATION
HTTPS://ENSINANDOMAQUINASBLOG.WORDPRESS.COM
WHERE TO DISCOVER NEW INFORMATION
KAGGLE
REPORTS
▸ Online notebooks: iPython, Jupyter
▸ They allow the creation of interactive documents in several analysis languages.
▸ Reproducible Research!
A/B TEST
An A/B test is a testing method that compares two practices, A and B, which serve as the control and the treatment of a controlled experiment, with the goal of improving the approval rate.
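One common way to decide whether B really outperforms A is a two-proportion z-test on the approval counts. The slides do not name a specific test, so this choice is an assumption, and the visitor and approval counts below are hypothetical.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two approval rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that A and B perform the same.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF (via the error function).
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value


# Hypothetical experiment: control A and treatment B, 1000 users each.
z, p_value = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=130, n_b=1000)
print(z, p_value)
```

A small p-value suggests the difference between A and B is unlikely to be due to chance alone; the threshold (often 0.05) should be fixed before running the experiment.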
BOOKS
Questions:
naubergois@gmail.com
