Network Evolution (Preferential Attachment Model, Small World Model, Link Prediction)

Original Source: https://www.coursera.org/specializations/data-science-python

Degree Distributions

The degree of a node in an undirected graph is the number of neighbors it has.

The degree distribution of a graph is the probability distribution of the degrees over the entire network.

[Figure: example undirected network of 9 nodes (A–I), the same graph G constructed in the code below]

The degree distribution, $P(k)$, of the above network has the following values:

$P(1)=\frac{2}{9}, P(2)=\frac{4}{9}, P(3)=\frac{2}{9}, P(4)=\frac{1}{9}$

Directed Graph

The definition is essentially the same as for undirected graphs, except that each node has both an in-degree (number of incoming edges) and an out-degree (number of outgoing edges), each with its own distribution.
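As a quick illustration (hypothetical edges), in-degree and out-degree are computed separately for directed graphs:

import networkx as nx

D = nx.DiGraph()
D.add_edges_from([('A', 'B'), ('B', 'C'), ('C', 'A'), ('A', 'C')])
print(dict(D.in_degree()))   # {'A': 1, 'B': 1, 'C': 2}
print(dict(D.out_degree()))  # {'A': 2, 'B': 1, 'C': 1}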

import networkx as nx
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
%matplotlib inline

G=nx.Graph()
G.add_edges_from([('A','B'), ('A','G'), ('A','H'), ('B','C'), ('C','D'), ('C','E'), ('D','F'), ('F','G'), ('G','H'), ('G','I')])

degrees = dict(G.degree())  # for directed graphs, use G.in_degree() or G.out_degree()
degree_values = sorted(set(degrees.values()))
histogram = [list(degrees.values()).count(i)/float(nx.number_of_nodes(G)) for i in degree_values]

plt.figure()
plt.bar(degree_values,histogram)
plt.title('Degree Distribution of Graph G')
plt.xlabel('Degree')
plt.ylabel('Fraction of Nodes')
plt.show()

[Figure: bar chart of the degree distribution of graph G]

Degree Distributions in Real Networks

[Figure: degree distributions of three real networks, plotted on log-log axes]

A – Actors: network of 225,000 actors connected when they appear in a movie together.
B – The Web: network of 325,000 documents on the WWW connected by URLs.
C – US Power Grid: network of 4,941 generators connected by transmission lines.

Degree distributions of real-world networks tend to look like a straight line when plotted on a log-log scale (both the x and y axes logarithmic), as shown above; that is, they tend to follow a power law.
Power law: $P(k)=Ck^{-\alpha}$, where $\alpha$ and $C$ are constants. Taking logarithms gives $\log P(k)=\log C-\alpha\log k$, a straight line with slope $-\alpha$ on log-log axes.

Networks with a power law degree distribution have many nodes of small degree and a few nodes of very large degree.
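As a quick numeric check (hypothetical values with $C=1$ and $\alpha=2$), equal steps in $\log k$ produce equal steps in $\log P(k)$, which is why the plot is a straight line:

import numpy as np

k = np.array([1, 10, 100, 1000])
P = k ** -2.0         # P(k) = C * k^(-alpha) with C = 1, alpha = 2
print(np.log10(P))    # [ 0. -2. -4. -6.] -- equal steps, i.e. a straight line on log-log axes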

Preferential Attachment Model

The Preferential Attachment Model explains how networks whose degree distributions follow a power law can arise.

How does this model create networks?

  • Start with two nodes connected by an edge.
  • At each time step, add a new node with an edge connecting it to an existing node.
  • Choose the node to connect to at random with probability proportional to each node’s degree.
  • The probability of connecting to a node $u$ of degree $k_u$ is $\frac{k_u}{\sum_j k_j}$.

As the number of nodes increases, the degree distribution of the network under the preferential attachment model approaches the power law $P(k)=Ck^{-3}$.
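To make the mechanism concrete, here is a minimal sketch of the growth process (an illustration, not the networkx implementation). Drawing uniformly from the list of edge endpoints is equivalent to picking a node with probability proportional to its degree:

import random

def preferential_attachment_edges(n):
    # Start with two nodes connected by an edge.
    edges = [(0, 1)]
    # Every node appears in this list once per incident edge, so a
    # uniform draw selects a node with probability proportional to its degree.
    endpoints = [0, 1]
    for new_node in range(2, n):
        target = random.choice(endpoints)
        edges.append((new_node, target))
        endpoints.extend([new_node, target])
    return edges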

Application in Networkx

Specifically, we can use nx.barabasi_albert_graph in networkx to produce such graphs.
barabasi_albert_graph(n, m) returns a network with n nodes, where each new node attaches to m existing nodes according to the Preferential Attachment model.

G = nx.barabasi_albert_graph(1000000,10)

degrees = dict(G.degree())
degree_values = sorted(set(degrees.values()))
histogram = [list(degrees.values()).count(i)/float(nx.number_of_nodes(G)) for i in degree_values]

plt.figure()
plt.plot(degree_values,histogram, 'o')
plt.title('Degree Distribution of Barabasi Albert Graph')
plt.xlabel('Degree')
plt.ylabel('Fraction of Nodes')
plt.xscale('log')
plt.yscale('log')
plt.show()

[Figure: log-log scatter plot of the degree distribution of the Barabasi-Albert graph]

Small World Networks

‘Small World’ networks have two properties:

  1. The average shortest path length between pairs of nodes in the network is small.
  2. The clustering coefficient is high. (The local clustering coefficient of a node is the fraction of pairs of the node’s friends that are themselves friends; for example, a node with 4 friends has 6 possible pairs of friends, and if 3 of those pairs are connected, its local clustering coefficient is 3/6 = 0.5.)

Evidence of the Small World Phenomenon

1. Milgram Small World Experiment (1960s)

296 randomly chosen “starters” were asked to forward a letter to a “target” person, a stockbroker in Boston. Instructions for the starters:

  • Send the letter to the target if you know him on a first-name basis.
  • If you do not know the target, send the letter (and the instructions) to someone you know on a first-name basis who is more likely to know the target. Some information about the target, such as city and occupation, was provided.

Result: 64 of the letters reached the target; the median chain length of the completed chains was 6.

2. Microsoft Instant Messenger (2008)

Nodes: 240 million active users on Microsoft Instant Messenger.
Edges: Users engaged in two-way communication over a one-month period.

Result: Estimated median path length of 7.

3. Facebook Global Network (2012)

Result: Average path length in 2008 was 5.28 and in 2011 it was 4.74.

4. Real World Cases of High Clustering Coefficient

  • Facebook 2011: high average CC, decreasing with degree (0.5 when degree=2, 0.2 when degree=50, …)
  • Microsoft Instant Messenger: average CC of 0.13.
  • IMDB actor network: average CC of 0.78.

Does the Preferential Attachment Model Reproduce Small World Properties?

Graphs created by the Preferential Attachment model have small average shortest path lengths. But since no mechanism in the model favors triangle formation, they have a low average clustering coefficient.
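We can check both claims directly (a quick sketch; the exact numbers vary with the random seed):

G_pa = nx.barabasi_albert_graph(1000, 4, seed=0)
print(nx.average_shortest_path_length(G_pa))  # small, typically around 3
print(nx.average_clustering(G_pa))            # low, typically a few hundredths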

Small World Model

The Small World Model was designed to reproduce the two properties of small world networks: a high clustering coefficient and small average shortest path lengths.

How does this model create networks?

  • Start with a ring of $n$ nodes, where each node is connected to its $k$ nearest neighbors.
  • Fix a parameter $p\in[0,1]$.
  • Consider each edge $(u,v)$. With probability $p$, select a node $w$ at random and rewire the edge $(u,v)$ so it becomes $(u,w)$.
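Here is a minimal sketch of the construction (an illustration that ignores the duplicate-edge and self-loop checks a full implementation would need):

import random

def small_world_edges(n, k, p):
    # Ring lattice: each node is connected to its k nearest neighbors (k even).
    edges = [(u, (u + j) % n) for u in range(n) for j in range(1, k // 2 + 1)]
    rewired = []
    for u, v in edges:
        if random.random() < p:
            # Rewire (u, v) to (u, w) for a randomly chosen node w.
            w = random.choice([x for x in range(n) if x != u])
            rewired.append((u, w))
        else:
            rewired.append((u, v))
    return rewired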

Variable $p$

The resulting network varies with the value of $p$ as shown below.

[Figure: networks generated by the Small World Model for several values of p]

As $p$ increases from 0 to 0.01:

  • average shortest path decreases rapidly.
  • average clustering coefficient decreases slowly.

[Figure: average shortest path length and average clustering coefficient as functions of p]

Application in Networkx

We can use nx.watts_strogatz_graph in networkx to produce such graphs.
watts_strogatz_graph(n, k, p) returns a small world network with n nodes, starting with a ring lattice with each node connected to its k nearest neighbors, and rewiring probability p.

G = nx.watts_strogatz_graph(100,5,0.1)
print('average shortest path length: {}'.format(nx.average_shortest_path_length(G)))
print('average clustering coefficient: {}'.format(nx.average_clustering(G)))
plt.figure(figsize=(14, 8))
nx.draw_networkx(G, pos=nx.circular_layout(G), with_labels=False, alpha=0.5)
average shortest path length: 4.893535353535354
average clustering coefficient: 0.36333333333333334

[Figure: the Watts-Strogatz graph drawn in a circular layout]

Small world networks can be disconnected, which is sometimes undesirable. connected_watts_strogatz_graph(n, k, p, t) runs watts_strogatz_graph(n, k, p) up to t times, until it returns a connected small world network.

newman_watts_strogatz_graph(n, k, p) runs a model similar to the small world model, but rather than rewiring edges, new edges are added with probability p.
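For example (parameter values chosen arbitrarily):

G_conn = nx.connected_watts_strogatz_graph(100, 5, 0.1, tries=100)
G_newman = nx.newman_watts_strogatz_graph(100, 5, 0.1)
print(nx.is_connected(G_conn))    # True by construction
print(nx.is_connected(G_newman))  # True: adding edges never disconnects the ring lattice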

Link Prediction

Given a pair of nodes, how can we assess whether they are likely to connect?

The ideas are based on ‘triadic closure’: the tendency for people who share connections in a social network to become connected themselves.

I’ll introduce 7 measures: 5 that are generally applicable, and 2 that apply when community membership information is available.

The graph I will use is shown below.

[Figure: example graph G for link prediction, as constructed in the code below]

$N(X)$ is the set of neighbors of node $X$.

1. Common Neighbors

$\text{comm\_neigh}(X,Y)=|N(X)\cap N(Y)|$

G = nx.Graph()
G.add_edges_from([('A','B'), ('A','D'), ('A','E'), ('B','C'), ('B','D'), ('C','D'), ('C','F'), ('E','F'), ('E','G'), ('G','F'), ('G','I'), ('G','H')])

# Targets are pairs of nodes that are not connected directly by an edge.
targets = nx.non_edges(G)
common_neigh = [(e[0], e[1], len(list(nx.common_neighbors(G, e[0], e[1])))) for e in targets]
sorted(common_neigh, key=lambda x: x[2], reverse=True)
[('C', 'A', 2),
 ('H', 'I', 1),
 ('H', 'F', 1),
 ('H', 'E', 1),
 ('D', 'F', 1),
 ('D', 'E', 1),
 ('G', 'C', 1),
 ('G', 'A', 1),
 ('C', 'E', 1),
 ('I', 'F', 1),
 ('I', 'E', 1),
 ('B', 'F', 1),
 ('B', 'E', 1),
 ('F', 'A', 1),
 ('H', 'D', 0),
 ('H', 'C', 0),
 ('H', 'B', 0),
 ('H', 'A', 0),
 ('D', 'G', 0),
 ('D', 'I', 0),
 ('G', 'B', 0),
 ('C', 'I', 0),
 ('I', 'B', 0),
 ('I', 'A', 0)]

2. Jaccard Coefficient

$\text{jacc\_coeff}(X,Y)=\frac{|N(X)\cap N(Y)|}{|N(X)\cup N(Y)|}$

jaccard_coefficient = list(nx.jaccard_coefficient(G))
sorted(jaccard_coefficient, key=lambda x: x[2], reverse=True)
[('H', 'I', 1.0),
 ('C', 'A', 0.5),
 ('H', 'F', 0.3333333333333333),
 ('H', 'E', 0.3333333333333333),
 ('I', 'F', 0.3333333333333333),
 ('I', 'E', 0.3333333333333333),
 ('D', 'F', 0.2),
 ('D', 'E', 0.2),
 ('C', 'E', 0.2),
 ('B', 'F', 0.2),
 ('B', 'E', 0.2),
 ('F', 'A', 0.2),
 ('G', 'C', 0.16666666666666666),
 ('G', 'A', 0.16666666666666666),
 ('H', 'D', 0.0),
 ('H', 'C', 0.0),
 ('H', 'B', 0.0),
 ('H', 'A', 0.0),
 ('D', 'G', 0.0),
 ('D', 'I', 0.0),
 ('G', 'B', 0.0),
 ('C', 'I', 0.0),
 ('I', 'B', 0.0),
 ('I', 'A', 0.0)]

3. Resource Allocation

Fraction of a ‘resource’ that a node can send to another through their common neighbors.

$\text{res\_alloc}(X,Y)=\sum_{u\in N(X)\cap N(Y)}\frac{1}{|N(u)|}$

resource_allocation_index = list(nx.resource_allocation_index(G))
sorted(resource_allocation_index, key=lambda x: x[2], reverse=True)
[('C', 'A', 0.6666666666666666),
 ('D', 'F', 0.3333333333333333),
 ('D', 'E', 0.3333333333333333),
 ('G', 'C', 0.3333333333333333),
 ('G', 'A', 0.3333333333333333),
 ('C', 'E', 0.3333333333333333),
 ('B', 'F', 0.3333333333333333),
 ('B', 'E', 0.3333333333333333),
 ('F', 'A', 0.3333333333333333),
 ('H', 'I', 0.25),
 ('H', 'F', 0.25),
 ('H', 'E', 0.25),
 ('I', 'F', 0.25),
 ('I', 'E', 0.25),
 ('H', 'D', 0),
 ('H', 'C', 0),
 ('H', 'B', 0),
 ('H', 'A', 0),
 ('D', 'G', 0),
 ('D', 'I', 0),
 ('G', 'B', 0),
 ('C', 'I', 0),
 ('I', 'B', 0),
 ('I', 'A', 0)]

4. Adamic-Adar Index

Similar to resource allocation index, but with log in the denominator.

$\text{adamic\_adar}(X,Y)=\sum_{u\in N(X)\cap N(Y)}\frac{1}{\log(|N(u)|)}$

adamic_adar_index = list(nx.adamic_adar_index(G))
sorted(adamic_adar_index, key=lambda x: x[2], reverse=True)
[('C', 'A', 1.8204784532536746),
 ('D', 'F', 0.9102392266268373),
 ('D', 'E', 0.9102392266268373),
 ('G', 'C', 0.9102392266268373),
 ('G', 'A', 0.9102392266268373),
 ('C', 'E', 0.9102392266268373),
 ('B', 'F', 0.9102392266268373),
 ('B', 'E', 0.9102392266268373),
 ('F', 'A', 0.9102392266268373),
 ('H', 'I', 0.7213475204444817),
 ('H', 'F', 0.7213475204444817),
 ('H', 'E', 0.7213475204444817),
 ('I', 'F', 0.7213475204444817),
 ('I', 'E', 0.7213475204444817),
 ('H', 'D', 0),
 ('H', 'C', 0),
 ('H', 'B', 0),
 ('H', 'A', 0),
 ('D', 'G', 0),
 ('D', 'I', 0),
 ('G', 'B', 0),
 ('C', 'I', 0),
 ('I', 'B', 0),
 ('I', 'A', 0)]

5. Preferential Attachment

$\text{pref\_attach}(X,Y)=|N(X)|\cdot|N(Y)|$

preferential_attachment = list(nx.preferential_attachment(G))
sorted(preferential_attachment, key=lambda x: x[2], reverse=True)
[('D', 'G', 12),
 ('G', 'B', 12),
 ('G', 'C', 12),
 ('G', 'A', 12),
 ('D', 'F', 9),
 ('D', 'E', 9),
 ('C', 'A', 9),
 ('C', 'E', 9),
 ('B', 'F', 9),
 ('B', 'E', 9),
 ('F', 'A', 9),
 ('H', 'D', 3),
 ('H', 'C', 3),
 ('H', 'B', 3),
 ('H', 'F', 3),
 ('H', 'E', 3),
 ('H', 'A', 3),
 ('D', 'I', 3),
 ('C', 'I', 3),
 ('I', 'B', 3),
 ('I', 'F', 3),
 ('I', 'E', 3),
 ('I', 'A', 3),
 ('H', 'I', 1)]

Some measures consider the community structure of the network for link prediction.
Assume the nodes in this network belong to different communities (sets of nodes).
Pairs of nodes that belong to the same community and have many common neighbors within it are likely to form an edge.

In our graph G, let’s assume that there are two communities, {A,B,C,D} and {E,F,G,H,I}.

6. Community Common Neighbors

$\text{cn\_soundarajan\_hopcroft}(X,Y)=|N(X)\cap N(Y)|+\sum_{u\in N(X)\cap N(Y)}f(u)$,
where $f(u)=1$ if $u$, $X$, and $Y$ are in the same community, and $f(u)=0$ otherwise.

# Assign each node its community as a node attribute.
for node in ['A', 'B', 'C', 'D']:
    G.nodes[node]['community'] = 0
for node in ['E', 'F', 'G', 'H', 'I']:
    G.nodes[node]['community'] = 1

cn_soundarajan_hopcroft = list(nx.cn_soundarajan_hopcroft(G))
sorted(cn_soundarajan_hopcroft, key=lambda x: x[2], reverse=True)
[('C', 'A', 4),
 ('H', 'I', 2),
 ('H', 'F', 2),
 ('H', 'E', 2),
 ('I', 'F', 2),
 ('I', 'E', 2),
 ('D', 'F', 1),
 ('D', 'E', 1),
 ('G', 'C', 1),
 ('G', 'A', 1),
 ('C', 'E', 1),
 ('B', 'F', 1),
 ('B', 'E', 1),
 ('F', 'A', 1),
 ('H', 'D', 0),
 ('H', 'C', 0),
 ('H', 'B', 0),
 ('H', 'A', 0),
 ('D', 'G', 0),
 ('D', 'I', 0),
 ('G', 'B', 0),
 ('C', 'I', 0),
 ('I', 'B', 0),
 ('I', 'A', 0)]

7. Community Resource Allocation

Similar to the resource allocation index, but only considering nodes in the same community:

$\text{ra\_index\_soundarajan\_hopcroft}(X,Y)=\sum_{u\in N(X)\cap N(Y)}\frac{f(u)}{|N(u)|}$,
where $f(u)=1$ if $u$, $X$, and $Y$ are in the same community, and $f(u)=0$ otherwise.

ra_index_soundarajan_hopcroft = list(nx.ra_index_soundarajan_hopcroft(G))
sorted(ra_index_soundarajan_hopcroft, key=lambda x: x[2], reverse=True)
[('C', 'A', 0.6666666666666666),
 ('H', 'I', 0.25),
 ('H', 'F', 0.25),
 ('H', 'E', 0.25),
 ('I', 'F', 0.25),
 ('I', 'E', 0.25),
 ('H', 'D', 0),
 ('H', 'C', 0),
 ('H', 'B', 0),
 ('H', 'A', 0),
 ('D', 'G', 0),
 ('D', 'F', 0),
 ('D', 'I', 0),
 ('D', 'E', 0),
 ('G', 'B', 0),
 ('G', 'C', 0),
 ('G', 'A', 0),
 ('C', 'I', 0),
 ('C', 'E', 0),
 ('I', 'B', 0),
 ('I', 'A', 0),
 ('B', 'F', 0),
 ('B', 'E', 0),
 ('F', 'A', 0)]

Exercise - Predicting Salary and Future Connections via Company Emails

import networkx as nx
import pandas as pd
import numpy as np
import pickle
import matplotlib.pyplot as plt

%matplotlib inline

The file email_prediction.txt contains a company’s email network, where each node corresponds to a person at the company and each edge indicates that at least one email has been sent between the two people.

The network also contains the node attributes Department and ManagementSalary.

Department indicates the department of the company to which the person belongs, and ManagementSalary indicates whether that person is receiving a management position salary.

G = nx.read_gpickle('email_prediction.txt')

print(nx.info(G))
Name:
Type: Graph
Number of nodes: 1005
Number of edges: 16706
Average degree:  33.2458
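We can peek at one node’s attributes to see what the features will be built from (the values shown match the first row of the dataframe below):

print(G.nodes[0])  # {'Department': 1, 'ManagementSalary': 0.0}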

1. Salary Prediction

Using network G, identify the people in the network with missing values for the node attribute ManagementSalary and predict whether or not these individuals are receiving a management position salary.

# Feature Extraction

node_data = np.array(G.nodes(data=True))  # array of (node, attribute-dict) pairs
df = pd.DataFrame({'ManagementSalary': [d['ManagementSalary'] for d in node_data[:, 1]],
                   'department': [d['Department'] for d in node_data[:, 1]]},
                  index=node_data[:, 0])

df['clustering'] = pd.Series(nx.clustering(G))
df['degree'] = pd.Series(dict(G.degree()))
df['degree_centrality'] = pd.Series(nx.degree_centrality(G))
df['closeness'] = pd.Series(nx.closeness_centrality(G, normalized=True))
df['betweenness'] = pd.Series(nx.betweenness_centrality(G, normalized=True))
df['pr'] = pd.Series(nx.pagerank(G))

df.head()
   ManagementSalary  department  clustering  degree  degree_centrality  closeness  betweenness        pr
0               0.0           1    0.276423      44           0.043825   0.421991     0.001124  0.001224
1               NaN           1    0.265306      52           0.051793   0.422360     0.001195  0.001426
2               NaN          21    0.297803      95           0.094622   0.461490     0.006570  0.002605
3               1.0          21    0.384910      71           0.070717   0.441663     0.001654  0.001833
4               1.0          21    0.318691      96           0.095618   0.462152     0.005547  0.002526
test_data = df[df['ManagementSalary'].isnull()]
X_test = test_data.iloc[:, 1:]
X_test.head()
    department  clustering  degree  degree_centrality  closeness  betweenness        pr
1            1    0.265306      52           0.051793   0.422360     0.001195  0.001426
2           21    0.297803      95           0.094622   0.461490     0.006570  0.002605
5           25    0.107002     171           0.170319   0.501484     0.030995  0.004914
8           14    0.447059      37           0.036853   0.413151     0.000557  0.001059
14           4    0.215784      80           0.079681   0.442068     0.003726  0.002166
train_data = df.dropna()
X_train_dev = train_data.iloc[:, 1:]
y_train_dev = train_data['ManagementSalary']
y_train_dev.head()
0    0.0
3    1.0
4    1.0
6    1.0
7    0.0
Name: ManagementSalary, dtype: float64
from sklearn.model_selection import train_test_split
X_train, X_dev, y_train, y_dev = train_test_split(X_train_dev, y_train_dev, random_state=0)
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

def model_eval(model):
    """Fit on the training split and report train/dev ROC AUC (uses the global splits)."""
    model.fit(X_train, y_train)
    train_prob_prediction = model.predict_proba(X_train)[:, 1]
    dev_prob_prediction = model.predict_proba(X_dev)[:, 1]

    print(model)
    print('train score: {}'.format(roc_auc_score(y_train, train_prob_prediction)))
    print('dev score: {}'.format(roc_auc_score(y_dev, dev_prob_prediction)))
    print('')

model_eval(RandomForestClassifier(random_state=0))
model_eval(GradientBoostingClassifier(random_state=0))
model_eval(LogisticRegression(random_state=0))
model_eval(SVC(kernel='linear', probability=True, random_state=0))
model_eval(SVC(probability=True, random_state=0))
model_eval(KNeighborsClassifier())
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=0,
            verbose=0, warm_start=False)
train score: 0.9998817267888823
dev score: 0.8644654088050315

GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_split=1e-07, min_samples_leaf=1,
              min_samples_split=2, min_weight_fraction_leaf=0.0,
              n_estimators=100, presort='auto', random_state=0,
              subsample=1.0, verbose=0, warm_start=False)
train score: 0.9999526907155529
dev score: 0.9484276729559749

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=0, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
train score: 0.8845416913069191
dev score: 0.8438155136268344

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=True, random_state=0, shrinking=True, tol=0.001,
  verbose=False)
train score: 0.8905263157894737
dev score: 0.8473794549266247

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=True, random_state=0, shrinking=True, tol=0.001,
  verbose=False)
train score: 0.9913897102306326
dev score: 0.8220125786163521

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
train score: 0.9455233589591958
dev score: 0.7771488469601677
model = GradientBoostingClassifier(random_state=0)
model.fit(X_train_dev, y_train_dev)

# Result
prediction = pd.Series(model.predict_proba(X_test)[:,1], index=X_test.index)
prediction.head()
1     0.005034
2     0.987573
5     0.987976
8     0.117562
14    0.053312
dtype: float64

2. New Connections Prediction

Predict future connections between employees of the network. The future connections information has been loaded into the variable future_connections. Each index entry is a tuple indicating a pair of nodes that currently do not have a connection, and the Future Connection column indicates whether an edge between those two nodes will exist in the future, where a value of 1.0 indicates a future connection.

future_connections = pd.read_csv('Future_Connections.csv', index_col=0, converters={0: eval})
future_connections.head()
            Future Connection
(6, 840)                  0.0
(4, 197)                  0.0
(620, 979)                0.0
(519, 872)                0.0
(382, 423)                0.0

Using network G and future_connections, identify the edges in future_connections with missing values and predict whether or not these edges will have a future connection.

# Feature Extraction

common_neighbors = [len(list(nx.common_neighbors(G, edge[0], edge[1]))) for edge in future_connections.index]
jaccard_coefficient = [item[2] for item in list(nx.jaccard_coefficient(G, ebunch=future_connections.index))]
resource_allocation_index = [item[2] for item in list(nx.resource_allocation_index(G, ebunch=future_connections.index))]
adamic_adar_index = [item[2] for item in list(nx.adamic_adar_index(G, ebunch=future_connections.index))]
preferential_attachment = [item[2] for item in list(nx.preferential_attachment(G, ebunch=future_connections.index))]

future_connections['Common Neighbors'] = common_neighbors
future_connections['Jaccard Coefficient'] = jaccard_coefficient
future_connections['Resource Allocation Index'] = resource_allocation_index
future_connections['Adamic Adar Index'] = adamic_adar_index
future_connections['Preferential Attachment'] = preferential_attachment

future_connections.head()
            Future Connection  Common Neighbors  Jaccard Coefficient  Resource Allocation Index  Adamic Adar Index  Preferential Attachment
(6, 840)                  0.0                 9             0.073770                   0.136721           2.110314                     2070
(4, 197)                  0.0                 2             0.015504                   0.008437           0.363528                     3552
(620, 979)                0.0                 0             0.000000                   0.000000           0.000000                       28
(519, 872)                0.0                 2             0.060606                   0.039726           0.507553                      299
(382, 423)                0.0                 0             0.000000                   0.000000           0.000000                      205
test_data = future_connections[future_connections['Future Connection'].isnull()]
train_data = future_connections.dropna()
X_train_dev = train_data.iloc[:, 1:]
y_train_dev = train_data.iloc[:, 0]
X_test = test_data.iloc[:, 1:]
X_train, X_dev, y_train, y_dev = train_test_split(X_train_dev, y_train_dev, random_state=0)
model_eval(RandomForestClassifier(random_state=0))
model_eval(GradientBoostingClassifier(random_state=0))
model_eval(LogisticRegression(random_state=0))
model_eval(KNeighborsClassifier())
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=0,
            verbose=0, warm_start=False)
train score: 0.9755590677929923
dev score: 0.8585516063855411

GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_split=1e-07, min_samples_leaf=1,
              min_samples_split=2, min_weight_fraction_leaf=0.0,
              n_estimators=100, presort='auto', random_state=0,
              subsample=1.0, verbose=0, warm_start=False)
train score: 0.910763887574075
dev score: 0.9104536718240157

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=0, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
train score: 0.9063953187829403
dev score: 0.9062667673923487

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
train score: 0.9400144420004987
dev score: 0.8588742866325493
model = GradientBoostingClassifier(random_state=0)
model.fit(X_train_dev, y_train_dev)

# Result
prediction = pd.Series(model.predict_proba(X_test)[:,1], index=X_test.index)
prediction.head()
(107, 348)    0.030342
(542, 751)    0.013050
(20, 426)     0.539323
(50, 989)     0.013050
(942, 986)    0.013050
dtype: float64
