Many real-world phenomena are described by pairwise proximity data, modeling interactions between the entities of a system. This is in contrast to the more common situation where each data sample is given as a feature vector. Even though clustering of proximity data may be performed directly on the data matrix, there are some advantages to embedding the data into a vector space. For example, it enables the use of standard preprocessing techniques such as denoising or dimensionality reduction. In this coding exercise, we will explore the technique called _Constant Shift Embedding_ for restating pairwise clustering problems in vector spaces [1] while preserving the cluster structure. We will apply the algorithm described in [1] to cluster groups of research community members based on the email correspondence matrix. The data and its description are given in [2].

### References

[1] [Optimal cluster preserving embedding of nonmetric proximity data](https://ieeexplore.ieee.org/document/1251147)

[2] [email-Eu-core network, SNAP large network dataset collection](https://snap.stanford.edu/data/email-Eu-core.html)

The number of nodes is hardcoded for simplicity (taken from [2]):

%% Cell type:code id: tags:

``` python

import numpy as np

NUM_NODES = 1005

```

%% Cell type:markdown id: tags:

We load the file which contains the list of interactions between the community members (nodes). Our data matrix represents an undirected graph which connects two nodes if there was at least one email sent between the two corresponding community members. Thus our data matrix is essentially an adjacency matrix.

%% Cell type:code id: tags:

``` python

# initialize the data matrix, which will be an adjacency matrix
DATA = np.zeros((NUM_NODES, NUM_NODES))

# fill out the symmetric adjacency matrix
with open("email-Eu-core.txt") as file:
    for line in file:
        pair = [int(x) for x in line.split()]
        DATA[pair[0], pair[1]] = 1
        DATA[pair[1], pair[0]] = 1

```

%% Cell type:markdown id: tags:

Note that DATA specifies the adjacency matrix of the email graph. It is not claimed to be a proper dissimilarity matrix, as required by the CSE algorithm. You are therefore allowed to perform any manipulations needed to construct a suitable (dis-)similarity matrix for the further analysis.
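One simple option (an assumption on our part, not prescribed by the exercise) is to treat the adjacency matrix as a binary similarity matrix and map connected pairs to distance 0 and unconnected pairs to distance 1:

%% Cell type:code id: tags:

``` python
import numpy as np

def adjacency_to_dissimilarity(S):
    """Turn a 0/1 adjacency (similarity) matrix into a dissimilarity matrix.

    Connected pairs get distance 0, unconnected pairs distance 1, and the
    diagonal is forced to 0 (a point is not dissimilar to itself).
    """
    D = 1.0 - S
    np.fill_diagonal(D, 0.0)
    return D

# tiny example: a 3-node path graph 0-1-2
S = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
D = adjacency_to_dissimilarity(S)
```

More refined choices (e.g. shortest-path distances on the graph) are equally valid; the point is only that the matrix fed to CSE should be symmetric with a zero diagonal.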

%% Cell type:markdown id: tags:

Next we define a class which contains the main functionality - TO BE IMPLEMENTED.

%% Cell type:code id: tags:

``` python
class ConstantShiftEmbedding:
    """Template class for Constant Shift Embedding (CSE)

    Attributes:
        PMAT (np.ndarray): Proximity matrix used for calculating the embeddings.
        S (np.ndarray): Similarity matrix.
        D (np.ndarray): Dissimilarity matrix.
    """

    def __init__(self):
        self.PMAT = None
        self.S = None
        self.D = None
        # Add/change parameters, if necessary.

    def fit(self, PMAT):
        """Calculate the similarity/dissimilarity matrix and all
        the variables necessary for calculating the embeddings.

        Args:
            PMAT (np.ndarray): proximity matrix
        """
        # Save data
        self.PMAT = PMAT
        ## IMPLEMENT THIS METHOD

    def get_embedded_vectors(self, p):
        """Return the embedded vectors.

        Args:
            p (int): cut-off value in the eigenspectrum

        Returns:
            Xp (np.ndarray): embedded vectors
        """
        ## IMPLEMENT THIS METHOD

```
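
%% Cell type:markdown id: tags:

For orientation, the core CSE computation from [1] can be sketched as a standalone function. This is a minimal sketch under the assumption that the input is a symmetric dissimilarity matrix with zero diagonal; the name `cse_embed` is ours and not part of the template above.

%% Cell type:code id: tags:

``` python
import numpy as np

def cse_embed(D, p):
    """Sketch of Constant Shift Embedding: return p-dimensional vectors
    whose pairwise squared distances equal D up to a constant off-diagonal shift."""
    n = D.shape[0]
    Q = np.eye(n) - np.ones((n, n)) / n     # centering projection
    Sc = -0.5 * Q @ D @ Q                   # centered similarity matrix
    eigvals, eigvecs = np.linalg.eigh(Sc)   # eigenvalues in ascending order
    shifted = eigvals - eigvals.min()       # constant shift makes the spectrum non-negative
    order = np.argsort(shifted)[::-1][:p]   # keep the p leading components
    return eigvecs[:, order] * np.sqrt(shifted[order])

# sanity check on a 1-D example: points 0, 1, 2 with squared distances
D_toy = np.array([[0., 1., 4.],
                  [1., 0., 1.],
                  [4., 1., 0.]])
X = cse_embed(D_toy, 3)
```

Shifting the eigenvalues of the centered similarity matrix by its most negative eigenvalue makes it positive semidefinite, so the leading p eigendirections yield a valid Euclidean embedding while leaving the cluster structure untouched.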

%% Cell type:markdown id: tags:

<h2 style="background-color:#f0b375;">

Section 4.0

<span style="font-size:50%"> Complete all problems in this section to get a pass on this exercise. </span>

</h2>

<p style="background-color:#adebad;">Describe briefly and concisely the model given in [1]. Explain the main steps of the _Constant Shift Embedding_ algorithm. See the <a href="https://medium.com/ibm-data-science-experience/markdown-for-jupyter-notebooks-cheatsheet-386c05aeebed">markdown cheatsheet</a> for text editing.</p>

- **Constant Shift Embedding**: (Put your answer here)

%% Cell type:markdown id: tags:

<p style="background-color:#adebad;">

Implement Constant Shift Embedding. We start off by making an instance of the corresponding class.

</p>

%% Cell type:code id: tags:

``` python

CSE = ConstantShiftEmbedding()

```

%% Cell type:markdown id: tags:

<p style="background-color:#adebad;">

Fit the data matrix. The _fit(...)_ method computes the necessary variables, which can later be used to produce the embeddings [1].

</p>

%% Cell type:code id: tags:

``` python

CSE.fit(DATA)

```

%% Cell type:markdown id: tags:

<h2 style="background-color:#f0b375;">

Section 4.5

<span style="font-size:50%"> Complete all problems in this and previous sections to get a grade of 4.5 </span>

</h2>

<p style="background-color:#adebad;">

Next, try to find an approximately optimal cut-off value $p = p^*$ which removes noise from the data. To do that, produce an eigen-spectrum plot as shown in figure 4a of [1] and briefly explain your choice of $p^*$.

</p>
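
A hedged sketch of what the spectrum computation might look like, shown on a random symmetric stand-in matrix since the `CSE` internals are left to you; the 5% threshold below is only an illustrative heuristic, not a rule from [1]:

%% Cell type:code id: tags:

``` python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50))
Sc = (A + A.T) / 2                       # stand-in for the centered similarity matrix

# sort eigenvalue magnitudes in descending order, as in figure 4a of [1]
spectrum = np.sort(np.abs(np.linalg.eigvalsh(Sc)))[::-1]

# crude elbow heuristic: keep components carrying at least 5% of the largest one
p_star = int(np.sum(spectrum >= 0.05 * spectrum[0]))
```

Plotting `spectrum` (e.g. with `matplotlib.pyplot.plot`) and looking for the point where the curve flattens is the usual way to justify the chosen $p^*$.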

%% Cell type:code id: tags:

``` python

## Compute eigen-spectrum

```

%% Cell type:code id: tags:

``` python

## Determine a good cut-off value (and write some lines to explain your choice)

p_opt = 1  ## change accordingly

print("Chosen cut-off value is: ", p_opt)

```

%% Cell type:code id: tags:

``` python

## Plot spectrum and indicate the cut-off value on the spectrum

```

%% Cell type:markdown id: tags:

<h2 style="background-color:#f0b375;">

Section 5.0

<span style="font-size:50%"> Complete all problems in this and previous sections to get a grade of 5.0 </span>

</h2>

<p style="background-color:#adebad;">

Plot the distance matrices for both the denoised ($p = p^*$ -- from the previous step) and the original versions, as shown in figure 5 in [1]. Note that the distance matrix is a matrix of pairwise distances between every two points in the dataset ($d_{ij} = dist(x_i, x_j)$).<br>

Perform the K-MEANS algorithm for a varying number of clusters K on the embedded vectors derived from CSE. You may use the sklearn implementation of K-MEANS. To make the aforementioned plots meaningful, sort the nodes according to their cluster assignments for every number of clusters K (see figure 5). For now, there is no need to include the actual ground-truth labels given in [2].

</p>
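
The sorting step can be sketched as follows; labels would normally come from `sklearn.cluster.KMeans`, but here we use a hand-made toy example so the sketch is self-contained:

%% Cell type:code id: tags:

``` python
import numpy as np

def sort_by_cluster(D, labels):
    """Permute a pairwise distance matrix so that rows/columns belonging
    to the same cluster are adjacent, making the block structure visible."""
    order = np.argsort(labels, kind="stable")
    return D[np.ix_(order, order)], order

# toy example: two well-separated 1-D clusters and their (assumed) labels
x = np.array([0.0, 10.0, 0.1, 10.1])
labels = np.array([0, 1, 0, 1])
D = np.abs(x[:, None] - x[None, :])

D_sorted, order = sort_by_cluster(D, labels)
```

After sorting, small distances concentrate in diagonal blocks (one per cluster), which is exactly the visual effect of figure 5 in [1].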

%% Cell type:code id: tags:

``` python

## Distance matrices

```

%% Cell type:markdown id: tags:

<h2 style="background-color:#f0b375;">

Section 5.5

<span style="font-size:50%"> Complete all problems in this and previous sections to get a grade of 5.5 </span>

</h2>

<p style="background-color:#adebad;">

Producing 2D and 3D embeddings allows us to visualize the generated clusters nicely. Now calculate the embeddings for $p = 2$ (2D case) and $p = 3$ (3D case) and plot the clusterings for a few values of K. Alternatively, you could use $p = p^*$ for gentler denoising, cluster the denoised embeddings, and only then apply a dimensionality reduction technique to get a plot in 2- or 3-dimensional space. You could use PCA, LLE, t-SNE, etc.; figure out what works for you. As an example, see figure 6(b) from [1], where CSE is combined with PCA.

</p>
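
For the second route, a minimal PCA projection via SVD can be sketched with plain NumPy (the function name `pca_project` is ours; `sklearn.decomposition.PCA` would work just as well):

%% Cell type:code id: tags:

``` python
import numpy as np

def pca_project(X, k):
    """Project the rows of X onto their top-k principal components (via SVD)."""
    Xc = X - X.mean(axis=0)                           # center the data
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False) # singular values in descending order
    return Xc @ Vt[:k].T

# stand-in for p-dimensional CSE embeddings
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 10))
X2 = pca_project(X, 2)   # 2-D coordinates suitable for a scatter plot
```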

%% Cell type:code id: tags:

``` python

## Get embeddings, run K-MEANS and generate plots

```

%% Cell type:code id: tags:

``` python

## p = 2

```

%% Cell type:code id: tags:

``` python

## p = 3

```

%% Cell type:code id: tags:

``` python

## choose p > 3, for example, p = p_opt, to compute CSE embeddings
## First, cluster the computed p-dimensional embeddings and then project them onto 2-dimensional space
## for visualization using PCA, LLE, t-SNE or something else

```

%% Cell type:markdown id: tags:

<h2 style="background-color:#f0b375;">

Section 6.0

<span style="font-size:50%"> Complete all problems in this and previous sections to get a grade of 6.0 </span>

</h2>

<p style="background-color:#adebad;">

Finally, to evaluate the quality of the clusters derived above, let's compare our predictions with the ground truth. We will use the actual member-department mappings given in [2]. You can reuse code from the previous coding exercises to align the cluster labels with the ground truth.

</p>

%% Cell type:code id: tags:

``` python

## Load the ground-truth department labels from [2] into AFFILIATIONS before running this cell

print("The true number of clusters (departments) is: ", len(np.unique(AFFILIATIONS)))

```

%% Cell type:markdown id: tags:

<p style="background-color:#adebad;">

Visually or quantitatively, show in a clever and convincing way that the K-MEANS-generated clusters overlap with the ground-truth clusters (member affiliations). How can we measure the overlap between the predicted and true clusters?

</p>
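
One simple quantitative option is cluster purity, sketched below with plain NumPy on a toy example (the labels are made up for illustration):

%% Cell type:code id: tags:

``` python
import numpy as np

def purity(pred, truth):
    """Fraction of points whose predicted cluster's majority true label
    matches their own true label (1.0 = perfect overlap)."""
    total = 0
    for c in np.unique(pred):
        members = truth[pred == c]                        # true labels within cluster c
        _, counts = np.unique(members, return_counts=True)
        total += counts.max()                             # size of the majority label
    return total / len(truth)

pred  = np.array([0, 0, 1, 1, 1])
truth = np.array([7, 7, 8, 8, 7])
score = purity(pred, truth)   # cluster 0 is pure, cluster 1 has one mismatch
```

Permutation-invariant scores such as `sklearn.metrics.adjusted_rand_score` or `normalized_mutual_info_score` are also common choices, since they do not require aligning cluster labels with the ground truth at all.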

%% Cell type:code id: tags:

``` python

## Here you can provide plots and calculations

```

%% Cell type:markdown id: tags:

Please write your explanations, observations, and thoughts about the results of the experiments above here.

%% Cell type:markdown id: tags:

## Comments

We hope you found this exercise instructive.

Feel free to leave comments below, we will read them carefully.