{ "cells": [ { "cell_type": "markdown", "id": "636885cc", "metadata": {}, "source": [ "# Week 05 - Nearest Neighbor and Locality-Sensitive Hashing\n", "**Computational Tools for Data Science - Fall, 2022**" ] }, { "cell_type": "code", "execution_count": 1, "id": "6c6e1237", "metadata": {}, "outputs": [], "source": [ "import numpy as np" ] }, { "cell_type": "markdown", "id": "91ea03c5", "metadata": {}, "source": [ "## Exercise 2 - `shingle` function\n", "\n", "**Implement a function `shingle` that take an integer `q` and a string and produces a list of shingles,\n", "where each shingle is a list of `q` words.**\n", "\n", "HINT: Here, the input string (a document in the case of the test files) would\n", "consist of separate words. The $q$-shingles for this string would be all of the sequences\n", "of $q$ consecutive words in the string. \n", "\n", "For example, if the string was\n", "`‘i want to go home do you want to’` and $q = 2$, then the q-shingles would be\n", "`{i want, want to, to go, go home, home do, do you, you want}`. \n", "\n", "Note that we do not include\n", "`‘want to’` twice because we only care about the set of shingles, we do not care about the\n", "order. Of course it will probably be output as a list which does have an order but there should\n", "not be repeats. \n", "\n", "For more details, see Section 3.2.1 and 3.2.2 in the online version of the book. Note that in the\n", "book they use sequences of $q$ characters rather than words, but the principle is the same.\n", "\n", "Extract from the book, to recall concepts (section 3.2.1):\n", "\n", "> A document is a string of characters. Define a k-shingle for a document to be\n", "any substring of length k found within the document. Then, we may associate\n", "with each document the set of k-shingles that appear one or more times within\n", "that document.\n", "\n", "> Example 3.3 : Suppose our document D is the string abcdabd, and we pick\n", "k = 2. Then the set of 2-shingles for D is {ab, bc, cd, da, bd}.\n", "Note that the substring ab appears twice within D, but appears only once\n", "as a shingle. A variation of shingling produces a bag, rather than a set, so each\n", "shingle would appear in the result as many times as it appears in the document.\n", "However, we shall not use bags of shingles here.\n", "\n", "> There are several options regarding how white space (blank, tab, newline,\n", "etc.) is treated. It probably makes sense to replace any sequence of one or more\n", "white-space characters by a single blank. That way, we distinguish shingles that\n", "cover two or more words from those that do not.\n", "\n", "> Example 3.4 : If we use k = 9, but eliminate whitespace altogether, then we\n", "would see some lexical similarity in the sentences “The plane was ready for\n", "touch down”. and “The quarterback scored a touchdown”. However, if we\n", "retain the blanks, then the first has shingles touch dow and ouch down, while\n", "the second has touchdown. If we eliminated the blanks, then both would have\n", "touchdown." 
] }, { "cell_type": "code", "execution_count": 2, "id": "a2e27e92", "metadata": {}, "outputs": [], "source": [ "def shingle(aString, q, delimiter=' '):\n", " \"\"\"\n", " Input:\n", " - aString (str): string to split into shingles\n", " - q (int)\n", " - delimiter (str): string of the delimiter to consider to split the input string (default: space)\n", " Return: list of unique shingles\n", " \"\"\"\n", " all_shingles = []\n", " if delimiter != '':\n", " words_list = aString.split(delimiter)\n", " else:\n", " words_list = aString\n", " for i in range (len(words_list)-q+1):\n", " all_shingles.append(delimiter.join(words_list[i:i+q]))\n", " return list(set(all_shingles))" ] }, { "cell_type": "code", "execution_count": 11, "id": "96697989", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Initial string: abcdabd\n", ">> Shingles with q = 2 : ['bc', 'cd', 'ab', 'da', 'bd']\n", "\n", "Initial string: i want to go home do you want to\n", ">> Shingles with q = 2 : ['home do', 'you want', 'do you', 'go home', 'to go', 'i want', 'want to']\n" ] } ], "source": [ "# Example from the Book\n", "ex_string, q = 'abcdabd', 2\n", "ex_shingles = shingle(ex_string, q, delimiter='')\n", "print('Initial string:', ex_string)\n", "print(f'>> Shingles with q = {q} :',ex_shingles)\n", "\n", "# Example from the HINT\n", "ex_string, q = 'i want to go home do you want to', 2\n", "ex_shingles = shingle(ex_string, q)\n", "assert len(ex_shingles) == 7\n", "print('\\nInitial string:', ex_string)\n", "print(f'>> Shingles with q = {q} :',ex_shingles)" ] }, { "cell_type": "markdown", "id": "ba9080df", "metadata": {}, "source": [ "## Exercise 3 - Minhashing\n", "\n", "**3.1. Implement a minhash algorithm `minhash` that takes a list of shingles and a seed for the hash function\n", "mapping the shingles, and outputs the minhash. Feel free to use the `listhash` function in the template.**\n", "\n", "References to the Book:\n", "- See Chapter 3, Section 3.2.3 for more details about hashing shingles.\n", "- See Chapter 3, Section 3.3.2 for more details about minhashing.\n", "- See Chapter 3, Section 3.3.3\n", "\n", "HINT: Given a hash function $h$ that maps to integers, the minhash $h_{\\text{min}}$ of a set of\n", "elements $S$ is the minimum value that the function $h$ takes on the set $S$, i.e., $h_{\\text{min}}(S) = \\min \\{ h(S) : s \\in S \\}$. In this case the exercise suggests to use the `listhash` function provided. So the minhash function should take a list of shingles (and a seed for `listhash`) and compute the minhash for this list, i.e., it should compute\n", "`listhash` of every shingle in the list and return the minimum value." 
] }, { "cell_type": "raw", "id": "fae4c92f", "metadata": {}, "source": [ "!pip install mmh3" ] }, { "cell_type": "code", "execution_count": 12, "id": "5c724c72", "metadata": {}, "outputs": [], "source": [ "from similarity import listhash" ] }, { "cell_type": "code", "execution_count": 13, "id": "f5363ab6", "metadata": {}, "outputs": [], "source": [ "def minhash(shingles_list, seed):\n", " \"\"\"\n", " Input:\n", " - shingles_list (list of str): set of hashes\n", " - seed (int): seed for listhash function\n", " Return: minhash of given shingles\n", " \"\"\"\n", " minhash_value = None\n", " for aShingle in shingles_list:\n", " hashcode = listhash([aShingle], seed)\n", " if minhash_value == None or hashcode < minhash_value:\n", " minhash_value = hashcode\n", " return minhash_value" ] }, { "cell_type": "code", "execution_count": 16, "id": "e0d92a10", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MinHash of ['home do', 'you want', 'do you', 'go home', 'to go', 'i want', 'want to']: -2058595977\n" ] } ], "source": [ "print(f'MinHash of {ex_shingles}:', minhash(ex_shingles, 42))" ] }, { "cell_type": "markdown", "id": "649c1370", "metadata": {}, "source": [ "**3.2. Extend the minhash algorithm to output k different minhashes in an array. Use different seeds for each\n", "minhash, e.g., 1, . . . , k.**\n", "\n", "HINT: For the second part, just add $k$ as input and return a sequence of $k$ minhashes, each using a different seed for the hash function. Note that we care about the order of the minhashes here, because we will be comparing the values of the i-th minhash of a given list of shingles with the i-th minhash of another list of shingles." ] }, { "cell_type": "code", "execution_count": 17, "id": "c5f883aa", "metadata": {}, "outputs": [], "source": [ "def minhash2(shingles_list, k):\n", " \"\"\"\n", " Input:\n", " - shingles_list (list of str): set of hashes\n", " - k (int): seed for listhash function\n", " Return: sequence of k minhashes\n", " \"\"\"\n", " all_minhash = []\n", " for i in range(k):\n", " all_minhash.append(minhash(shingles_list, i))\n", " return all_minhash" ] }, { "cell_type": "code", "execution_count": 23, "id": "77f900b6", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MinHash of ['home do', 'you want', 'do you', 'go home', 'to go', 'i want', 'want to'] with k = 10:\n", " [-1444075960, -1361643506, -1787085820, -773815373, -1475701122, -1751671059, -1905760484, -1765166633, -2127236308, -1658139371]\n" ] } ], "source": [ "k=10\n", "print(f'MinHash of {ex_shingles} with k = {k}:\\n', minhash2(ex_shingles, k))" ] }, { "cell_type": "markdown", "id": "68f36e4d", "metadata": {}, "source": [ "## Exercise 4 - Signatures\n", "\n", "**Construct a function `signatures` that takes the `docs` dictionary and outputs a new dictionary\n", "consisting of document id’s as keys and signatures as values.**\n", "\n", "HINT: This should take each document and convert it to a list of $q$-shingles (so $q$ should\n", "be an input to the function) and then use the function from 3.2 to convert that list of shingles\n", "for that document to a sequence of $k$ minhash values (so $k$ will also be an input) which is\n", "known as the signature of that document. See Section 3.3.4 (page 83) for info on minhash\n", "signatures.\n", "\n", "Note that it is not necessarily most efficient to simply compute the signature for each document\n", "separately. 
"Note that computing the signature for each document separately is not necessarily the most efficient\n", "approach. In the book, Section 3.3.5 (page 84) gives an algorithm for computing\n", "the signatures of a collection of sets/documents simultaneously. Let $U$ be the set of all shingles produced from a given set $D$\n", "of documents and let shingle($d$) denote the list of shingles produced from a given document\n", "$d \in D$ (assume that these have already been computed prior to running the code below).\n", "Suppose we are creating signatures using minhashes from hash functions $h_1, ..., h_k$. The following\n", "creates a matrix SIG with rows indexed by $1, ..., k$ and columns indexed by $D$, such\n", "that the $d$-th column is the signature of document $d$.\n", "\n", "![pseudocode](https://www.zupimages.net/up/22/44/gcp6.png)" ] }, { "cell_type": "code", "execution_count": 24, "id": "3b239bf7", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "read document calltounconv00baxt.txt\n", "read document gospeltruth00whit.txt\n", "read document lifeofrevrichard00baxt.txt\n", "read document memoirjamesbrai00ricegoog.txt\n", "read document practicalthought00nev.txt\n", "read document remember00palm.txt\n", "read document remembermeorholy00palm.txt\n", "read document thoughtsonpopery00nevi.txt\n" ] } ], "source": [ "import os\n", "\n", "def clean_text(aString):\n", "    # drop newlines, keep only alphanumeric characters within words,\n", "    # lowercase everything, and normalize whitespace to single blanks\n", "    output = aString.replace('\\n', '')\n", "    output_list = output.split()\n", "    output_list = [''.join(ch for ch in aWord if ch.isalnum()) for aWord in output_list]\n", "    output_list = [s.lower() for s in output_list]\n", "    output = ' '.join(output_list)\n", "    return \" \".join(output.split())\n", "\n", "q = 3    # length of shingle\n", "k = 100  # number of minhashes\n", "docs = {}  # dictionary mapping document id to document contents\n", "\n", "# read data sets\n", "srcfolder = os.path.abspath('.')\n", "datafolder = os.path.join(srcfolder, \"ats_corpus_small\")  # change to ats_corpus for large data set\n", "\n", "for file in os.listdir(datafolder):\n", "    filepath = os.path.join(datafolder, file)\n", "    with open(filepath, 'r') as f:\n", "        docs[file] = clean_text(f.read())\n", "    print(\"read document \" + file)" ] }, { "cell_type": "code", "execution_count": 26, "id": "1503ca35", "metadata": {}, "outputs": [], "source": [ "def signatures(dict_docs, q=q, num_hashes=k):\n", "    \"\"\"\n", "    Input:\n", "    - dict_docs (dict of str:str): dictionary of {title: document}\n", "    - q (int): shingle length\n", "    - num_hashes (int): number of minhashes per signature\n", "    Return: dictionary with document ids as keys and signatures as values\n", "    \"\"\"\n", "    dict_signatures = {}\n", "    total_texts = len(dict_docs)\n", "    for counter, (key, text) in enumerate(dict_docs.items(), start=1):\n", "        print(f'{counter}/{total_texts} - {key} - Processing...')\n", "        doc_shingles = shingle(text, q)\n", "        dict_signatures[key] = minhash2(doc_shingles, num_hashes)\n", "    return dict_signatures" ] }, { "cell_type": "code", "execution_count": 27, "id": "cb1cf106", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1/8 - calltounconv00baxt.txt - Processing...\n", "2/8 - gospeltruth00whit.txt - Processing...\n", "3/8 - lifeofrevrichard00baxt.txt - Processing...\n", "4/8 - memoirjamesbrai00ricegoog.txt - Processing...\n", "5/8 - practicalthought00nev.txt - Processing...\n", "6/8 - remember00palm.txt - Processing...\n", "7/8 - remembermeorholy00palm.txt - Processing...\n", "8/8 - thoughtsonpopery00nevi.txt - Processing...\n" ] } ], "source": [
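"# build the {document id: signature} dictionary for the whole corpus\n",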
"dict_signatures_docs = signature(docs)" ] }, { "cell_type": "markdown", "id": "30443cba", "metadata": {}, "source": [ "## Exercise 5 - Jaccard Similarity\n", "\n", "**Implement a function `jaccard` that takes two document names and outputs the estimated\n", "Jaccard similarity using signatures.**\n", "\n", "HINT: Given two sets $S$ and $T$, the Jaccard similarity of $S$ and $T$ is equal to $\\frac{|S ∩\n", "T|}{|S ∪ T|}$, i.e., the size of the intersection divided by the size of the union of the two sets\n", "(see Section 3.1.1, page 74). For a random *injective* hash function $h$, the probability that\n", "$h_{\\text{min}}(S) = h_{\\text{min}}(T)$ is equal to the Jaccard similarity of $S$ and $T$ (see Section 3.3.3, page 83).\n", "The hash functions used in practice won’t necessarily be injective, but assuming there are not\n", "many collisions we can still use this as an estimation of Jaccard similarity. So this function\n", "should take two documents, convert each to a list of $q$-shingles (so $q$ should be an input) and\n", "then use the function from 3.2 to convert these into signatures consisting of $k$ minhash values\n", "(so $k$ will be an input). Then it estimates the Jaccard similarity of the two documents by\n", "counting how many positions of the two signatures agree and then dividing that by $k$." ] }, { "cell_type": "code", "execution_count": 36, "id": "1669606d", "metadata": {}, "outputs": [], "source": [ "def jaccard(name1, name2, signatures_dict):\n", " \"\"\"\n", " Input:\n", " - name1 (str): key of the first document S\n", " - name2 (str): key of the second document T\n", " - signatures_dict (dict of str:list): dictionary of signatures\n", " Return: Jaccard similarity between S and T\n", " \"\"\"\n", " signatures_doc1 = np.array(signatures_dict[name1])\n", " signatures_doc2 = np.array(signatures_dict[name2])\n", " return len(np.intersect1d(signatures_doc1, signatures_doc2))/len(np.union1d(signatures_doc1, signatures_doc2))" ] }, { "cell_type": "code", "execution_count": 38, "id": "6e66c026", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Jaccard similarity between calltounconv00baxt.txt and gospeltruth00whit.txt: 0.005025125628140704\n" ] } ], "source": [ "first_doc_key = list(docs.keys())[0]\n", "second_doc_key = list(docs.keys())[1]\n", "print(f'Jaccard similarity between {first_doc_key} and {second_doc_key}:', jaccard(first_doc_key, second_doc_key, dict_signatures_docs))" ] }, { "cell_type": "markdown", "id": "4e09aebf", "metadata": {}, "source": [ "## Exercise 6 - Find Similar Items\n", "\n", "**Implement a function `similar` that finds all pairs of documents whose estimated Jaccard\n", "similarity is ≥ 0.6. Test your program for different values of $k$ and $q$. Compare your results for most similar\n", "documents with your own visual impression of the similarity of files.**\n", "\n", "HINT: This just combines techniques from previous questions. You can use the function\n", "from Exercise 4 to convert all the documents to signatures, and then write a function that\n", "computes Jaccard similarity based on signatures (essentially the second half of the function\n", "from Exercise 5). Then compute/estimate the Jaccard similarity for all pairs of documents\n", "and return those with Jaccard similarity at least 0.6." 
] }, { "cell_type": "code", "execution_count": 39, "id": "8545cf46", "metadata": {}, "outputs": [], "source": [ "def similar(signatures_dict, jaccard_threshold=0.6):\n", " \"\"\"\n", " Input:\n", " - signatures_dict (dict of str:list): dictionary of signatures\n", " - jaccard_threshold (float): lower bound for Jaccard similarity score to consider\n", " two documents as similar\n", " Return: dictionary of similar items\n", " \"\"\"\n", " list_keys = list(signatures_dict.keys())\n", " similar_items = {}\n", " for i in range (len(list_keys)-1):\n", " for j in range (i+1, len(list_keys)):\n", " similarity_score = jaccard(list_keys[i], list_keys[j], signatures_dict)\n", " if similarity_score >= jaccard_threshold:\n", " similar_items[(list_keys[i], list_keys[j])] = similarity_score\n", " return similar_items" ] }, { "cell_type": "code", "execution_count": 50, "id": "673fd941", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Found similar items:\n", " {('remember00palm.txt', 'remembermeorholy00palm.txt'): 0.6129032258064516}\n" ] } ], "source": [ "found_similar_items = similar(dict_signatures_docs)\n", "print('Found similar items:\\n', found_similar_items)" ] }, { "cell_type": "markdown", "id": "c74d99af", "metadata": {}, "source": [ "## Exercise 7 - Locality-Sensitive Hashing\n", "\n", "**Use locality-sensitive hashing to speed up your solution to the find similar item\n", "exercise.**\n", "\n", "HINT: Here, we take our signatures for all of our documents and break them up into $b$\n", "blocks of length $r$ (so $k = br$ where k is the length of the signatures). For each block, we apply\n", "a hash function to that part of the signature for each of the documents, and if two documents\n", "get hashed to the same value for at least one of the blocks, then they will be considered a\n", "candidate for being similar. We then compute the Jaccard similarity of only the candidate\n", "pairs using their full signatures. Note that the values of $b$ and $r$ need to be chosen correctly so\n", "that potential similar pairs are correctly identified. If one wants to identify pairs with Jaccard\n", "similarity at least some value $s$, then one should choose $b$ and $r$ such that $(\\frac{1}{b})^{\\frac{1}{r}} ≈ s$. See\n", "Section 3.4. (page 91) in the book." 
] }, { "cell_type": "code", "execution_count": 45, "id": "b61194fe", "metadata": {}, "outputs": [], "source": [ "import mmh3" ] }, { "cell_type": "code", "execution_count": 46, "id": "556d5a2a", "metadata": {}, "outputs": [], "source": [ "b,r = 20, 5\n", "assert k == b*r\n", "\n", "def lsh(signatures_dict, jaccard_threshold=0.6, seed=42):\n", " lsh_dict = {}\n", " for key, values in signatures_dict.items():\n", " blocks = np.split(np.array(values), b)\n", " blocks_hash_values = []\n", " for aBlock in blocks:\n", " blocks_hash_values.append(mmh3.hash(aBlock, seed))\n", " lsh_dict[key] = blocks_hash_values\n", " list_keys = list(lsh_dict.keys())\n", " similar_items = {}\n", " for i in range (len(list_keys)-1):\n", " for j in range (i+1, len(list_keys)):\n", " common_values = np.intersect1d(lsh_dict[list_keys[i]], lsh_dict[list_keys[j]])\n", " if len(common_values) > 0:\n", " # we found a candidate\n", " similarity_score = jaccard(list_keys[i], list_keys[j], signatures_dict)\n", " if similarity_score >= jaccard_threshold:\n", " similar_items[(list_keys[i], list_keys[j])] = similarity_score\n", " return similar_items" ] }, { "cell_type": "code", "execution_count": 49, "id": "1cc805f6", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Found similar items with LSH:\n", " {('remember00palm.txt', 'remembermeorholy00palm.txt'): 0.6129032258064516}\n" ] } ], "source": [ "found_similar_items_with_lsh = lsh(dict_signatures_docs)\n", "print('Found similar items with LSH:\\n', found_similar_items_with_lsh)" ] }, { "cell_type": "markdown", "id": "c797186f", "metadata": {}, "source": [ "# THE END." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" } }, "nbformat": 4, "nbformat_minor": 5 }