Instructions for Reproducing the Paper "Generating Semantic Annotations for Frequent Patterns with Context Analysis"
There are three folders, Datasets, PythonCodes and JupyterNoteBookDemo, in the repository. PythonCodes and JupyterNoteBookDemo contain the same scripts in different file formats (.py and .ipynb). Within each of these two script folders there are 6 subfolders that give the execution order for reproducing the paper with the DBLP dataset (paper Section 5.1). All datasets imported and generated by the scripts are located in the Datasets folder. The input and output paths of these dataset files need to be changed if you download them to your local computer. Except for the raw dataset "dblp50000.xml", all other datasets are generated by the scripts.
In addition, the final report, the link (https://mediaspace.illinois.edu/media/t/1_2uzja14v) to the video demo, and the PowerPoint slides for the paper review and project introduction are also in the CourseProject repository.
numpy, scipy, pandas, nltk, csv, os, mlxtend and pyspark are the libraries needed to run the scripts. Except for pyspark, all libraries can be downloaded and installed through pip or conda, depending on your preference and execution environment. The installation of pyspark (and Spark) is a bit more involved and requires some environment configuration plus a working Java installation of version 8 or above. Here is a link to a tutorial on how to install pyspark/Spark on Windows:
https://www.datacamp.com/community/tutorials/installation-of-pyspark
Download "dblp50000.xml" and run script "DBLP_raw_data_parsing.py" or "DBLP_raw_data_parsing.ipynb". This generates the dataset "DBLP2000.csv".
Run script "Author_FP_mining.py" or "Author_FP_mining.ipynb" to import "DBLP2000.csv" and generate the dataset "authorsFP2000_with_index.csv", which contains 14 closed FPs of authors and their transaction index in "DBLP2000.csv" (e.g. author "Edwin R. Hancock" is a closed FP and it showed in the 839th, 1119th, 1127th, 1204th and 1576th row of DBLP2000, its transaction index list is [839, 1119, 1127, 1204, 1576] ).
Run script "DBPL_preprocessing_titles.py" or "DBPL_preprocessing_titles.ipynb" to import "DBLP2000.csv" and generate the dataset "DBLP2000_preprocessed_titles.txt". In this step, stop words are removed and the titles are stemmed.
Run "titles_seqPattern_mining.py" to import "DBLP2000_preprocessed_titles.txt" and to find closed sequential frequent patterns from titles of DBLP2000. I had issue with configuration of Spark in Jupyter Notebook environment therefore no corresponding script in "ipynb" format was put in the "JupyterNoteBookDemo\ContextModeling" directory . However, the python script was executed successfully in the windows cmd of my laptop. The script "titles_seqPattern_mining.py" generates an output folder containing the pattern file "part-00000". Set the "part-00000" file to txt format.
Run "Title_sequentialFP_processing.py" or "Title_sequentialFP_processing.ipynb" to import "part-00000.txt" and generate the cleaned dataset "titlesFP2000.csv".
Run "Find_transaction_index_of_title_FPs.py" or "Find_transaction_index_of_title_FPs.ipynb" to import "titlesFP2000.csv" and generate "titlesFP2000_with_index.csv", which adds the list of transaction index to each title pattern.
Run "Hierarchical_clustering_titleFPs2000.py" or "Hierarchical_clustering_titleFPs2000.ipynb" to import "titlesFP2000_with_index.csv" and generate "titlesFP2000_final.csv". This script apply the hierarchical clustering with Jaccard Distance defined per paper and clusters 1912 title sequential patterns into 166 clusters. It chooses the most frequent pattern in each cluster as the "centroid" pattern to further build the context unit space.
Run "DBLP2000_context_units_with_transaction_index.py" or "DBLP2000_context_units_with_transaction_index.ipynb" to import 'authorsFP2000_with_index.csv' and 'titlesFP2000_final.csv' and to generate the final context units dataset "DBLP2000_context_units.csv".
Run "Weighting_function.py" or "Weighting_function.ipynb" to import ""DBPL2000_context_units.csv" and "DBLP2000.csv" and generate "Context_units_weights.csv". This script generates context vectors for all context units defined in 2.2 and builds a weight matrix between the pairwised context FPs. Each element of the matrix is the Mutual Information score between the context unit pair per definition in the paper.
Run "Defining_pattern_with_context_units.py" or "Defining_pattern_with_context_units.ipynb" to import "Context_units_weights.csv". In this step, we first pick an author from the author FPs and rank the weights of its context vector. The context units with top 5 weights are selected as the definition of this author and are saved as "author_annotation_example1.csv".
Similarly, we pick a title from the title FPs and rank the weights of its context vector, and save the context units with top 5 weights as "title_annotation_example1.csv".
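A minimal sketch of this selection, assuming "Context_units_weights.csv" is a square matrix with the context units as both the row index and the columns (the actual layout written by the script may differ):

```python
# Sketch only: take the 5 highest-weighted context units for a chosen pattern.
import pandas as pd

weights = pd.read_csv("Context_units_weights.csv", index_col=0)
pattern = "Edwin R. Hancock"          # the example author FP mentioned above

annotation = weights.loc[pattern].drop(pattern).nlargest(5)
annotation.to_csv("author_annotation_example1.csv")
```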
Run script "Find_representative_titles_to_pattern.py" or "Find_representative_titles_to_pattern.ipynb" to import "DBLP2000_context_units.csv", "DBLP2000.csv" and "Context_units_weights.csv". This script first generates the weight matrix of transactions (titles) in the context units space as the dataset "transaction_weights.csv", and then computes the cosine similarity between the transaction weight vectors and the given pattern weight vector (e.g. the same author and title chosen in 2.3.2). The similarity matrix of transaction to author FPs is saved as "similarity_scores_of_transaction_to_author.csv", and the similarity matrix of transaction to title FPs is saved as "similarity_scores_of_transaction_to_title.csv".
This script then extracts the top 5 representative titles with the highest similarity scores to the given author pattern and to the given title pattern, saved as the datasets "rep_titles_author_example1.csv" and "rep_titles_title_example1.csv", respectively.
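A minimal sketch of the similarity ranking with NumPy, assuming the transaction weight matrix has transactions as rows and context units as columns, matching the columns of the pattern weight matrix (both layouts are assumptions):

```python
# Sketch only: rank transactions (titles) by cosine similarity to a pattern's
# context weight vector and take the 5 most representative ones.
import numpy as np
import pandas as pd

trans_w = pd.read_csv("transaction_weights.csv", index_col=0)    # transactions x units
unit_w = pd.read_csv("Context_units_weights.csv", index_col=0)   # units x units

pattern_vec = unit_w.loc["Edwin R. Hancock"].to_numpy()
T = trans_w.to_numpy()

cosine = T @ pattern_vec / (np.linalg.norm(T, axis=1) * np.linalg.norm(pattern_vec) + 1e-12)
top5 = np.argsort(cosine)[::-1][:5]   # row positions of the 5 most similar titles
print(trans_w.index[top5])
```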
Run "Find_synonyms_of_pattern.py" or "Find_synonyms_of_pattern.ipynb" to import "Context_units_weights.csv" and to compute the cosine similarity between the candidate patterns of similarity (e.g. all closed frequent patterns of authors) and the given pattern (e.g. the same author and title chosen in 2.3.2). Select the authors with the highest 5 similarity scores as the synonyms of the given author or title other than the author or title itself. This script generates 2 datasets for synonyms of author pattern: "coauthor_to_author_example1.csv", "syn_titles_to_author_example1.csv", and 2 datasets for synonyms of title pattern: "syn_titles_to_title_example1.csv" and "syn_authors_to_title_example1.csv".
Finally, run "Author_context_annotation_example1.py" (or "Author_context_annotation_example1.ipynb") and "Title_context_annotation_example1.py" (or "Title_context_annotation_example1.ipynb") to combine the output datasets generated in steps 2.4, 2.5 and 2.6. These scripts build the two examples of context annotation for the given author pattern and the given title pattern, respectively, and fulfill the two experiments in paper Section 5.1.