
Instructions for Reproducing the Paper "Generating Semantic Annotations for Frequent Patterns with Context Analysis"

1. Overview of the GitHub CourseProject repository

There are three folders, Datasets, PythonCodes and JupyterNoteBookDemo, in the repository. PythonCodes and JupyterNoteBookDemo contain the same scripts in different file formats (.py and .ipynb). Within each of these two script folders there are 6 subfolders, numbered in the execution order for reproducing the paper's DBLP experiments (paper section 5.1). All datasets imported and generated by the scripts are located in the Datasets folder. The input and output paths in the scripts need to be changed to point at your local copies of these dataset files. Except for the raw dataset "dblp50000.xml", all other datasets are generated by the scripts.

In addition, the final report, the link (https://mediaspace.illinois.edu/media/t/1_2uzja14v) to the video demo, and the PowerPoint slides of the paper review and project introduction are also in the CourseProject repository.

2. Python Libraries and Packages

numpy, scipy, pandas, nltk, mlxtend and pyspark are the libraries needed to run the scripts (csv and os come with the Python standard library). Except for pyspark, all of these can be installed through pip or conda, depending on your preference and execution environment. Installing pyspark (and Spark) is a bit more involved: it requires some environment configuration and a working Java installation (version 8 or above). Here is a tutorial on installing pyspark/Spark on Windows, followed by a small sanity-check snippet:

https://www.datacamp.com/community/tutorials/installation-of-pyspark
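If you want to confirm that the environment is ready before running the scripts, a minimal sanity check (my own sketch, not part of the repository) is to import the libraries and start a local Spark session, which will fail if Java or the Spark environment is not configured correctly:

    # Sanity check: verify the libraries import and that PySpark can start locally.
    import numpy, scipy, pandas, nltk, mlxtend
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("sanity-check").getOrCreate()
    print("Spark version:", spark.version)   # fails here if Java/Spark is misconfigured
    spark.stop()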

3. Script running instructions

3.1. Parse the raw data (Folder 1. RawDataParsing)

Download "dblp50000.xml" and run script "DBLP_raw_data_parsing.py" or "DBLP_raw_data_parsing.ipynb". This generates the dataset "DBLP2000.csv".
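Conceptually, the parsing step walks the DBLP XML records and writes one CSV row per paper with its authors and title. The sketch below is illustrative only and is not the repository script; the element and column names ("article", "author", "title", "authors") and the 2000-paper cutoff are assumptions:

    import csv
    import xml.etree.ElementTree as ET

    # Walk the DBLP XML and collect (authors, title) per paper.
    # Element/column names and the 2000-record cutoff are assumptions.
    tree = ET.parse("dblp50000.xml")
    rows = []
    for article in tree.getroot().iter("article"):
        title = article.findtext("title", default="")
        authors = [a.text for a in article.findall("author") if a.text]
        if title and authors:
            rows.append({"authors": ";".join(authors), "title": title})

    with open("DBLP2000.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["authors", "title"])
        writer.writeheader()
        writer.writerows(rows[:2000])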

3.2. Build the Context Units Space (Folder 2. ContextModeling)

3.2.1. Find closed Frequent Patterns (FPs) for authors using the FP-growth algorithm in the mlxtend library

Run script "Author_FP_mining.py" or "Author_FP_mining.ipynb" to import "DBLP2000.csv" and generate the dataset "authorsFP2000_with_index.csv", which contains 14 closed FPs of authors and their transaction index in "DBLP2000.csv" (e.g. author "Edwin R. Hancock" is a closed FP and it showed in the 839th, 1119th, 1127th, 1204th and 1576th row of DBLP2000, its transaction index list is [839, 1119, 1127, 1204, 1576] ).

3.2.2. Preprocess DBLP titles

Run script "DBPL_preprocessing_titles.py" or "DBPL_preprocessing_titles.ipynb" to import "DBLP2000.csv" and generate the dataset "DBLP2000_preprocessed_titles.txt". In this step, stop words are removed and the titles are stemmed.

3.2.3. Find title sequential patterns using the PrefixSpan algorithm in PySpark

Run "titles_seqPattern_mining.py" to import "DBLP2000_preprocessed_titles.txt" and to find closed sequential frequent patterns from titles of DBLP2000. I had issue with configuration of Spark in Jupyter Notebook environment therefore no corresponding script in "ipynb" format was put in the "JupyterNoteBookDemo\ContextModeling" directory . However, the python script was executed successfully in the windows cmd of my laptop. The script "titles_seqPattern_mining.py" generates an output folder containing the pattern file "part-00000". Set the "part-00000" file to txt format.

Run "Title_sequentialFP_processing.py" or "Title_sequentialFP_processing.ipynb" to import "part-00000.txt" and generate the cleaned dataset "titlesFP2000.csv".

3.2.4. Find transaction index of title sequential patterns

Run "Find_transaction_index_of_title_FPs.py" or "Find_transaction_index_of_title_FPs.ipynb" to import "titlesFP2000.csv" and generate "titlesFP2000_with_index.csv", which adds the list of transaction index to each title pattern.

3.2.5. Reduce title FP redundancy by microclustering (hierarchical clustering)

Run "Hierarchical_clustering_titleFPs2000.py" or "Hierarchical_clustering_titleFPs2000.ipynb" to import "titlesFP2000_with_index.csv" and generate "titlesFP2000_final.csv". This script apply the hierarchical clustering with Jaccard Distance defined per paper and clusters 1912 title sequential patterns into 166 clusters. It chooses the most frequent pattern in each cluster as the "centroid" pattern to further build the context unit space.

3.2.6. Combine author FPs and title FPs to build the context units space

Run "DBLP2000_context_units_with_transaction_index.py" or "DBLP2000_context_units_with_transaction_index.ipynb" to import 'authorsFP2000_with_index.csv' and 'titlesFP2000_final.csv' and to generate the final context units dataset "DBLP2000_context_units.csv".

3.3. Define given frequent patterns using context units defined above (Folder 3. PatternDefinition)

3.3.1. Build weight vectors of FPs in the context unit space

Run "Weighting_function.py" or "Weighting_function.ipynb" to import ""DBPL2000_context_units.csv" and "DBLP2000.csv" and generate "Context_units_weights.csv". This script generates context vectors for all context units defined in 2.2 and builds a weight matrix between the pairwised context FPs. Each element of the matrix is the Mutual Information score between the context unit pair per definition in the paper.

3.3.2. Annotate the given FP (e.g. an author) by the context units with the highest weights

Run "Defining_pattern_with_context_units.py" or "Defining_pattern_with_context_units.ipynb" to import "Context_units_weights.csv". In this step, we first pick an author from the author FPs and rank the weights of its context vector. The context units with top 5 weights are selected as the definition of this author and are saved as "author_annotation_example1.csv".

Similarly, we pick a title from the title FPs and rank the weights of its context vector, and save the context units with top 5 weights as "title_annotation_example1.csv".
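Both selections amount to ranking one row of the weight matrix and keeping the 5 strongest context units. An illustrative sketch (the chosen pattern index and the weight-matrix layout are assumptions):

    import pandas as pd

    units = pd.read_csv("DBLP2000_context_units.csv")
    W = pd.read_csv("Context_units_weights.csv")     # square matrix, one row/column per unit

    i = 0                                            # index of the chosen author pattern (illustrative)
    top5 = W.iloc[i].drop(W.columns[i]).nlargest(5)  # drop the pattern itself, keep top 5 weights
    annotation = units.iloc[[int(c) for c in top5.index]]
    annotation.to_csv("author_annotation_example1.csv", index=False)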

3.4. Find representative titles of the given pattern (Folder 4. RepresentativeTitles2Pattern)

Run script "Find_representative_titles_to_pattern.py" or "Find_representative_titles_to_pattern.ipynb" to import "DBLP2000_context_units.csv", "DBLP2000.csv" and "Context_units_weights.csv". This script first generates the weight matrix of transactions (titles) in the context units space as the dataset "transaction_weights.csv", and then computes the cosine similarity between the transaction weight vectors and the given pattern weight vector (e.g. the same author and title chosen in 2.3.2). The similarity matrix of transaction to author FPs is saved as "similarity_scores_of_transaction_to_author.csv", and the similarity matrix of transaction to title FPs is saved as "similarity_scores_of_transaction_to_title.csv".

This script then outputs the top 5 representative titles with the highest similarity scores to the given author pattern and to the given title pattern as the datasets "rep_titles_author_example1.csv" and "rep_titles_title_example1.csv", respectively.
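The similarity computation is plain cosine similarity between rows of the transaction weight matrix and the chosen pattern's weight vector. A NumPy sketch, assuming the CSV layouts produced above (pattern index 0 is illustrative):

    import numpy as np
    import pandas as pd

    T = pd.read_csv("transaction_weights.csv").to_numpy(dtype=float)   # transactions x context units
    W = pd.read_csv("Context_units_weights.csv").to_numpy(dtype=float)
    v = W[0]                                                           # chosen pattern's weight vector

    sims = (T @ v) / (np.linalg.norm(T, axis=1) * np.linalg.norm(v) + 1e-12)
    top5 = np.argsort(sims)[::-1][:5]    # row indices of the 5 most representative titles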

3.5. Find synonyms of the given pattern (Folder 5. Synonyms2Pattern)

Run "Find_synonyms_of_pattern.py" or "Find_synonyms_of_pattern.ipynb" to import "Context_units_weights.csv" and to compute the cosine similarity between the candidate patterns of similarity (e.g. all closed frequent patterns of authors) and the given pattern (e.g. the same author and title chosen in 2.3.2). Select the authors with the highest 5 similarity scores as the synonyms of the given author or title other than the author or title itself. This script generates 2 datasets for synonyms of author pattern: "coauthor_to_author_example1.csv", "syn_titles_to_author_example1.csv", and 2 datasets for synonyms of title pattern: "syn_titles_to_title_example1.csv" and "syn_authors_to_title_example1.csv".

3.6. A final display of the context annotation of the given pattern (Folder 6. ContextAnnotation)

Finally, run "Author_context_annotation_example1.py" (or "Author_context_annotation_example1.ipynb") and "Title_context_annotation_example1.py" (or "Title_context_annotation_example1.ipynb") to combine the output datasets generated in steps 3.3, 3.4 and 3.5. These scripts build the two examples of context annotation, one for the given author pattern and one for the given title pattern, and fulfill the two experiments in paper section 5.1.
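The final display is essentially a matter of reading the example CSVs back and printing them together; a minimal sketch for the author example (file names per the steps above):

    import pandas as pd

    definition = pd.read_csv("author_annotation_example1.csv")      # top context units (3.3)
    rep_titles = pd.read_csv("rep_titles_author_example1.csv")      # representative titles (3.4)
    coauthors  = pd.read_csv("coauthor_to_author_example1.csv")     # synonym authors (3.5)
    syn_titles = pd.read_csv("syn_titles_to_author_example1.csv")   # synonym titles (3.5)

    for name, table in [("Definition", definition), ("Representative titles", rep_titles),
                        ("Synonym authors", coauthors), ("Synonym titles", syn_titles)]:
        print(f"=== {name} ===")
        print(table.to_string(index=False))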
