- Elizabeth Wang
- Steven Pan
NOTE: We've uploaded a PDF (Proposal.pdf) with the same information as we have below:
- What are the names and NetIDs of all your team members? Who is the captain?
- Elizabeth Wang, eyw3, Captain
- Steven Pan, stevenp6
- Which paper have you chosen? We have chosen the following paper: “Generating semantic annotations for frequent patterns with context analysis”
- Which programming language do you plan to use? Python
- Can you obtain the datasets used in the paper for evaluation? No
- If you answer “no” to Question 4, can you obtain a similar dataset (e.g. a more recent version of the same dataset, or another dataset that is similar in nature)? Yes, a more recent version of the dataset, derived from the dataset used in the paper, can be found here: https://dblp.org/faq/1474681.html (a parsing sketch follows this list)
- If you answer “no” to Questions 4 & 5, how are you going to demonstrate that you have successfully reproduced the method introduced in the paper? N/A
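For reference, the sketch below shows one hedged way to pull (author list, title) records out of the DBLP XML dump linked above. It is not part of the repository, and the filenames dblp.xml / dblp.dtd are assumptions about what has been downloaded from dblp.org.

```python
# Hedged sketch (not part of the repo): stream (authors, title) records out of
# a local DBLP dump. Assumes dblp.xml and dblp.dtd have been downloaded from
# https://dblp.org/xml/ into the working directory; the DTD is needed because
# dblp.xml uses character entities that are defined there.
from lxml import etree

RECORD_TAGS = {"article", "inproceedings", "incollection", "book",
               "phdthesis", "mastersthesis", "proceedings"}

def iter_dblp_records(xml_path="dblp.xml"):
    """Yield (authors, title) tuples, one per publication record."""
    for _, elem in etree.iterparse(xml_path, events=("end",), load_dtd=True):
        if elem.tag in RECORD_TAGS:
            authors = [a.text for a in elem.findall("author") if a.text]
            title_elem = elem.find("title")
            title = "".join(title_elem.itertext()).strip() if title_elem is not None else ""
            if authors and title:
                yield authors, title
            elem.clear()  # keep memory bounded while streaming the large file

if __name__ == "__main__":
    print(next(iter_dblp_records()))
```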
https://www.youtube.com/watch?v=3v8M0sW3xHc
- Install bs4 and nltk (if they're not already installed); urllib is part of the Python standard library and needs no separate install
- Run setup.sh (sh setup.sh) from CourseProject/ to:
- Build the CSV file containing all (author list, title) entries. The script utils/build_data_from_web.py creates a data/ directory and writes data.csv inside it. CSV format: author1, author2, author3, ..., Title, where each line corresponds to a single paper (a sketch that reads this file back is shown after this list)
- Create the libs/ directory and download spmf.jar, the JAR file for the SPMF library (download link: http://www.philippe-fournier-viger.com/spmf/index.php?link=download.php)
- Build frequent patterns for authors and for title terms -- data/frequent_author_patterns.txt and data/frequent_title_term_patterns.txt. All words are mapped to unique ids, and the id mappings are cached in data/author_id_mappings.txt and data/title_term_id_mappings.txt respectively. The code that builds these files is here: utils/frequent_pattern_mining/build_frequent_patterns.py (see the SPMF sketch after this list)
- Remove redundancies from the sequential frequent title patterns (data/frequent_title_term_patterns.txt) and write the resulting minimal patterns to a new file, data/minimal_title_term_patterns.txt (a redundancy-pruning sketch is given below)
- Build files that cache all mutual information values between pairs of author patterns, between author-title pattern pairs, and between pairs of title patterns (a mutual-information sketch is given below)
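The following is a minimal sketch (not part of the repo) of reading data.csv back into memory, assuming the layout described in the first step: every field before the last one is an author and the last field is the title.

```python
# Hedged sketch: load data/data.csv into (authors, title) tuples.
# Assumes every field before the last one is an author and the last field is
# the paper title, as described in the step above.
import csv

def load_papers(path="data/data.csv"):
    """Return a list of (authors, title) tuples read from data.csv."""
    papers = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            if len(row) < 2:
                continue  # skip malformed or empty lines
            *authors, title = (field.strip() for field in row)
            papers.append((authors, title))
    return papers

if __name__ == "__main__":
    papers = load_papers()
    print(f"loaded {len(papers)} papers; first entry: {papers[0]}")
```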
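The frequent-pattern step drives SPMF through libs/spmf.jar. Below is a hedged sketch of that kind of call using SPMF's documented command line (java -jar spmf.jar run [Algorithm] [input] [output] [params]); the PrefixSpan algorithm, the 1% minimum support, and the input filename are illustrative assumptions, not necessarily what build_frequent_patterns.py uses.

```python
# Hedged sketch: invoke an SPMF algorithm from Python via its CLI.
# The input file must already be in SPMF's integer-encoded format (which is
# why the setup step caches word -> id mappings). Algorithm name, support
# threshold, and the input filename below are illustrative assumptions.
import subprocess

def run_spmf(algorithm, input_path, output_path, *params, jar_path="libs/spmf.jar"):
    """Run `java -jar spmf.jar run <algorithm> <input> <output> <params...>`."""
    cmd = ["java", "-jar", jar_path, "run", algorithm,
           input_path, output_path, *[str(p) for p in params]]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    # e.g. mine sequential patterns over id-encoded title terms at 1% support
    run_spmf("PrefixSpan",
             "data/title_term_sequences.txt",        # hypothetical input name
             "data/frequent_title_term_patterns.txt",
             "1%")
```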
RELEVANT OUTPUT FILES FOR NEXT STAGE:
- data/frequent_author_patterns.txt (ID mappings: data/author_id_mappings.txt)
- data/minimal_title_term_patterns.txt (ID mappings: data/title_term_id_mappings.txt)
- data/author_author_mutual_info_patterns.txt
- data/author_title_mutual_info_patterns.txt
- data/title_title_mutual_info_patterns.txt
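The redundancy-removal step's notion of a "minimal" pattern isn't spelled out here; one common pruning heuristic, sketched below under that assumption, is to drop any pattern that occurs as a subsequence of another retained pattern.

```python
# Hedged sketch: prune patterns that are in-order subsequences of a longer
# kept pattern. Whether this matches the repo's definition of "minimal"
# title-term patterns is an assumption.
def is_subsequence(short, long):
    """True if `short` appears in order (not necessarily contiguously) in `long`."""
    it = iter(long)
    return all(term in it for term in short)

def remove_redundant(patterns):
    """Keep only patterns that are not subsequences of another kept pattern."""
    kept = []
    for pattern in sorted(patterns, key=len, reverse=True):
        if not any(is_subsequence(pattern, other) for other in kept):
            kept.append(pattern)
    return kept

if __name__ == "__main__":
    patterns = [("text", "mining"), ("text",), ("topic", "model")]
    print(remove_redundant(patterns))  # [('text', 'mining'), ('topic', 'model')]
```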
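The last step caches mutual information scores between pattern pairs, following the paper's context-analysis idea. The sketch below assumes "mutual information" here means pointwise mutual information estimated from the sets of papers that support each pattern; the repo's exact formula and any smoothing may differ.

```python
# Hedged sketch: pointwise mutual information between two patterns, estimated
# from the sets of paper ids supporting each pattern. The exact formula and
# any smoothing used by the repo (or the paper) may differ.
import math

def pointwise_mutual_info(support_a, support_b, num_papers):
    """PMI(a, b) = log( P(a, b) / (P(a) * P(b)) )."""
    joint = len(support_a & support_b)
    if joint == 0:
        return float("-inf")  # the patterns never co-occur
    p_a = len(support_a) / num_papers
    p_b = len(support_b) / num_papers
    p_ab = joint / num_papers
    return math.log(p_ab / (p_a * p_b))

if __name__ == "__main__":
    a = {1, 2, 3, 7}   # ids of papers containing pattern a
    b = {2, 3, 9}      # ids of papers containing pattern b
    print(pointwise_mutual_info(a, b, num_papers=10))
```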
NOTE: utils/parse_patterns.py contains utility methods to parse patterns into data structures and write them to files; you may find these methods useful
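For orientation, here is a hedged sketch of what such parsing can look like, assuming SPMF's usual output format (space-separated ids followed by "#SUP: count" on each line) and a tab-separated id-to-word mapping file; the real helpers in utils/parse_patterns.py may use different formats.

```python
# Hedged sketch of pattern-file parsing. Assumes SPMF's usual output format
# ("1 5 9 #SUP: 42" per line, with "-1" separating itemsets in sequential
# output) and an "id<TAB>word" mapping file; the real helpers in
# utils/parse_patterns.py may use different formats.
def load_id_mapping(path):
    """Return {id: word} from a tab-separated mapping file."""
    mapping = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                idx, word = line.rstrip("\n").split("\t", 1)
                mapping[idx] = word
    return mapping

def parse_spmf_patterns(path, id_to_word):
    """Yield (pattern_terms, support) tuples from an SPMF output file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if "#SUP:" not in line:
                continue
            items, support = line.split("#SUP:")
            ids = [tok for tok in items.split() if tok != "-1"]
            yield [id_to_word.get(i, i) for i in ids], int(support)
```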