ML Featurizer

Feature engineering is a difficult and time-consuming process. ML Featurizer is a library that lets users create additional features from raw data with ease. It extends and enriches Spark's existing feature engineering functionality.

Featurizers provided by the library

Unary Temporal Featurizers
- DayOfWeekFeaturizer
- HourOfDayFeaturizer
- MonthOfYearFeaturizer
- PartsOfDayFeaturizer
- WeekendFeaturizer
Unary Numeric Featurizers
- LogTransformFeaturizer
- MathFeaturizer
- PowerTransformFeaturizer
Binary Temporal Featurizers
- DateDiffFeaturizer
Binary Numeric Featurizers
- AdditionFeaturizer
- DivisionFeaturizer
- MultiplicationFeaturizer
- SubtractionFeaturizer
Binary String Featurizers
- ConcateColumnsFeaturizer
Grouping Featurizers
- GroupByFeaturizer (count, ratio, min, max, avg, sum)
GEO Featurizers
- GeohashFeaturizer (convert latitude and longitude into geohash)
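
As an illustration of what converting latitude and longitude into a geohash involves, here is a minimal standalone sketch of the standard geohash encoding in plain Scala. It is not the library's implementation: the object name `GeohashSketch`, the `encode` signature, and the `precision` parameter are assumptions made for this example, and GeohashFeaturizer's own parameters are not shown in this README.

```scala
object GeohashSketch {
  // Standard geohash base-32 alphabet (digits plus letters, without a, i, l, o).
  private val Base32 = "0123456789bcdefghjkmnpqrstuvwxyz"

  // Hypothetical helper: encodes a point into a geohash of `precision` characters.
  def encode(lat: Double, lon: Double, precision: Int): String = {
    var latLo = -90.0
    var latHi = 90.0
    var lonLo = -180.0
    var lonHi = 180.0
    val sb = new StringBuilder
    var useLon = true // bits alternate between longitude and latitude, longitude first
    var bits = 0      // number of bits collected for the current character
    var current = 0   // value of the current 5-bit group

    while (sb.length < precision) {
      if (useLon) {
        val mid = (lonLo + lonHi) / 2
        if (lon >= mid) { current = (current << 1) | 1; lonLo = mid }
        else { current = current << 1; lonHi = mid }
      } else {
        val mid = (latLo + latHi) / 2
        if (lat >= mid) { current = (current << 1) | 1; latLo = mid }
        else { current = current << 1; latHi = mid }
      }
      useLon = !useLon
      bits += 1
      if (bits == 5) {
        // Every 5 bits map to one base-32 character.
        sb.append(Base32.charAt(current))
        bits = 0
        current = 0
      }
    }
    sb.toString
  }

  def main(args: Array[String]): Unit = {
    println(encode(57.64911, 10.40744, 5)) // prints "u4pru"
  }
}
```

Each extra character narrows the cell, so nearby points tend to share a prefix. That shared prefix is what makes a geohash usable as a categorical or grouping feature.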

Examples:

Create day of week feature

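import org.apache.spark.sql.SparkSession
// DayOfWeekFeaturizer must also be imported from the ML Featurizer library (its package is not shown here).

object DayOfWeekFeaturizerExample {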
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("DayOfWeekFeaturizer").master("local").getOrCreate()

    val data = Array((0, "2018-01-02"),
      (1, "2018-02-02"),
      (2, "2018-03-02"),
      (3, "2018-04-05"),
      (4, "2018-05-05"))
    val dataFrame = spark.createDataFrame(data).toDF("id", "date")

    val featurizer = new DayOfWeekFeaturizer()
      .setInputCol("date")
      .setOutputCol("dayOfWeek")
      .setFormat("yyyy-MM-dd")

    val featurizedDataFrame = featurizer.transform(dataFrame)
    featurizedDataFrame.show()
  }
}
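
To make concrete what the temporal features in the example represent, the following standalone sketch derives a day of week, month of year, and weekend flag from a date string using only `java.time`, without Spark or the library. The names `TemporalFeatureSketch`, `TemporalFeatures`, and `derive` are hypothetical, and the ISO-8601 numbering (1 = Monday through 7 = Sunday) is an assumption; the encoding the featurizers actually emit is not stated in this README.

```scala
import java.time.{DayOfWeek, LocalDate}
import java.time.format.DateTimeFormatter

object TemporalFeatureSketch {
  // Hypothetical container for the derived values; not a type from ML Featurizer.
  final case class TemporalFeatures(dayOfWeek: Int, monthOfYear: Int, isWeekend: Boolean)

  // Parses `date` with the given pattern and derives simple calendar features.
  def derive(date: String, format: String = "yyyy-MM-dd"): TemporalFeatures = {
    val parsed = LocalDate.parse(date, DateTimeFormatter.ofPattern(format))
    val dow = parsed.getDayOfWeek
    TemporalFeatures(
      dayOfWeek = dow.getValue, // assumed ISO-8601 numbering: 1 = Monday ... 7 = Sunday
      monthOfYear = parsed.getMonthValue,
      isWeekend = dow == DayOfWeek.SATURDAY || dow == DayOfWeek.SUNDAY
    )
  }

  def main(args: Array[String]): Unit = {
    // Same date strings as the first and last rows of the example above.
    Seq("2018-01-02", "2018-05-05").foreach { d =>
      println(s"$d -> ${derive(d)}")
    }
  }
}
```

Under these assumptions, 2018-01-02 (a Tuesday) yields day of week 2 and no weekend flag, while 2018-05-05 (a Saturday) yields day of week 6 with the weekend flag set.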

Use featurizers in Spark ML Pipeline

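import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
import org.apache.spark.sql.SparkSession
// The featurizer classes must also be imported from the ML Featurizer library (their packages are not shown here).

object FeaturePipeline {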
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("FeaturePipeline").master("local").getOrCreate()

    val data = Array((0, "2018-01-02", 1.0, 2.0, "mercedes"),
      (1, "2018-02-02", 2.5, 3.5, "lexus"),
      (2, "2018-03-02", 5.0, 1.0, "toyota"),
      (3, "2018-04-05", 8.0, 9.0, "tesla"),
      (4, "2018-05-05", 1.0, 5.0, "bmw"))
    val dataFrame = spark.createDataFrame(data).toDF("id", "date", "price1", "price2", "brand")

    val dayOfWeekfeaturizer = new DayOfWeekFeaturizer()
      .setInputCol("date")
      .setOutputCol("dayOfWeek")
      .setFormat("yyyy-MM-dd")

    val monthOfYearfeaturizer = new MonthOfYearFeaturizer()
      .setInputCol("date")
      .setOutputCol("monthOfYear")
      .setFormat("yyyy-MM-dd")

    val weekendFeaturizer = new WeekendFeaturizer()
      .setInputCol("date")
      .setOutputCol("isWeekend")
      .setFormat("yyyy-MM-dd")

    val additionFeaturizer = new AdditionFeaturizer()
      .setInputCols("price1", "price2")
      .setOutputCol("price1_add_price2")

    val indexer = new StringIndexer()
      .setInputCol("brand")
      .setOutputCol("brandIndex")

    val encoder = new OneHotEncoder()
      .setInputCol("brandIndex")
      .setOutputCol("brandVector")

    val pipeline = new Pipeline()
      .setStages(Array(dayOfWeekfeaturizer, monthOfYearfeaturizer, weekendFeaturizer, additionFeaturizer,
        indexer, encoder))
    val model = pipeline.fit(dataFrame)
    model.transform(dataFrame).show()
  }
}

References:

An Empirical Analysis of Feature Engineering for Predictive Modeling

Contributing

If you're interested in contributing to this project, check out our contribution guidelines in CONTRIBUTING.md.
