21-765 MidTerm Feedback

<INDEX>

Midterm Comments

Problem definition

Find groups of 2 or more consecutive characters (as many as possible, including spaces or symbols) that occur in at least 60% of the entries in a column of a CSV file.

"as many as possible" -- as many characters per group, or as many groups?

Matching substrings in column '018am_19pm_dist': ['m dist', ' d', 'pm', 'pm di', 'am ', 'pm dis', 'ist', 'st',
' di', 'dis', 'dist', 'm 1', 'm dis', ' dis', 'is', 'm di', 'pm d', 'di', 'am', 'pm ', 'm d', ' dist', 'm ', 'am 1', ' 1',
'pm dist']

-vs-

Matching substrings in column '018am_19pm_dist': ['am 1', 'pm dist']
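
The step from the first output to the second can be sketched as a containment filter: keep a match only if it is not a proper substring of another match. This is a minimal sketch of one possible reading of "as many characters as possible", not necessarily the intended one. The helper name keep_maximal is hypothetical; the input list is the first output shown above.

```python
def keep_maximal(matches):
    """Drop every match that is a proper substring of another match."""
    return [
        m for m in matches
        if not any(m != other and m in other for other in matches)
    ]


# The full list of matches reported for column '018am_19pm_dist'.
all_matches = ['m dist', ' d', 'pm', 'pm di', 'am ', 'pm dis', 'ist', 'st',
               ' di', 'dis', 'dist', 'm 1', 'm dis', ' dis', 'is', 'm di',
               'pm d', 'di', 'am', 'pm ', 'm d', ' dist', 'm ', 'am 1', ' 1',
               'pm dist']

print(keep_maximal(all_matches))  # ['am 1', 'pm dist']
```

Of the 26 reported matches, 24 are fragments of the other two, so the short list carries the same information.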

User communication

API description

The find_matching_substrings() function takes a column of data and does the core algorithmic work: it searches the column for substrings that satisfy the specified criteria. It iterates over the column data until all possible matches are found, then prepares the results for return to the Kernel module.

find_matching_substrings(column, min_length=2, match_percentage=60, greedy_match=0)
   column - name of column
   min_length - minimum length of a matching substring (default = 2)
   match_percentage - minimum percentage of entries that must contain the substring (default = 60)
   greedy_match - 1 to report only the longest matches, 0 to report all submatches (default = 0)
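
A minimal brute-force sketch of a function with this signature follows. It is one plausible implementation under stated assumptions, not the actual one: column is assumed to be a list of strings holding the column's values (the description mentions both a column of data and a column name), the result is assumed to be a sorted list, and the sample data is hypothetical.

```python
from collections import Counter


def find_matching_substrings(column, min_length=2, match_percentage=60,
                             greedy_match=0):
    # Assumption: `column` is a list of strings (the column's values).
    counts = Counter()
    for entry in column:
        # Collect each distinct substring once per entry, so repeats
        # inside a single entry do not inflate the count.
        seen = {
            entry[i:j]
            for i in range(len(entry))
            for j in range(i + min_length, len(entry) + 1)
        }
        counts.update(seen)

    # Keep substrings found in at least match_percentage % of the entries.
    matches = [
        s for s, n in counts.items()
        if n * 100 >= match_percentage * len(column)
    ]

    if greedy_match:
        # Keep only matches that are not contained in another match.
        matches = [
            m for m in matches
            if not any(m != other and m in other for other in matches)
        ]
    return sorted(matches)


# Hypothetical sample data: 3 of 5 entries (60%) share two fragments.
sample = ["9am 12pm dist", "8am 11pm dist", "7am 10pm dist", "closed", "n/a"]

print(find_matching_substrings(sample, greedy_match=1))  # ['am 1', 'pm dist']
print(len(find_matching_substrings(sample)))             # 26
```

Enumerating every substring is quadratic in the entry length, which is acceptable for short cell values but would need a smarter approach for long text.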

<INDEX>