A fast Python library for finding common substrings of multiple strings, implemented in C++

Given a collection of strings, which substrings appear K times? Which of those substrings is the longest?

Installing

Make sure Cython is installed properly before installing:

pip install commonstrings

Usage and examples

Step 1. Import the library

from commonstrings import PyCommon_multiple_strings
tree = PyCommon_multiple_strings()  # init

Step 2. Build the data structure

Build from list of str:

tree.from_strings(list_str)

or build from file:

tree.from_path(<path_to_file>)

Note that the file is read line by line. Each line is treated as one sentence.
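As a small illustration of the expected file layout, the sketch below writes one sentence per line and reads the lines back with the standard library only. The file name `sentences.txt` and the UTF-8 encoding are assumptions made for this example, and the library call itself is left as a comment.

```python
import os
import tempfile

list_str = ['abc', 'abcxa', 'xamnb', 'yamnc', 'abcd']

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name; any path should do.
    path = os.path.join(tmp_dir, 'sentences.txt')

    # One sentence per line (UTF-8 is assumed here).
    with open(path, 'w', encoding='utf-8') as f:
        f.write('\n'.join(list_str) + '\n')

    # Read the file back to confirm the one-sentence-per-line layout.
    with open(path, encoding='utf-8') as f:
        lines_read = [line.rstrip('\n') for line in f]

    # With the library installed, the tree would be built here:
    # tree.from_path(path)

print(lines_read)
```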

Sample code:

>>> from commonstrings import PyCommon_multiple_strings
>>> tree = PyCommon_multiple_strings()  # init
>>> list_str = [
...     'abc',
...     'abcxa',
...     'xamnb',
...     'yamnc',
...     'abcd'
... ]
>>> tree.from_strings(list_str)

Step 3. Query

This library provides four types of query:

a) List some substrings that appear TIMES times:

tree.query(times=TIMES)

Output is a dictionary whose keys are integers K and whose values are lists of some substrings that appear exactly K times.

Sample code:

>>> print(tree.query(times=(2, None)))
{2: ['amn', 'xa', 'n', 'mn'], 3: ['bc', 'abc'], 4: ['c', 'b'], 5: ['a']}
>>> print(tree.query(times=(3, 3)))
{3: ['bc', 'abc']}
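
To make the counting convention concrete, here is a brute-force sketch in plain Python that does not use the library. It assumes, based on the sample output above, that a substring's count is the number of input strings containing it (so 'a' counts 5 even though it occurs twice in 'abcxa'). The helper name `group_substrings_by_count` is hypothetical, and unlike `tree.query` it enumerates every substring.

```python
from collections import defaultdict


def group_substrings_by_count(strings, min_times=1):
    """Brute force: map K -> sorted substrings contained in exactly K input strings."""
    containing = defaultdict(set)  # substring -> indices of strings containing it
    for idx, s in enumerate(strings):
        for start in range(len(s)):
            for end in range(start + 1, len(s) + 1):
                containing[s[start:end]].add(idx)

    groups = defaultdict(list)
    for sub, ids in containing.items():
        if len(ids) >= min_times:
            groups[len(ids)].append(sub)
    return {k: sorted(v) for k, v in sorted(groups.items())}


list_str = ['abc', 'abcxa', 'xamnb', 'yamnc', 'abcd']
result = group_substrings_by_count(list_str, min_times=2)
print(result)
```

For the sample list, this sketch returns `['ab', 'abc', 'bc']` for K = 3, a superset of the `['bc', 'abc']` shown in the library output. It is only practical for small inputs, since it enumerates a quadratic number of substrings per string.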

b) Length of the longest substring that appears TIMES times:

tree.length_longest_substring(times=TIMES)

Output is an integer.

Sample code:

>>> print(tree.length_longest_substring(times=(2, None)))
3

c) Lengths of the longest substrings that appear TIMES times:

tree.lengths_longest_substrings(times=TIMES)

Output is a dictionary whose keys are integers K and whose values are the length of the longest substring that appears exactly K times.

Sample code:

>>> print(tree.lengths_longest_substrings(times=(2, None)))
{2: 3, 3: 3, 4: 1, 5: 1}
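
The results of queries b) and c) can be cross-checked on small inputs with a brute-force sketch that does not use the library. It assumes a substring's count is the number of input strings containing it, as the sample output suggests. The function name `lengths_of_longest` is hypothetical.

```python
def lengths_of_longest(strings, min_times=1):
    """Brute force: map K -> length of the longest substring found in exactly K strings."""
    containing = {}  # substring -> indices of strings containing it
    for idx, s in enumerate(strings):
        for start in range(len(s)):
            for end in range(start + 1, len(s) + 1):
                containing.setdefault(s[start:end], set()).add(idx)

    longest = {}
    for sub, ids in containing.items():
        k = len(ids)
        if k >= min_times:
            longest[k] = max(longest.get(k, 0), len(sub))
    return dict(sorted(longest.items()))


list_str = ['abc', 'abcxa', 'xamnb', 'yamnc', 'abcd']
per_count = lengths_of_longest(list_str, min_times=2)
print(per_count)                # {2: 3, 3: 3, 4: 1, 5: 1}
print(max(per_count.values()))  # 3
```

The per-count dictionary corresponds to query c), and its maximum value corresponds to query b) for the same range.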

d) List some substrings that have length L and appear TIMES times:

tree.filter_substrings_by_length(length_input=L, times=TIMES)

Output is a dictionary whose keys are integers K and whose values are lists of some substrings that have length L and appear exactly K times.

Sample code:

>>> print(tree.filter_substrings_by_length(length_input=2, times=(2, None)))
{2: ['xa', 'mn'], 3: ['bc']}
>>> print(tree.filter_substrings_by_length(length_input=10, times=(2, None)))
{}
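
For comparison, a fixed-length sliding window gives the complete answer for a given length on small inputs. This is a sketch in plain Python, not the library's method. It assumes counts are the number of input strings containing the substring, and the name `substrings_of_length` is hypothetical.

```python
from collections import defaultdict


def substrings_of_length(strings, length, min_times=1):
    """Brute force: map K -> sorted substrings of the given length found in exactly K strings."""
    containing = defaultdict(set)  # substring -> indices of strings containing it
    for idx, s in enumerate(strings):
        # The range is empty when the string is shorter than `length`.
        for start in range(len(s) - length + 1):
            containing[s[start:start + length]].add(idx)

    groups = defaultdict(list)
    for sub, ids in containing.items():
        if len(ids) >= min_times:
            groups[len(ids)].append(sub)
    return {k: sorted(v) for k, v in sorted(groups.items())}


list_str = ['abc', 'abcxa', 'xamnb', 'yamnc', 'abcd']
print(substrings_of_length(list_str, 2, min_times=2))   # {2: ['am', 'mn', 'xa'], 3: ['ab', 'bc']}
print(substrings_of_length(list_str, 10, min_times=2))  # {}
```

The library output for length 2 is a subset of this result: 'am' and 'ab' also qualify but are not listed by the library.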

Some Notes:

- Params:

times is a tuple (default (None, None)) giving the minimum and maximum number of appearances for the desired output. As the samples above show, both bounds are inclusive.

To query substrings that appear exactly N times, use times=(N, N).

To query substrings that appear at least N times, use times=(N, None).

To query substrings that appear at most N times, use times=(None, N).

To query substrings that appear at least M times and at most N times, use times=(M, N).

length_input is an integer giving the length of the substrings.
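
The interpretation of the times tuple can be sketched as a small predicate. This assumes both bounds are inclusive, as the sample outputs for times=(2, None) and times=(3, 3) suggest. The helper `in_times_range` is hypothetical and not part of the library.

```python
def in_times_range(count, times=(None, None)):
    """Return True if count lies within the inclusive (minimum, maximum) range.

    A bound of None means that side of the range is unbounded.
    """
    minimum, maximum = times
    if minimum is not None and count < minimum:
        return False
    if maximum is not None and count > maximum:
        return False
    return True


print(in_times_range(3, (3, 3)))     # True
print(in_times_range(2, (2, None)))  # True
print(in_times_range(5, (None, 4)))  # False
```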

- This library accepts various UTF-8 characters, including punctuation, numbers, and upper-case characters. For more details, check out the `alphabet.cpp` file. Thanks to coccoc-tokenizer for providing these lists.

- Queries a) and d) do not output all possible results, but they can be used for analytic purposes.

Algorithm and Complexity

A suffix tree is the core data structure of this library. It enables fast queries and efficient storage.

If the total length of all input strings is L, then the average complexity should be O(L * log(L)).
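
To give an intuition for how a suffix structure answers "how many strings contain this substring", here is a naive, uncompressed suffix trie in Python. It is a sketch only, not the library's C++ implementation: a real suffix tree compresses edges and is built far more efficiently, while this version uses quadratic time and space. The names `Node`, `build_suffix_trie`, and `count_strings_containing` are hypothetical.

```python
class Node:
    def __init__(self):
        self.children = {}       # char -> Node
        self.string_ids = set()  # indices of input strings whose suffixes pass through here


def build_suffix_trie(strings):
    """Insert every suffix of every string, tagging each visited node with the string index."""
    root = Node()
    for idx, s in enumerate(strings):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node.children.setdefault(ch, Node())
                node.string_ids.add(idx)
    return root


def count_strings_containing(root, pattern):
    """Walk the pattern from the root; the node reached knows which strings contain it."""
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return 0
    return len(node.string_ids)


trie = build_suffix_trie(['abc', 'abcxa', 'xamnb', 'yamnc', 'abcd'])
print(count_strings_containing(trie, 'a'))    # 5
print(count_strings_containing(trie, 'abc'))  # 3
print(count_strings_containing(trie, 'amn'))  # 2
```

Every substring is a prefix of some suffix, so each node in the trie corresponds to one distinct substring, and the size of its `string_ids` set is that substring's count.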

References

The algorithm is taken from the lectures on String Algorithms and Algorithms in Computational Biology by Gusfield.

phamthivan2996/commonstrings
