
This is a Python script that extracts text from various file types: PDFs, image files (.jpeg, .png, etc.), Word documents (.docx), text files (.txt), Excel files (.xlsx, .xlsm, .xltx, .xltm) and CSV files.


marmmo/TextExtractor


What is this project about?

This script goes through all the files in a directory, extracts the text from each supported file type as a string, and returns a dictionary with the filename as the key and the extracted text as the value.

Supported file types are: .pdf, .png, .jpg, .gif, .tiff, .bmp, .txt, .docx, .xlsx, .xlsm, .xltx, .xltm, .csv.
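
The directory-to-dictionary behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the names `extract_text_from_directory`, `read_txt`, `read_csv` and `READERS` are hypothetical, and only the .txt and .csv cases are shown because they need nothing beyond the standard library. Readers for PDFs, images, Word and Excel files would be registered in the same table using whatever libraries requirements.txt installs.

```python
import csv
from pathlib import Path


def read_txt(path: Path) -> str:
    """Return the contents of a plain-text file as a string."""
    return path.read_text(encoding="utf-8", errors="replace")


def read_csv(path: Path) -> str:
    """Return CSV rows as lines of comma-separated cell values."""
    with path.open(newline="", encoding="utf-8", errors="replace") as handle:
        return "\n".join(", ".join(row) for row in csv.reader(handle))


# Hypothetical dispatch table: file extension -> reader function.
# Other supported extensions would be added here in the same way.
READERS = {
    ".txt": read_txt,
    ".csv": read_csv,
}


def extract_text_from_directory(directory):
    """Map each supported file name in `directory` to its extracted text."""
    results = {}
    for path in sorted(Path(directory).iterdir()):
        reader = READERS.get(path.suffix.lower())
        # Skip subdirectories and files with unsupported extensions.
        if path.is_file() and reader is not None:
            results[path.name] = reader(path)
    return results
```

Keeping the extension lookup in a single table means supporting a new file type is a matter of adding one reader function and one dictionary entry, while the directory loop stays unchanged.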

There is an additional function in a separate file that can be used to convert PDFs to image files (.jpg).

How to run this project?

To run this project, first install the dependencies:

pip install -r requirements.txt

Then run the script with:

python main.py
