Converts Microsoft Word document files (.docx extension) to Markdown files.
pip install docx2md
usage: docx2md [-h] [-m] [-v] [--debug] SRC.docx DST.md

positional arguments:
  SRC.docx        Microsoft Word file to read
  DST.md          Markdown file to write

optional arguments:
  -h, --help      show this help message and exit
  -m, --md_table  use Markdown table notation instead of <table>
  -v, --version   show version
  --debug         for debug
Tables are output as <table id="table(n)">, where n is the table's position in the output, counting from 1.
If --md_table is specified, tables are written in Markdown pipe (|) notation instead, and every cell of the header row is fixed to #.
| # | # | # |
|---|---|---|
|a|b|c|
|d|e|f|
|g|h|i|
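The layout above can be reproduced with a few lines of Python. This is only a sketch of the output shape, not the code docx2md uses internally; `render_md_table` is a hypothetical helper introduced here for illustration.

```python
from typing import List


def render_md_table(rows: List[List[str]]) -> str:
    """Render rows as a pipe table whose header cells are all '#'.

    Hypothetical helper for illustration; it is not part of docx2md.
    """
    if not rows:
        return ""
    width = max(len(row) for row in rows)
    # Header cells carry no text from the document, only a fixed '#'.
    header = "| " + " | ".join("#" for _ in range(width)) + " |"
    separator = "|" + "|".join("---" for _ in range(width)) + "|"
    # Pad short rows so every line has the same number of cells.
    body = ["|" + "|".join(row + [""] * (width - len(row))) + "|" for row in rows]
    return "\n".join([header, separator, *body])


print(render_md_table([["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]))
```

Because the header row is fixed, the first row of the Word table appears as an ordinary body row.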
Images are output as <img id="image(n)">, where n is the image's position in the output, counting from 1.
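Since tables and images both carry sequential ids, the converted text can be post-processed by id. The sketch below collects those ids with a regular expression. It assumes the ids appear as `table1`, `image1`, and so on (n substituted without parentheses); the `src` attributes in the sample string are made up, and `collect_ids` is a hypothetical helper, not part of docx2md.

```python
import re
from typing import Dict, List

# Assumption: ids look like id="table1" / id="image1" in the converted output.
ID_PATTERN = re.compile(r'<(table|img)\b[^>]*\bid="(table|image)(\d+)"')


def collect_ids(markdown_text: str) -> Dict[str, List[int]]:
    """Return the table and image numbers found in converted output."""
    found: Dict[str, List[int]] = {"table": [], "image": []}
    for tag, kind, number in ID_PATTERN.findall(markdown_text):
        # Accept table ids only on <table> and image ids only on <img>.
        if (tag, kind) in (("table", "table"), ("img", "image")):
            found[kind].append(int(number))
    return found


# Made-up sample output; the src paths are illustrative only.
sample = (
    '<table id="table1">\n<tr><td>a</td></tr>\n</table>\n\n'
    '<img id="image1" src="media/image1.png">\n'
    '<img id="image2" src="media/image2.png">'
)
print(collect_ids(sample))
```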
- source: example/example.docx
- result: example/README.md, example/media/*
- Tables (including merged cells)
- Lists (also with numbers as bullets)
- Headings
- Embedded images
- Page breaks (converted to <div class="break"></div>)
- Line breaks within paragraphs (converted to <br>)
- Text boxes (inserted in the body)
- Table of Contents
- Text decoration (bold, etc.)
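Because page breaks survive as a literal `<div class="break"></div>` marker, the converted text can be cut back into pages afterwards. The following is a minimal sketch of that idea; `split_pages` is a hypothetical helper and assumes the marker appears exactly as written in the list above.

```python
from typing import List

# Marker text taken from the feature list; assumed to appear verbatim.
PAGE_BREAK = '<div class="break"></div>'


def split_pages(markdown_text: str) -> List[str]:
    """Split converted Markdown into per-page chunks at each page break."""
    return [part.strip() for part in markdown_text.split(PAGE_BREAK)]


pages = split_pages('# One\n\n<div class="break"></div>\n\n# Two')
print(len(pages))
```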
- docx2md.do_convert
>>> help(docx2md.do_convert)
Help on function do_convert in module docx2md.convert:

do_convert(docx_file: str, target_dir='', use_md_table=False) -> str
    convert docx_file to Markdown text and return it

    Args:
        docx_file(str): a file to parse
        target_dir(str): save images into target_dir/media/ if specified
        use_md_table(bool): use Markdown table notation instead of HTML

    Returns:
        Markdown text(str)
- docx2md.DocxFile
- docx2md.DocxMedia
- docx2md.Converter
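A minimal sketch of calling `do_convert` from a script, based on the signature shown in the help text. The implementation listed in this README returns a string beginning with `Exception: ` instead of raising, so the sketch checks for that prefix. `looks_like_error` and `convert_to_file` are hypothetical helpers, and using the destination file's directory as `target_dir` is an assumption made so that images land next to the Markdown file.

```python
from pathlib import Path


def looks_like_error(markdown_text: str) -> bool:
    """Heuristic: do_convert reports failures as 'Exception: ...' text.

    A document whose first words are 'Exception: ' would be misdetected.
    """
    return markdown_text.startswith("Exception: ")


def convert_to_file(src: str, dst: str, use_md_table: bool = False) -> None:
    """Convert src (.docx) and write the Markdown text to dst."""
    import docx2md  # imported lazily so the helper above works without it

    # Assumption: save images under <directory of dst>/media/.
    target_dir = str(Path(dst).parent)
    text = docx2md.do_convert(src, target_dir=target_dir, use_md_table=use_md_table)
    if looks_like_error(text):
        raise RuntimeError(text)
    Path(dst).write_text(text, encoding="utf-8")
```

For example, `convert_to_file("example/example.docx", "out/README.md")` would write the Markdown to `out/README.md`, assuming the `out` directory already exists.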
Refer to the do_convert implementation for the usage of each class.
def do_convert(docx_file: str, target_dir="", use_md_table=False) -> str:
    try:
        docx = DocxFile(docx_file)
        media = DocxMedia(docx)
        if target_dir:
            media.save(target_dir)
        converter = Converter(docx.document(), media, use_md_table)
        return converter.convert()
    except Exception as e:
        return f"Exception: {e}"

- 1.0.4 fix issue #6
- 1.0.3 add API
- 1.0.2 change packaging system to pyproject.toml