DocumentAtom provides a light, fast library for breaking input documents into constituent parts (atoms), useful for text processing, analysis, and artificial intelligence.
| Package |
|---|
| DocumentAtom.Excel |
| DocumentAtom.Image |
| DocumentAtom.Markdown |
| DocumentAtom.Pdf |
| DocumentAtom.PowerPoint |
| DocumentAtom.Text |
| DocumentAtom.Word |
- Initial release
Parsing documents and extracting constituent parts is one part science and one part black magic. If you find ways to improve processing and extraction that are broadly useful, I would love your feedback on how to make this library more accurate, more useful, faster, and overall better. My goal in building this library is to make it easier to analyze input data assets and make them more consumable by other systems, including analytics and artificial intelligence.
Please feel free to file issues, enhancement requests, or start discussions about use of the library, improvements, or fixes.
DocumentAtom supports the following input file types:
- Text
- Markdown
- Microsoft Word (.docx)
- Microsoft Excel (.xlsx)
- Microsoft PowerPoint (.pptx)
- PNG images (requires Tesseract on the host)
- PDF
Refer to the various Test projects for working examples.
The following example shows processing a markdown (.md) file.
```csharp
using System;
using DocumentAtom.Core.Atoms;
using DocumentAtom.Markdown;

// Placeholder path; substitute the markdown file you want to process
string filename = "document.md";

MarkdownProcessorSettings settings = new MarkdownProcessorSettings();
MarkdownProcessor processor = new MarkdownProcessor(settings);

// Print each atom extracted from the file
foreach (MarkdownAtom atom in processor.Extract(filename))
    Console.WriteLine(atom.ToString());
```

DocumentAtom parses input data assets into a variety of Atom objects. Each Atom carries top-level metadata, including:
- `GUID`
- `Type` - including `Text`, `Image`, `Binary`, `Table`, and `List`
- `PageNumber` - where available; some document types do not explicitly indicate page numbers, and page numbers are inferred when rendered
- `Position` - the ordinal position of the `Atom`, relative to others
- `Length` - the length of the `Atom`'s content
- `MD5Hash` - the MD5 hash of the `Atom` content
- `SHA1Hash` - the SHA1 hash of the `Atom` content
- `SHA256Hash` - the SHA256 hash of the `Atom` content
- `Quarks` - sub-atomic particles created from the `Atom` content, for instance, when chunking text
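As a rough illustration of how the length and hash metadata relate to an Atom's content, the sketch below computes the same kinds of values over a byte array using only the .NET base class library (.NET 5 or later). It is not DocumentAtom's own implementation: the sample content, the variable names, and the uppercase hex formatting are assumptions, and the library may compute or format these values differently (for example, `Length` might count characters rather than bytes).

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical stand-in for an Atom's content; not produced by DocumentAtom
byte[] content = Encoding.UTF8.GetBytes("Hello, world!");

// Hash the raw bytes and render each digest as an uppercase hex string
string md5 = Convert.ToHexString(MD5.HashData(content));
string sha1 = Convert.ToHexString(SHA1.HashData(content));
string sha256 = Convert.ToHexString(SHA256.HashData(content));

Console.WriteLine($"Length : {content.Length}");
Console.WriteLine($"MD5    : {md5}");
Console.WriteLine($"SHA1   : {sha1}");
Console.WriteLine($"SHA256 : {sha256}");
```

Because the hashes depend only on the content, two atoms with identical content would produce identical values under this scheme, which is what makes such fields useful for deduplication and change detection.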
The AtomBase class provides the aforementioned metadata, and several type-specific Atoms are returned from the various processors, including:
- `BinaryAtom` - includes a `Bytes` property
- `DocxAtom` - includes `Text`, `HeaderLevel`, `UnorderedList`, `OrderedList`, `Table`, and `Binary` properties
- `ImageAtom` - includes `BoundingBox`, `Text`, `UnorderedList`, `OrderedList`, `Table`, and `Binary` properties
- `MarkdownAtom` - includes `Formatting`, `Text`, `UnorderedList`, `OrderedList`, and `Table` properties
- `PdfAtom` - includes `BoundingBox`, `Text`, `UnorderedList`, `OrderedList`, `Table`, and `Binary` properties
- `PptxAtom` - includes `Title`, `Subtitle`, `Text`, `UnorderedList`, `OrderedList`, `Table`, and `Binary` properties
- `TableAtom` - includes `Rows`, `Columns`, `Irregular`, and `Table` properties
- `TextAtom` - includes `Text`
- `XlsxAtom` - includes `SheetName`, `CellIdentifier`, `Text`, `Table`, and `Binary` properties
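Assuming the type-specific Atoms derive from `AtomBase`, a consumer can handle a mixed sequence of atoms by switching on the concrete type. The sketch below uses simplified stand-in classes defined locally so that it runs without the library. Only the `Position`, `Text`, and `Bytes` property names come from the descriptions above; their types (`int`, `string`, `byte[]`) are assumptions, and `Describe` is a hypothetical helper, not part of DocumentAtom.

```csharp
using System;
using System.Collections.Generic;

// A mixed list of atoms, as a processor might return them
List<AtomBase> atoms = new List<AtomBase>
{
    new TextAtom { Position = 0, Text = "Quarterly report" },
    new BinaryAtom { Position = 1, Bytes = new byte[] { 0x01, 0x02, 0x03 } }
};

foreach (AtomBase atom in atoms)
    Console.WriteLine(Describe(atom));

// Hypothetical helper: choose a description based on the atom's concrete type
static string Describe(AtomBase atom) => atom switch
{
    TextAtom t => $"[{t.Position}] text: {t.Text}",
    BinaryAtom b => $"[{b.Position}] binary: {b.Bytes.Length} bytes",
    _ => $"[{atom.Position}] {atom.GetType().Name}"
};

// Simplified stand-ins for illustration only; the real classes carry more metadata
class AtomBase { public int Position { get; set; } }
class TextAtom : AtomBase { public string Text { get; set; } = ""; }
class BinaryAtom : AtomBase { public byte[] Bytes { get; set; } = Array.Empty<byte>(); }
```

The fallback arm keeps the loop from failing when a processor returns an atom type the caller has not handled explicitly.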
Table objects inside Atom objects are always presented as SerializableDataTable objects (see SerializableDataTable for more information), which serialize simply and convert to native System.Data.DataTable objects.
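Once a table is available as a native `System.Data.DataTable`, it can be read with the standard ADO.NET types. The sketch below builds a small `DataTable` by hand as a stand-in for a converted table; the conversion call itself is not shown, since the SerializableDataTable API is not described here. The table name, column names, and values are made up for illustration.

```csharp
using System;
using System.Data;

// Hypothetical contents, standing in for a table extracted from a document
DataTable table = new DataTable("Example");
table.Columns.Add("Name", typeof(string));
table.Columns.Add("Quantity", typeof(int));
table.Rows.Add("Widget", 3);
table.Rows.Add("Gadget", 5);

// Print every cell as column=value, one row per line
foreach (DataRow row in table.Rows)
{
    foreach (DataColumn column in table.Columns)
        Console.Write($"{column.ColumnName}={row[column]} ");
    Console.WriteLine();
}

// Aggregate a typed column
int total = 0;
foreach (DataRow row in table.Rows)
    total += (int)row["Quantity"];

Console.WriteLine($"Total quantity: {total}");
```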
DocumentAtom is built on the shoulders of several libraries, without which this work would not be possible.
Each of these libraries was integrated as a NuGet package, and no source was included or modified from these packages.
My libraries used within DocumentAtom:
Please refer to CHANGELOG.md for version history.
Special thanks to iconduck.com and the content authors for producing this icon.