
PDFTextSearch is a Spring Boot backend service that extracts text from uploaded PDF documents using Apache Tika and indexes the extracted content into Elasticsearch for full-text search capabilities. Users can upload PDFs, search through their content, and retrieve matching documents.


vahabov007/PDFTextSearch


PDFTextSearch

A Spring Boot backend service that provides full-text search over PDF documents using Apache Tika for text extraction and Elasticsearch for indexing.

Features

  • PDF Upload - REST API for uploading and storing PDF files
  • Text Extraction - Apache Tika integration for PDF text extraction
  • Elasticsearch Indexing - Automatic indexing of extracted text
  • Full-text Search - Query the indexed text of uploaded PDFs through the search endpoint
  • Async Processing - Background processing for large files
  • Pagination - Support for paginated search results
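
The upload, extraction, indexing, and async features listed above imply a pipeline in which the upload call returns before the slow work finishes. The following is a minimal sketch of that idea using only the JDK. The project itself presumably relies on Spring's async support (AsyncConfig.java) and Apache Tika, so the class name, the `extractText` stub, and the in-memory `index` map are hypothetical stand-ins rather than the project's code.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical stand-in for the upload -> extract -> index pipeline.
final class AsyncIndexingSketch {
    // Stand-in for the Elasticsearch index: document id -> extracted text.
    private final Map<String, String> index = new ConcurrentHashMap<>();
    private final ExecutorService worker = Executors.newFixedThreadPool(2);

    // Stub for text extraction; a real service would call Apache Tika here.
    static String extractText(byte[] fileBytes) {
        return new String(fileBytes, StandardCharsets.UTF_8).trim();
    }

    // Returns immediately; extraction and indexing run on the worker pool.
    CompletableFuture<String> submit(String id, byte[] fileBytes) {
        return CompletableFuture
                .supplyAsync(() -> extractText(fileBytes), worker)
                .thenApply(text -> {
                    index.put(id, text);
                    return id;
                });
    }

    // Looks up the text stored for a document, or null if it is not indexed yet.
    String indexedText(String id) {
        return index.get(id);
    }

    void shutdown() {
        worker.shutdown();
    }

    public static void main(String[] args) {
        AsyncIndexingSketch sketch = new AsyncIndexingSketch();
        byte[] fakePdf = "  hello pdf  ".getBytes(StandardCharsets.UTF_8);
        // join() blocks here only to make the demo deterministic.
        String id = sketch.submit("doc-1", fakePdf).join();
        System.out.println(id + " -> " + sketch.indexedText(id));
        sketch.shutdown();
    }
}
```

A caller that does not wait on the returned future sees the document in search results only after the background task completes, which is the usual trade-off of async indexing.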

Quick Start

Prerequisites

  • Java 17+
  • Maven 3.6+
  • Elasticsearch 8.x

1. Clone & Setup

git clone https://github.com/vahabov007/PDFTextSearch.git
cd PDFTextSearch

2. Start Elasticsearch

docker run -d -p 9200:9200 -p 9300:9300 \
  -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  docker.elastic.co/elasticsearch/elasticsearch:8.11.0

3. Run Application

mvn spring-boot:run

API Endpoints

Upload PDF

POST /api/pdf/upload
Content-Type: multipart/form-data
Body: file=@document.pdf
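
For callers that are not using curl, the same request can be built with the JDK's `java.net.http` client. This is a sketch under stated assumptions: the form field name `file` is taken from the request above, the base URL is whatever host and port the application runs on (port 8080 in the Configuration section), and `PdfUploadSketch` with its methods is a hypothetical helper, not part of the project.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical client for POST /api/pdf/upload; not part of the project.
final class PdfUploadSketch {

    // Builds a multipart/form-data body containing a single file part.
    static byte[] multipartBody(String boundary, String field, String filename, byte[] content) {
        String header = "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"" + field
                + "\"; filename=\"" + filename + "\"\r\n"
                + "Content-Type: application/pdf\r\n\r\n";
        String footer = "\r\n--" + boundary + "--\r\n";
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.writeBytes(header.getBytes(StandardCharsets.UTF_8));
        out.writeBytes(content);
        out.writeBytes(footer.getBytes(StandardCharsets.UTF_8));
        return out.toByteArray();
    }

    // Sends the file; requires the application to be running at baseUrl.
    static HttpResponse<String> upload(String baseUrl, Path pdf)
            throws IOException, InterruptedException {
        String boundary = "pdfsearch-" + System.nanoTime();
        byte[] body = multipartBody(boundary, "file",
                pdf.getFileName().toString(), Files.readAllBytes(pdf));
        HttpRequest request = HttpRequest.newBuilder(URI.create(baseUrl + "/api/pdf/upload"))
                .header("Content-Type", "multipart/form-data; boundary=" + boundary)
                .POST(HttpRequest.BodyPublishers.ofByteArray(body))
                .build();
        return HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```

With the application running locally, `PdfUploadSketch.upload("http://localhost:8080", Path.of("document.pdf"))` would send the file. The shape of the response body is not documented here, so the sketch returns it as a plain string.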

Search PDFs

GET /api/pdf/search?query=spring+boot&page=0&size=10
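
Queries that contain spaces or reserved characters must be encoded before they go into the URL. The helper below is a small sketch of how a client might build the search URL. The parameter names `query`, `page`, and `size` come from the request above; treating `page` as zero-based is an assumption drawn from the `page=0` example, and `SearchUriSketch` is a hypothetical name.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for GET /api/pdf/search; not part of the project.
final class SearchUriSketch {

    // Form-encodes the query so spaces become '+' and reserved characters are escaped.
    static URI searchUri(String baseUrl, String query, int page, int size) {
        if (page < 0 || size <= 0) {
            throw new IllegalArgumentException("page must be >= 0 and size must be > 0");
        }
        String encoded = URLEncoder.encode(query, StandardCharsets.UTF_8);
        return URI.create(baseUrl + "/api/pdf/search?query=" + encoded
                + "&page=" + page + "&size=" + size);
    }

    public static void main(String[] args) {
        // Prints http://localhost:8080/api/pdf/search?query=spring+boot&page=0&size=10
        System.out.println(searchUri("http://localhost:8080", "spring boot", 0, 10));
    }
}
```

Requesting the next page is then a matter of calling `searchUri` again with `page + 1` and the same `size`.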

Get All Documents

GET /api/pdf/all

Delete Document

DELETE /api/pdf/{id}
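
The list and delete endpoints take no request body, so a client only needs the right method and path. The sketch below builds both requests with `java.net.http.HttpRequest`. The format of `{id}` is not specified in this README, so the id is percent-encoded defensively; `DocumentRequestsSketch` is a hypothetical name.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

// Hypothetical request builders for GET /api/pdf/all and DELETE /api/pdf/{id}.
final class DocumentRequestsSketch {

    static HttpRequest listAll(String baseUrl) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/api/pdf/all")).GET().build();
    }

    static HttpRequest delete(String baseUrl, String id) {
        // URLEncoder produces form encoding, where a space becomes '+';
        // in a path segment a space must be %20, so map it back.
        String segment = URLEncoder.encode(id, StandardCharsets.UTF_8).replace("+", "%20");
        return HttpRequest.newBuilder(URI.create(baseUrl + "/api/pdf/" + segment)).DELETE().build();
    }
}
```

Either request can be sent with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())` once the application is running.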

Technology Stack

  • Java 17
  • Spring Boot 3.2.0
  • Apache Tika 2.9.1
  • Elasticsearch 8.x
  • Spring Data Elasticsearch

Project Structure

src/main/java/com/example/pdftextsearch/
├── config/
│   ├── ElasticsearchConfig.java
│   └── AsyncConfig.java
├── controller/
│   └── PDFController.java
├── model/
│   └── PDFDocument.java
├── repository/
│   └── PDFDocumentRepository.java
├── service/
│   ├── PDFDocumentService.java
│   ├── PDFTextExtractionService.java
│   └── FileStorageService.java
└── PdfTextSearchApplication.java

Configuration

server:
  port: 8080

spring:
  data:
    elasticsearch:
      uris: http://localhost:9200

file:
  upload-dir: ./uploads
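
Because `file.upload-dir` is a relative path, where files land depends on the working directory the application is started from. As an illustration only, here is one way a storage component could resolve that directory and reject file names that would escape it. `UploadDirSketch` and its methods are hypothetical names, and the project's FileStorageService may behave differently.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical handling of the file.upload-dir setting; not the project's code.
final class UploadDirSketch {

    // Resolves the configured directory to an absolute path and creates it if missing.
    static Path prepare(String uploadDir) {
        Path root = Paths.get(uploadDir).toAbsolutePath().normalize();
        try {
            return Files.createDirectories(root);
        } catch (IOException e) {
            throw new UncheckedIOException("Cannot create upload directory " + root, e);
        }
    }

    // Resolves a client-supplied file name inside root, rejecting path traversal.
    static Path resolveTarget(Path root, String filename) {
        Path target = root.resolve(filename).normalize();
        if (!target.startsWith(root)) {
            throw new IllegalArgumentException("File name escapes upload directory: " + filename);
        }
        return target;
    }
}
```

For example, `resolveTarget(prepare("./uploads"), "document.pdf")` yields a path inside the uploads directory, while a name such as `../evil.pdf` is rejected.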

License

This project is licensed under the MIT License.
