Automated sitemap generator pipeline that automatically creates or amends sitemap files with new links and registers them with popular crawlers such as Google.
- Apache Kafka: For real-time updates of newly created links with their relevant data
- PostgreSQL: For persistence, locking, and mapping links to sitemap files
- AWS S3: For persisting the actual sitemap XML files
- NestJS: A progressive Node.js framework for building efficient and scalable server-side applications
- Prisma.io: Next-generation Node.js and TypeScript ORM
- KafkaJS: Kafka client for Node.js
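As a rough sketch of how these pieces fit together, the NestJS application can be bootstrapped as a Kafka microservice using the built-in Kafka transport (which wraps KafkaJS). The broker address, environment variable name, client id, and consumer group below are illustrative assumptions, not the project's actual configuration.

```typescript
// main.ts - hypothetical bootstrap wiring NestJS to Kafka (broker/client/group names are placeholders)
import { NestFactory } from '@nestjs/core';
import { MicroserviceOptions, Transport } from '@nestjs/microservices';
import { AppModule } from './app.module';

async function bootstrap() {
  const app = await NestFactory.createMicroservice<MicroserviceOptions>(AppModule, {
    transport: Transport.KAFKA,
    options: {
      client: {
        clientId: 'sitemap-generator',                          // assumed client id
        brokers: [process.env.KAFKA_BROKER ?? 'localhost:9092'], // assumed env var / broker
      },
      consumer: {
        groupId: 'sitemap-generator-consumer',                  // assumed consumer group
      },
    },
  });
  await app.listen();
}
bootstrap();
```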
Create a .env file and override the custom environment values as needed
$ cat .env.example > .env
Install dependencies
$ npm install
Run DB Migrations
$ npx prisma migrate dev
# development
$ npm run start
# watch mode
$ npm run start:dev
# production mode
$ npm run start:prod
The compose file provides:
- Apache Kafka with Zookeeper, which automatically creates the necessary topics
- A Kafka UI accessible at port 8080
- PostgreSQL 13 running on port 5432
- A PostgreSQL UI (Adminer) accessible at port 8081
docker compose -f docker-compose.yaml up -d
# unit tests
$ npm run test
# e2e tests
$ npm run test:e2e
# test coverage
$ npm run test:cov
sequenceDiagram
KafkaProducer->>KafkaConsumer: Text/Static Link Created Event
Note over KafkaProducer,KafkaConsumer: When a new text/static link is created
KafkaConsumer->>Postgres: Consume the event and write the link to a table
CronScheduler->>+Postgres: At each interval, fetch new text/static links for sitemap generation
Note over CronScheduler,Postgres: Determine the filename using the id modulo 10
CronScheduler->>+s3: Fetch the sitemap XML file from S3 if it is recorded in the database
s3->>CronScheduler: Return the sitemap file if requested
CronScheduler->>+s3: Update or create the sitemap file in S3
s3->>CronScheduler: Success Ack
CronScheduler->>+Postgres: Write the sitemap file name for the links in the database
Postgres->>CronScheduler: Success Ack
CronScheduler->>+Google: Send sitemap file to Google for crawling
CronScheduler->>+Postgres: Fetch new sitemap files waiting to be linked to an index file
Postgres->>CronScheduler: New sitemap files
CronScheduler->>+s3: Write the new sitemap file links into the corresponding sitemap index file in S3
s3->>CronScheduler: Success Ack
CronScheduler->>+Postgres: Write the sitemap index file name for the new sitemap files in the database
Postgres->>CronScheduler: Success Ack
CronScheduler->>+Google: Send sitemap file to Google for crawling
CronScheduler->>+Postgres: Fetch new sitemap index files waiting to be linked in the robots.txt file
Postgres->>CronScheduler: New sitemap index files
CronScheduler->>+s3: Write the new sitemap index file links into the robots.txt file
s3->>CronScheduler: Success Ack
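A minimal sketch of the first step in the diagram: a NestJS handler consuming link-created events and persisting them with Prisma. The topic name, payload shape, and `link` model are assumed for illustration and may differ from the actual codebase.

```typescript
// link.consumer.ts - hypothetical consumer persisting link-created events (topic and model names are assumed)
import { Controller } from '@nestjs/common';
import { EventPattern, Payload } from '@nestjs/microservices';
import { PrismaClient } from '@prisma/client';

const prisma = new PrismaClient();

interface LinkCreatedEvent {
  id: number;               // assumed bigint/int id, used later for modulo-10 partitioning
  url: string;
  type: 'text' | 'static';
}

@Controller()
export class LinkConsumer {
  // Handles messages from the (assumed) "link.created" topic
  @EventPattern('link.created')
  async handleLinkCreated(@Payload() event: LinkCreatedEvent): Promise<void> {
    // Store the link with no sitemap file assigned yet; the cron picks it up later
    await prisma.link.create({
      data: {
        id: event.id,
        url: event.url,
        type: event.type,
        sitemapFileName: null, // assigned by the cron once the sitemap file is generated
      },
    });
  }
}
```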
Every new link is partitioned based on its id (expected to be a bigint/int) using modulus base 10.
Hence each type of entity link produces 10 different sitemap files, which follow the naming convention text-sitemap-{modulusHashBase}-{modulusHashValue}-{fileIncrement}.xml,
where modulusHashBase is 10, modulusHashValue is id % 10,
and fileIncrement is the number of files already created for that type of entity, incrementing after every 50000
links, which is the limit on the number of URLs a single sitemap file can contain.
Sitemap index files follow the same partitioning logic, with the file increment happening after 20000
sitemap URLs in a sitemap index file.
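To make the naming scheme concrete, here is an illustrative sketch of how the target file names could be derived; the helper names, and the index-file naming pattern in particular, are assumptions rather than the project's actual conventions.

```typescript
// sitemap-naming.ts - illustrative helpers for the partitioning scheme described above
const MODULUS_HASH_BASE = 10;      // number of partitions per entity type
const URLS_PER_SITEMAP = 50000;    // max URLs allowed in a single sitemap file
const SITEMAPS_PER_INDEX = 20000;  // max sitemap URLs per sitemap index file

// e.g. entityType="text", id=123457, linksInBucket=60000 -> "text-sitemap-10-7-1.xml"
export function sitemapFileName(entityType: string, id: number, linksInBucket: number): string {
  const modulusHashValue = id % MODULUS_HASH_BASE;
  const fileIncrement = Math.floor(linksInBucket / URLS_PER_SITEMAP);
  return `${entityType}-sitemap-${MODULUS_HASH_BASE}-${modulusHashValue}-${fileIncrement}.xml`;
}

// Sitemap index files follow the same idea, rolling over every 20000 sitemap URLs
// (the "-index-" naming pattern here is an assumption)
export function sitemapIndexFileName(entityType: string, id: number, sitemapsInBucket: number): string {
  const modulusHashValue = id % MODULUS_HASH_BASE;
  const fileIncrement = Math.floor(sitemapsInBucket / SITEMAPS_PER_INDEX);
  return `${entityType}-sitemap-index-${MODULUS_HASH_BASE}-${modulusHashValue}-${fileIncrement}.xml`;
}
```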
So, in theory, the service can be scaled out to 10 processes working in parallel. It is also written to handle multiple nodes running for the same modulusHashValue for higher reliability: the crons lock the files via table locks so that no other process accesses the same file at the same time.
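As an illustration of that locking idea, the sketch below takes a row-level `SELECT ... FOR UPDATE` lock on the sitemap file's row inside a Prisma transaction, which gives the same mutual exclusion as the table locks mentioned above; the `SitemapFile` table and `name` column are assumptions.

```typescript
// Illustrative only: hold a per-file lock inside a transaction so concurrent
// nodes handling the same modulusHashValue cannot update the same sitemap file.
// Table and column names ("SitemapFile", "name") are assumptions.
import { PrismaClient } from '@prisma/client';

const prisma = new PrismaClient();

async function withSitemapFileLock(fileName: string, work: () => Promise<void>): Promise<void> {
  await prisma.$transaction(async (tx) => {
    // Row-level lock held until the transaction commits; other transactions
    // selecting the same row FOR UPDATE will block here.
    await tx.$queryRaw`SELECT name FROM "SitemapFile" WHERE name = ${fileName} FOR UPDATE`;

    // Safe to fetch/update the S3 file and write back its metadata now.
    await work();
  });
}
```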
The repository is GNU licensed.