Case study

Efficient Data Processing for Real Estate System

Using Apache Kafka to enhance an existing Apache Spark software system, increase the efficiency of property market analysts' work and realize substantial savings in data handling costs.

Client

TMC America Group
United States

Project Duration

3 months
2 people

Services

Data Engineering Company

Client Challenge

The client - a real estate investment firm - is involved in a spectrum of property-related projects, which are very often supported by custom-built software. Among other things, the company needs to collect, process and analyse significant amounts of data gathered from real estate markets.

One of the client’s existing Apache Spark based software systems, used for processing complex high-volume geo-located data sets, caused the client to incur significant storage and processing costs. The goal of the project was to redesign the system architecture so that these costs could be substantially reduced. More specifically, the existing solution was optimized by pre-processing the data (grouping it by market polygon) and storing the grouped data in HDFS. The client requested a new feature that required a different grouping strategy. Storage costs were already quite high, and with the new grouping strategy they might easily double. On top of that, the new approach might also increase the time required to reindex all grouped data in HDFS. Such reindexing was required because market polygons were adjusted from time to time, which was beyond the client's control.
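
To illustrate why grouping by market polygon is expensive to maintain, here is a minimal sketch. It is not the client's Spark code: it assumes markets are simple polygons given as lists of (x, y) vertices, uses a standard ray-casting test, and the names `point_in_polygon` and `group_by_market` are hypothetical.

```python
from collections import defaultdict


def point_in_polygon(point, polygon):
    """Ray-casting test: True if the (x, y) point lies inside the polygon."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def group_by_market(properties, markets):
    """Assign every property id to each market whose polygon contains it."""
    groups = defaultdict(list)
    for prop_id, location in properties.items():
        for market_id, polygon in markets.items():
            if point_in_polygon(location, polygon):
                groups[market_id].append(prop_id)
    return dict(groups)


# Hypothetical sample data: two properties and one square market.
properties = {"p1": (1.0, 1.0), "p2": (3.0, 1.0)}
markets = {"m1": [(0, 0), (2, 0), (2, 2), (0, 2)]}
before = group_by_market(properties, markets)

# Adjusting the market polygon changes group membership,
# so every pre-computed group for that market has to be rebuilt.
markets["m1"] = [(0, 0), (4, 0), (4, 2), (0, 2)]
after = group_by_market(properties, markets)
```

In this toy example the widened polygon pulls a second property into the market, which is the kind of change that forced a full regrouping of the stored data.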

All in all, the new solution architecture was to ensure efficient data processing and storage and, at the same time, drive significant cost savings.

Service Process

With the new approach in mind, we proposed to design and implement a more flexible and cost-efficient solution based on Apache Kafka. Kafka plays the role of a single source of truth not only for the backend but also for the front-end part of the system (implemented by a third party in Ruby on Rails (RoR) and React.js).

In short - all the data collected from different services is fed into Kafka. More specifically, a set of dedicated microservices processes the data, serializes it to the Avro format and then compresses it with GZIP to reduce its size. Another service materializes part of the data to ElasticSearch, where the RoR application can make use of it. The reindexing issue was solved by producing specific events to Kafka: an event informs the indexing processors about a market update, a market removal, a new property, a property geolocation update or a property removal. This modification reduced the need for reindexing to a minimum, with the process currently running in almost real time.
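
The event flow described above can be sketched as follows. This is an illustration only, under stated assumptions: the event names and the `IndexEvent` type are hypothetical, and JSON from the standard library stands in for Avro so that the example runs without extra dependencies. Only the GZIP step uses the same compression format as the real system.

```python
import gzip
import json
from dataclasses import dataclass
from enum import Enum


class EventType(str, Enum):
    """The five kinds of change listed in the text (names are assumptions)."""
    MARKET_UPDATED = "market_updated"
    MARKET_REMOVED = "market_removed"
    PROPERTY_ADDED = "property_added"
    PROPERTY_GEOLOCATION_UPDATED = "property_geolocation_updated"
    PROPERTY_REMOVED = "property_removed"


@dataclass(frozen=True)
class IndexEvent:
    event_type: EventType
    entity_id: str


def encode(event):
    """Serialize an event (JSON here, Avro in the real system), then GZIP it."""
    payload = json.dumps(
        {"event_type": event.event_type.value, "entity_id": event.entity_id},
        sort_keys=True,
    ).encode("utf-8")
    return gzip.compress(payload)


def decode(blob):
    """Reverse of encode: decompress, parse, rebuild the event."""
    data = json.loads(gzip.decompress(blob).decode("utf-8"))
    return IndexEvent(EventType(data["event_type"]), data["entity_id"])


event = IndexEvent(EventType.PROPERTY_GEOLOCATION_UPDATED, "property-42")
blob = encode(event)      # bytes that a producer would send to a Kafka topic
restored = decode(blob)   # what an indexing processor would read back
```

Compression pays off on larger records or batches; for a payload as small as this one, the GZIP header can outweigh the savings.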

The biggest advantage is that the system can reuse all the SQL queries used by the RoR application. This was achieved by programming a pool of workers (Kafka consumers) that run on AWS spot instances. Based on the events listed earlier, the workers filter the Kafka topics (each with about 400M records) and select only the subset of data that needs to be processed, rather than reindexing the whole dataset. In the next step, the given SQL queries are executed against the selected data subset. The Kafka-consumer-based approach reduced infrastructure costs by close to 90%.
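
A minimal sketch of the worker logic follows. It makes several assumptions that the case study does not confirm: records are shown as an in-memory list rather than being read by a Kafka consumer, an in-memory SQLite database stands in for whatever SQL engine the workers use, and the table layout, `REPORT_SQL` query and function names are hypothetical.

```python
import sqlite3

# Hypothetical records as they might be read from a Kafka topic.
RECORDS = [
    {"property_id": "p1", "market_id": "m1", "price": 100000},
    {"property_id": "p2", "market_id": "m1", "price": 300000},
    {"property_id": "p3", "market_id": "m2", "price": 250000},
]

# Hypothetical report query of the kind an application might already use.
REPORT_SQL = (
    "SELECT market_id, COUNT(*), AVG(price) "
    "FROM properties GROUP BY market_id ORDER BY market_id"
)


def select_subset(records, event):
    """Keep only the records affected by the event instead of the whole topic."""
    if event["type"].startswith("market_"):
        return [r for r in records if r["market_id"] == event["id"]]
    return [r for r in records if r["property_id"] == event["id"]]


def run_report(subset, sql=REPORT_SQL):
    """Load the subset into a temporary table and run the existing SQL on it."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE properties (property_id TEXT, market_id TEXT, price REAL)"
        )
        conn.executemany(
            "INSERT INTO properties VALUES (:property_id, :market_id, :price)",
            subset,
        )
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


event = {"type": "market_updated", "id": "m1"}
subset = select_subset(RECORDS, event)   # two of the three records
report = run_report(subset)              # report rows for market m1 only
```

The point of the design is that the query itself stays unchanged; only the amount of data it runs against shrinks to what the event actually touched.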

Project Results

In business terms, the Apache Kafka based modification of the system significantly reduced report generation time, automated part of the process handled by analysts and so improved the efficiency of their work, substantially decreased the costs of data processing and storage, and - last but not least - laid an architectural foundation for further development of the client’s big data processing platform.

Deliverables

big data store based on Apache Kafka
data processors generating business reports for real estate analysts
geolocated index of real estate data based on ElasticSearch
flexible system architecture open to future enhancements

Benefits

report generation time reduced by an order of magnitude
partly automated report generation process driving analysts’ efficiency
significant data storage and processing cost savings
infrastructural foundation for a big data processing platform