Hi Eric,
On your second point, "Is there a better way to do this?"
You are going to use Spark Structured Streaming (SSS) to clean and enrich
the data and then push the messages to Kafka.
I assume you will be using foreachBatch in this case. What purpose does
Kafka serve in receiving the enriched data from SSS? Is there any reason
other than the hourly partitioning of your data?
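If it helps, a rough sketch of such a foreachBatch sink in PySpark is below.
The broker address, topic name and JSON payload are assumptions on my part,
and I am keying the Kafka records on device_id purely as an illustration:

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_json, struct

spark = SparkSession.builder.appName("sqs-clean-enrich").getOrCreate()

def write_to_kafka(batch_df, batch_id):
    # Clean/enrich the micro-batch here, then serialise each row to JSON
    # and hand it to the batch Kafka sink; device_id becomes the Kafka key.
    (batch_df
        .withColumn("value", to_json(struct(*batch_df.columns)))
        .selectExpr("CAST(device_id AS STRING) AS key", "value")
        .write
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder
        .option("topic", "enriched_devices")                # placeholder
        .save())

# "raw_stream" stands in for whatever streaming DataFrame reads the SQS messages.
# raw_stream.writeStream.foreachBatch(write_to_kafka).start()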
HTH
View my LinkedIn profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.
On Thu, 29 Apr 2021 at 18:07, Eric Beabes <mailinglists19@gmail.com> wrote:
> We've a use case where lots of messages will come in via AWS SQS from
> various devices. We're thinking of reading these messages using Spark
> Structured Streaming, cleaning them up as needed & saving each message on
> Kafka. Later we're thinking of using Kafka S3 Connector to push them to S3
> on an hourly basis; meaning there will be a different directory for each
> hour. Challenge is that, within this hourly "partition" the messages need
> to be "sorted by" a certain field (let's say device_id). Reason being,
> we're planning to create an EXTERNAL table on it with BUCKETS on device_id.
> This will speed up the subsequent Aggregation jobs.
>
> Questions:
>
> 1) Does Kafka S3 Connector allow messages to be sorted by a particular
> field within a partition – or – do we need to extend it?
> 2) Is there a better way to do this?
>
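For reference, a rough sketch of the kind of EXTERNAL table described above,
bucketed on device_id. The table, column and partition names, the bucket
count and the S3 location are all placeholders, and this assumes a Spark
session with Hive support. One caveat: neither Hive nor Spark will re-sort
or re-bucket files that were written externally, so whatever writes the
hourly files into S3 has to produce that layout itself.

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS device_events (
        device_id STRING,
        payload   STRING,
        event_ts  TIMESTAMP
    )
    PARTITIONED BY (event_hour STRING)
    CLUSTERED BY (device_id) SORTED BY (device_id) INTO 64 BUCKETS
    STORED AS PARQUET
    LOCATION 's3://my-bucket/device_events/'
""")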