Berk C. - "High-Level Architecture
Client
v
API Gateway
v
Object Storage
v
Message Queue
v
Worker
v
Database
The client should be able to submit documents through a website or directly through the API services.
The API Gateway should be used to upload documents and to get document info and state.
Object Storage should hold the original document and send an event to the Message Queue to start processing.
The Message Queue is necessary because millions of documents may need to be processed at a time.
The Worker extracts text from the document with OCR.
Database shoul"
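The flow in this answer can be sketched with in-memory stand-ins. This is only an illustration under stated assumptions: `upload_document`, `get_document_state`, `run_worker_once`, and `fake_ocr` are hypothetical names, the dicts and `queue.Queue` stand in for a real bucket, broker, and database, and the upload handler enqueues the event itself rather than the object store emitting it.

```python
import queue
import uuid

# Hypothetical in-memory stand-ins for the real components.
object_storage = {}            # bucket: doc_id -> raw bytes
message_queue = queue.Queue()  # message broker
database = {}                  # table: doc_id -> record


def upload_document(content: bytes) -> str:
    """Gateway handler: store the original, record its state, enqueue an event."""
    doc_id = str(uuid.uuid4())
    object_storage[doc_id] = content
    database[doc_id] = {"state": "QUEUED", "text": None}
    message_queue.put({"doc_id": doc_id})
    return doc_id


def get_document_state(doc_id: str) -> str:
    """Gateway handler: report the processing state of a document."""
    return database[doc_id]["state"]


def fake_ocr(content: bytes) -> str:
    """Placeholder for a real OCR engine; it only decodes the bytes."""
    return content.decode("utf-8", errors="replace")


def run_worker_once() -> bool:
    """Process one queued event. Returns False when the queue is empty."""
    try:
        event = message_queue.get_nowait()
    except queue.Empty:
        return False
    doc_id = event["doc_id"]
    text = fake_ocr(object_storage[doc_id])
    database[doc_id] = {"state": "DONE", "text": text}
    return True


doc_id = upload_document(b"invoice 42")
print(get_document_state(doc_id))  # QUEUED
run_worker_once()
print(get_document_state(doc_id))  # DONE
```

Because the gateway only writes to storage and the queue, uploads return quickly and the number of workers can be scaled independently of the upload rate.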
Kshitij I. - "A data lake and a data warehouse are both places that allow an organization to store large amounts of data.
When swimming in a lake, one would expect to come across all sorts of stuff: floating twigs, fish in the water, stones, chemicals, and sometimes maybe even a snake. Similarly, a data lake stores all forms of data that the company has, without any indexing. The data is available at any time but first needs to be cleaned up and reorganized before it can be used for any type of analysis.
A"
Joshua R. - "Hadoop is better than PySpark when you are dealing with extremely large-scale, batch-oriented, non-iterative workloads where in-memory computing isn't feasible or necessary, like log storage or ETL workflows that don't require fast response times. It's also better in situations where the Hadoop ecosystem is already deeply embedded and where there is a need for resource-conscious, fault-tolerant computation without the overhead of Spark's memory constraints. In such scenarios, Hadoop's disk-b"
Anzhe M. - "Two questions come to mind:
Does the 2nd job have to kick off at 12:30 AM?
Do other jobs depend on the 2nd job?
If both answers are no, we can simply postpone the 2nd job to allow sufficient time for the first one to complete. If the answers are yes, we could let the 2nd job retry up to a certain number of times. Make sure that even reaching the maximum number of retries won't delay or fail the downstream jobs."
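The bounded-retry option could look roughly like the sketch below. It is an assumption-laden illustration, not a scheduler feature: `run_when_upstream_ready`, `upstream_done`, and the parameter names are hypothetical, and a real orchestrator would usually provide its own retry or dependency mechanism.

```python
import time
from typing import Callable


def run_when_upstream_ready(
    upstream_done: Callable[[], bool],
    job: Callable[[], None],
    max_retries: int = 3,
    wait_seconds: float = 0.0,
) -> bool:
    """Run `job` once `upstream_done()` is true.

    Checks once, then re-checks up to `max_retries` times. Returns False
    instead of raising when the budget is exhausted, so the caller can
    decide how to protect downstream jobs.
    """
    for attempt in range(max_retries + 1):
        if upstream_done():
            job()
            return True
        if attempt < max_retries:
            time.sleep(wait_seconds)
    return False


# Simulated upstream job that finishes by the third check.
checks = iter([False, False, True])
ran = []
ok = run_when_upstream_ready(lambda: next(checks), lambda: ran.append("job2"))
print(ok, ran)  # True ['job2']
```

The worst-case delay is `max_retries * wait_seconds`, so those two values should be chosen to fit inside the slack before the downstream jobs start.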
Nitish C. - "Delta Lake is a metadata layer on top of cloud storage that gives a data lake transactional capabilities. It helps implement upsert/merge, as it enforces a schema on the data assets stored in the cloud.
It also offers various other capabilities such as liquid clustering, time travel, schema evolution, and deletes."
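To make the upsert/merge and time travel ideas concrete, here is a toy in-memory model. It is not the Delta Lake API and says nothing about how Delta Lake is implemented; `ToyVersionedTable` and its methods are hypothetical names, and the only point is the semantics: each merge commits a new snapshot, and older snapshots stay readable.

```python
import copy


class ToyVersionedTable:
    """Toy stand-in: every committed merge produces a new snapshot."""

    def __init__(self):
        self._versions = [{}]  # version 0 is an empty table

    def merge(self, rows):
        """Upsert by "id": update matching rows, insert the rest, then commit."""
        snapshot = copy.deepcopy(self._versions[-1])
        for row in rows:
            snapshot[row["id"]] = dict(row)
        self._versions.append(snapshot)
        return len(self._versions) - 1  # the new version number

    def read(self, version=None):
        """Read the latest snapshot, or an older one (time travel)."""
        index = len(self._versions) - 1 if version is None else version
        return sorted(self._versions[index].values(), key=lambda r: r["id"])


table = ToyVersionedTable()
v1 = table.merge([{"id": 1, "name": "a"}, {"id": 2, "name": "b"}])
v2 = table.merge([{"id": 2, "name": "B"}, {"id": 3, "name": "c"}])
print(table.read())            # latest: three rows, id 2 now "B"
print(table.read(version=v1))  # as of v1: the two original rows
```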
Nitish C. - "An all-purpose cluster stays up and running for long periods regardless of any single job, so it is preferred for notebooks and ad hoc work. A job cluster spins up for the submitted job and shuts down once it completes, so it is preferred for scheduled production workloads, as it also offers compute isolation."
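In a job definition the difference usually comes down to which cluster field a task carries. The sketch below builds a task entry as a plain dict; `build_task` is a hypothetical helper, the field names (`existing_cluster_id`, `new_cluster`, `notebook_task`) reflect my understanding of the Databricks Jobs API and should be checked against the current reference, and the cluster ID, runtime version, and node type are placeholder values.

```python
def build_task(notebook_path, existing_cluster_id=None):
    """Build one job task entry (hypothetical helper, placeholder values)."""
    task = {
        "task_key": "main",
        "notebook_task": {"notebook_path": notebook_path},
    }
    if existing_cluster_id is not None:
        # All-purpose cluster: the task attaches to a cluster that is already up.
        task["existing_cluster_id"] = existing_cluster_id
    else:
        # Job cluster: created for this run and terminated when it finishes.
        task["new_cluster"] = {
            "spark_version": "13.3.x-scala2.12",  # placeholder runtime version
            "node_type_id": "i3.xlarge",          # placeholder node type
            "num_workers": 2,
        }
    return task


adhoc_task = build_task("/Shared/explore", existing_cluster_id="0000-000000-example")
scheduled_task = build_task("/Shared/nightly_etl")
print(sorted(adhoc_task))
print(sorted(scheduled_task))
```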