Before you declare the production cluster ready for work, there are a number of requirements worth considering, so that you do not panic later when something goes wrong! =)
Let's start with the first point: cluster infrastructure and architecture. Since this is the first launch and we expect to process up to 50 GB of data per day, we can size the Apache NiFi cluster to balance performance against cost.
For this data volume at launch, two nodes would be enough, but three are preferable. Three nodes allow load balancing and fault tolerance, help avoid bottlenecks as the data volume grows, and minimize downtime.
Every NiFi node processes data, and any node can also take on a coordination role (Cluster Coordinator or Primary Node). ZooKeeper is used to elect those roles and to coordinate the work of the nodes.
Disk system performance is key, especially for storing content and data streams, as NiFi makes heavy use of the disk subsystem for temporary file storage.
A reliable and fast network is essential for proper cluster operation.
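To put the 50 GB per day figure in perspective for disk and network planning, here is a small back-of-the-envelope calculation. It is only a sketch: the decimal units (1 GB = 1000 MB) and the `peak_factor` parameter are my assumptions for illustration, and real traffic is rarely spread evenly over the day.

```python
# Rough throughput estimate for a daily data volume.
# Assumptions (not from the article): decimal units and a
# hypothetical peak_factor that models bursty arrival.

SECONDS_PER_DAY = 24 * 60 * 60


def average_mb_per_second(gb_per_day: float) -> float:
    """Average ingest rate in MB/s if data arrived evenly over the day."""
    return gb_per_day * 1000 / SECONDS_PER_DAY


def per_node_mb_per_second(gb_per_day: float, nodes: int, peak_factor: float = 1.0) -> float:
    """Rate each node must sustain when load is spread evenly over `nodes`."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    return average_mb_per_second(gb_per_day) * peak_factor / nodes


print(f"cluster average:   {average_mb_per_second(50):.2f} MB/s")
print(f"per node, 3 nodes: {per_node_mb_per_second(50, 3):.2f} MB/s")
# One node down during a 10x burst: the two survivors share the load.
print(f"per node, 2 nodes, 10x peak: {per_node_mb_per_second(50, 2, peak_factor=10):.2f} MB/s")
```

Even with a generous burst factor the per-node rate stays modest, which is why the node count here is driven by fault tolerance rather than raw throughput.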
Let's move on to the next step! Now, let's take a detailed look at the Apache NiFi configuration to prepare the cluster for production operation.
The main configuration files that need to be set up for the NiFi cluster to run include:
nifi.properties.
The nifi.properties file contains the basic parameters that need to be configured for the cluster to work.
Cluster Configuration
nifi.cluster.is.node=true: Enables cluster mode for the NiFi node.
nifi.cluster.node.address=HOSTNAME: Set the IP address or hostname of the node.
nifi.cluster.node.protocol.port=PORT: Set the port on which the node will accept cluster requests. For example, 9999.
nifi.zookeeper.connect.string=ZK_HOST1:2181,ZK_HOST2:2181,ZK_HOST3:2181: Specify the addresses of the ZooKeeper nodes, separated by commas.
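nifi.cluster.flow.election.max.wait.time=5 mins: The maximum time the cluster waits at startup before electing one flow as the correct one.
nifi.cluster.flow.election.max.candidates=1: The number of connected nodes that ends the flow election early, without waiting for the full max.wait.time. With 1, the election ends as soon as the first node connects; a common alternative is to set it to the number of nodes in the cluster (3 here).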
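A typo in one of these keys is easy to miss, because an unknown key in a properties file is typically just ignored. The sketch below checks that the cluster keys listed above are present and plausible. The function names are hypothetical, and the parser is deliberately simplified (it does not handle line continuations or escapes).

```python
# Minimal sanity check for the cluster section of nifi.properties.
# Helper names are illustrative; the parser ignores continuation lines.

REQUIRED_CLUSTER_KEYS = (
    "nifi.cluster.is.node",
    "nifi.cluster.node.address",
    "nifi.cluster.node.protocol.port",
    "nifi.zookeeper.connect.string",
)


def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blanks and comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")) or "=" not in line:
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def cluster_problems(props: dict) -> list:
    """Return human-readable problems; an empty list means the keys look fine."""
    problems = [f"missing or empty: {key}" for key in REQUIRED_CLUSTER_KEYS if not props.get(key)]
    is_node = props.get("nifi.cluster.is.node")
    if is_node and is_node.lower() != "true":
        problems.append("nifi.cluster.is.node must be true for cluster mode")
    port = props.get("nifi.cluster.node.protocol.port")
    if port and not port.isdigit():
        problems.append("nifi.cluster.node.protocol.port must be a number")
    return problems
```

Usage: read `conf/nifi.properties` on each node, pass its text through `parse_properties`, and print whatever `cluster_problems` returns before starting the node.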
Web Interface Configuration
nifi.web.http.host=HOSTNAME: Set the IP address or hostname for the web interface.
nifi.web.http.port=8080: Port for the web interface (if using HTTP).
For HTTPS:
nifi.web.https.host=HOSTNAME
nifi.web.https.port=8443
nifi.security.keystore=/path/to/keystore.jks
nifi.security.keystoreType=JKS
nifi.security.keystorePasswd=your_keystore_password
nifi.security.keyPasswd=your_key_password
nifi.security.truststore=/path/to/truststore.jks
nifi.security.truststoreType=JKS
nifi.security.truststorePasswd=your_truststore_password
Repository directories
nifi.content.repository.directory.default=./content_repository: Path to the directory for content storage.
nifi.flowfile.repository.directory=./flowfile_repository: Path to the directory for storing flow metadata.
nifi.provenance.repository.directory.default=./provenance_repository: Path to the directory to store provenance data.
Queue and Provenance Settings
nifi.queue.swap.threshold=20000: The number of FlowFiles in a single queue after which the excess FlowFiles are swapped out to disk.
nifi.provenance.repository.rollover.time=30 secs: The time after which new Provenance repository files will be created.
nifi.provenance.repository.max.storage.time=24 hours: Maximum storage time for Provenance.
nifi.provenance.repository.max.storage.size=10 GB: The maximum amount of data in Provenance. Adjust based on available disk space.
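Since the provenance limit should be adjusted to the available disk space, it can help to compare the two before the first start. This is a sketch under assumptions: I treat size units as binary (1 GB = 1024^3 bytes), and the `headroom` parameter is a hypothetical safety margin, not a NiFi setting.

```python
import re
import shutil

# Assumption: binary units, i.e. 1 KB = 1024 bytes.
_UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}


def parse_size(value: str) -> int:
    """Convert a string such as '10 GB' into a number of bytes."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([KMGT]?B)\s*", value, re.IGNORECASE)
    if not match:
        raise ValueError(f"unrecognized size: {value!r}")
    number, unit = match.groups()
    return int(float(number) * _UNITS[unit.upper()])


def fits_on_disk(path: str, max_storage: str, headroom: float = 0.2) -> bool:
    """True if the free space on the volume holding `path` covers
    max_storage plus the given headroom fraction."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= parse_size(max_storage) * (1 + headroom)
```

For example, `fits_on_disk("./provenance_repository", "10 GB")` checks the volume that holds the provenance directory (the directory must already exist). Keep in mind that other repositories may share the same volume.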
Configuration for reliability and fault tolerance
nifi.cluster.protocol.heartbeat.interval=5 secs: Interval for sending heartbeat between nodes.
nifi.cluster.protocol.is.secure=true: Enables encryption of traffic between nodes.
zookeeper.properties.
When using an external ZooKeeper cluster, configure it so that it provides reliable coordination between the nodes.
tickTime=2000: The basic time unit in milliseconds that ZooKeeper uses for heartbeats and timeouts.
initLimit=10: The maximum number of ticks a follower may take to connect to the leader and synchronize with it.
syncLimit=5: The maximum number of ticks a follower may lag behind the leader before it is dropped.
server.1=HOST1:2888:3888: Defines a ZooKeeper node; add one server.N line per node. The first port is used for communication between nodes, the second port is used for leader election.
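Because initLimit and syncLimit are expressed in ticks, the real timeouts only become visible after multiplying by tickTime. The short sketch below does that arithmetic and also shows how many ZooKeeper nodes can fail while a majority quorum survives. The function names are mine, for illustration only.

```python
def zk_timeouts_seconds(tick_time_ms: int, init_limit: int, sync_limit: int) -> tuple:
    """Return (initial sync timeout, follower lag timeout) in seconds."""
    return (init_limit * tick_time_ms / 1000, sync_limit * tick_time_ms / 1000)


def tolerated_failures(ensemble_size: int) -> int:
    """Nodes that may fail while a strict majority stays available."""
    return (ensemble_size - 1) // 2


init_s, sync_s = zk_timeouts_seconds(tick_time_ms=2000, init_limit=10, sync_limit=5)
print(f"initial sync timeout: {init_s} s, follower lag timeout: {sync_s} s")
print(f"a 3-node ensemble tolerates {tolerated_failures(3)} failed node(s)")
```

With the values above a follower gets 20 seconds for its initial sync and 10 seconds of lag. A two-node ensemble tolerates no failures at all, which is one more argument for three ZooKeeper hosts.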
authorizers.xml.
This file manages access to NiFi resources, including users and groups.
Access Management: Define roles and groups as well as appropriate access policies to secure the system.
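<authorizers>
<userGroupProvider>
<identifier>file-user-group-provider</identifier>
<class>org.apache.nifi.authorization.FileUserGroupProvider</class>
<property name="Users File">./conf/users.xml</property>
</userGroupProvider>
<accessPolicyProvider>
<identifier>file-access-policy-provider</identifier>
<class>org.apache.nifi.authorization.FileAccessPolicyProvider</class>
<property name="User Group Provider">file-user-group-provider</property>
<property name="Authorizations File">./conf/authorizations.xml</property>
</accessPolicyProvider>
<authorizer>
<identifier>managed-authorizer</identifier>
<class>org.apache.nifi.authorization.StandardManagedAuthorizer</class>
<property name="Access Policy Provider">file-access-policy-provider</property>
</authorizer>
</authorizers>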
login-identity-providers.xml
This file is used to configure authentication. You can configure LDAP, Kerberos, or other authentication mechanisms.
<loginIdentityProviders>
<provider>
<identifier>ldap-provider</identifier>
<class>org.apache.nifi.ldap.LdapProvider</class>
<property name="Authentication Strategy">SIMPLE</property>
<property name="Manager DN">cn=admin,dc=example,dc=com</property>
<property name="Manager Password">admin_password</property>
<property name="Url">ldap://ldap.example.com:389</property>
<property name="User Search Base">ou=users,dc=example,dc=com</property>
<property name="User Search Filter">uid={0}</property>
<property name="Identity Strategy">USE_USERNAME</property>
</provider>
</loginIdentityProviders>
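A malformed element in this file is easy to overlook, so a quick well-formedness check before restarting can save a failed startup. The sketch below parses the file and lists the provider identifiers, so you can compare them with the identifier you reference from nifi.properties. The helper names are hypothetical, and I assume the file has no XML namespace.

```python
import xml.etree.ElementTree as ET


def provider_identifiers(xml_text: str) -> list:
    """Return the <identifier> of every <provider>; raises ET.ParseError
    if the document is not well-formed XML."""
    root = ET.fromstring(xml_text)
    return [(p.findtext("identifier") or "").strip() for p in root.findall("provider")]


def provider_is_defined(xml_text: str, configured_id: str) -> bool:
    """True if `configured_id` matches one of the defined providers."""
    return configured_id in provider_identifiers(xml_text)
```

Usage: read `conf/login-identity-providers.xml` into a string and call `provider_is_defined(text, "ldap-provider")`. A `ParseError` points at a syntax problem, and `False` means the identifier does not match any provider.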
After editing the configuration files, restart all NiFi nodes for the changes to take effect. Then verify that every node has connected to the cluster and synchronized its flow.
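The connection check can be scripted against the NiFi REST API instead of clicking through the UI. This is a sketch under assumptions: it uses the `/nifi-api/controller/cluster` endpoint and expects each node entry to carry `address`, `apiPort`, and `status` fields. The base URL and bearer token are placeholders, and a cluster with self-signed certificates would additionally need an SSL context that trusts them.

```python
import json
import urllib.request


def fetch_cluster(base_url: str, token: str = "", timeout: float = 10.0) -> dict:
    """GET <base_url>/nifi-api/controller/cluster and return the parsed JSON."""
    request = urllib.request.Request(f"{base_url.rstrip('/')}/nifi-api/controller/cluster")
    if token:
        # Secured clusters expect an access token; obtaining it is out of scope here.
        request.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def nodes_not_connected(cluster_response: dict) -> list:
    """List 'address:port (STATUS)' for every node whose status is not CONNECTED."""
    nodes = cluster_response.get("cluster", {}).get("nodes", [])
    return [
        f"{node.get('address')}:{node.get('apiPort')} ({node.get('status')})"
        for node in nodes
        if node.get("status") != "CONNECTED"
    ]
```

Usage: `nodes_not_connected(fetch_cluster("http://HOSTNAME:8080"))` should return an empty list once all nodes have joined.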
The hardware is in place and the configs are in place, now let's delve into security...
Security in Apache NiFi covers several areas: encryption, authentication, authorization, and access control. Let's look at them in order.
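To configure TLS/SSL, use the HTTPS, keystore, and truststore properties from the nifi.properties section above (nifi.web.https.*, nifi.security.keystore*, nifi.security.truststore*), and set nifi.cluster.protocol.is.secure=true so that traffic between the nodes is encrypted as well.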
It is recommended to use proven and reliable mechanisms to authenticate users and systems that connect to NiFi.
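If your organization uses LDAP, you can integrate it with NiFi so that users sign in with their directory credentials. The LDAP configuration is defined in login-identity-providers.xml, and the provider is selected in nifi.properties with nifi.security.user.login.identity.provider=ldap-provider: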
<loginIdentityProviders>
<provider>
<identifier>ldap-provider</identifier>
<class>org.apache.nifi.ldap.LdapProvider</class>
<property name="Authentication Strategy">SIMPLE</property>
<property name="Manager DN">cn=admin,dc=example,dc=com</property>
<property name="Manager Password">admin_password</property>
<property name="Url">ldap://ldap.example.com:389</property>
<property name="User Search Base">ou=users,dc=example,dc=com</property>
<property name="User Search Filter">uid={0}</property>
<property name="Identity Strategy">USE_USERNAME</property>
</provider>
</loginIdentityProviders>
Authentication via Kerberos is also possible, but I'd like to talk about that in the next article =)
Once authentication is configured, it is important to properly configure authorization - access to resources and operations in NiFi.
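Let's define access policies for users and groups to restrict access to critical NiFi components. The providers that store users, groups, and policies are configured in authorizers.xml; the policies themselves end up in the authorizations file that it points to.
Example of the provider definitions: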
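<authorizers>
<userGroupProvider>
<identifier>file-user-group-provider</identifier>
<class>org.apache.nifi.authorization.FileUserGroupProvider</class>
<property name="Users File">./conf/users.xml</property>
</userGroupProvider>
<accessPolicyProvider>
<identifier>file-access-policy-provider</identifier>
<class>org.apache.nifi.authorization.FileAccessPolicyProvider</class>
<property name="User Group Provider">file-user-group-provider</property>
<property name="Authorizations File">./conf/authorizations.xml</property>
</accessPolicyProvider>
<authorizer>
<identifier>managed-authorizer</identifier>
<class>org.apache.nifi.authorization.StandardManagedAuthorizer</class>
<property name="Access Policy Provider">file-access-policy-provider</property>
</authorizer>
</authorizers>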
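The pieces of authorizers.xml refer to each other by identifier: in NiFi's file-based setup the access policy provider names its user group provider through a property called `User Group Provider`, and a managed authorizer names the policy provider through `Access Policy Provider`. A dangling reference is a common cause of a failed secure startup, so here is a sketch of a cross-reference check. The function names are hypothetical, and other authorizer implementations may use different properties.

```python
import xml.etree.ElementTree as ET


def _property(element, name):
    """Return the text of <property name="..."> inside element, or None."""
    for prop in element.findall("property"):
        if prop.get("name") == name:
            return (prop.text or "").strip()
    return None


def _identifiers(root, tag):
    return {(e.findtext("identifier") or "").strip() for e in root.findall(tag)}


def authorizer_problems(xml_text: str) -> list:
    """Check identifier cross-references; an empty list means none were found."""
    root = ET.fromstring(xml_text)
    group_ids = _identifiers(root, "userGroupProvider")
    policy_ids = _identifiers(root, "accessPolicyProvider")
    problems = []
    for provider in root.findall("accessPolicyProvider"):
        ref = _property(provider, "User Group Provider")
        if ref not in group_ids:
            problems.append(f"accessPolicyProvider references unknown user group provider: {ref!r}")
    authorizers = root.findall("authorizer")
    if not authorizers:
        problems.append("no <authorizer> element defined")
    for authorizer in authorizers:
        ref = _property(authorizer, "Access Policy Provider")
        if ref not in policy_ids:
            problems.append(f"authorizer references unknown access policy provider: {ref!r}")
    return problems
```

Usage: pass the contents of `conf/authorizers.xml` to `authorizer_problems` and fix anything it reports before restarting the nodes.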
After security configuration, it is recommended to test the system for vulnerabilities and perform load testing. This will help ensure that all aspects of security are working properly and the system is ready for use.
At this point, your Apache NiFi cluster setup should be complete and ready to run in a production environment. If you have questions or need additional details, I'm here to help!