Massive Data Processing/Design and Implementation of Big Data Applications/Processing Tweet Streams

=Geeser - A tool for processing raw tweet streams=
==Application Description==
===Definition===
Geeser Project intends to give a toolbox for data analysts to work with m...
 
Storm is mainly composed of two types of processing units: spouts and bolts. Spouts generate streams, while bolts consume streams and produce new streams on their output. This model is similar to the traditional producer-consumer approach seen in early network applications, and the simplicity of the producer-consumer approach turns out to be one of Storm's greatest strengths.
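This producer-consumer model can be pictured with a minimal Python sketch using plain queues. This is not Storm's actual API; the `Spout` and `Bolt` classes here are just illustrative stand-ins for the concept:

```python
from queue import Queue

class Spout:
    """Generates a stream of tuples into its output queue."""
    def __init__(self, out):
        self.out = out
    def run(self, items):
        for item in items:
            self.out.put(item)
        self.out.put(None)  # terminate signal ends the stream

class Bolt:
    """Consumes tuples from an input stream and emits new tuples downstream."""
    def __init__(self, inp, out, fn):
        self.inp, self.out, self.fn = inp, out, fn
    def run(self):
        while True:
            item = self.inp.get()
            if item is None:        # propagate termination downstream
                self.out.put(None)
                break
            self.out.put(self.fn(item))

a, b = Queue(), Queue()
Spout(a).run(["storm", "stream"])
Bolt(a, b, str.upper).run()   # a bolt that transforms each tuple

results = []
while (t := b.get()) is not None:
    results.append(t)
print(results)  # ['STORM', 'STREAM']
```

In real Storm the units run in parallel on different nodes; here they run sequentially only to keep the sketch short.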
 
[[Ficheiro:Storm-example.png|centro|Figure 1: Example of a Storm topology]]
 
 
The graph generated by the connection of bolts and spouts is called a topology. It maps what the cluster will do, and it is a very simple abstraction that helps software engineers work on the modules separately. The communication protocol between the processing units is based on tuples, similar to JSON. Therefore, in terms of the CAP theorem, Storm satisfies the Availability and Partition tolerance requirements, since tuples may arrive out of order during processing.
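The JSON-like nature of these tuples can be illustrated with a short Python sketch. The field names below are invented for illustration only:

```python
import json

# A hypothetical tuple emitted by a spout: named fields, JSON-serializable.
tweet_tuple = {"user": "alice", "text": "storm is fast", "ts": 1700000000}

wire = json.dumps(tweet_tuple)   # serialized for transport between units
received = json.loads(wire)      # deserialized by the consuming bolt
print(received["text"])          # storm is fast
```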
Storm requires three main software components to work: Nimbus, Zookeeper, and the workers. Nimbus is the component responsible for code deployment on the worker nodes. Apache Zookeeper is a software for load control on the nodes; its load is quite low, since its only function is to choose which node will process the next tuple. If fault tolerance is a requirement, the number of Zookeeper processes should be increased; for most cases, a single running instance is enough. For details on how to install these requirements, see the Install section.
 
[[File:Storm-components.png|Figure 2: Storm components]]
 
The system bootstrap works as follows. All worker nodes report to Zookeeper as soon as the code is submitted to Nimbus. The binary code is then submitted to each worker node. When the worker nodes are ready to take a job, Zookeeper sends each node a tuple to be processed, and this continues until a spout sends a terminate signal.
The exclamation topology is a very simple topology with only one objective: add exclamation marks at the end of random words. In this example, we have two instances of the same ExclamationBolt class. The tuple in this case is just a simple string. One interesting fact in this example is that ordering is not important, so we can create a superscalar topology.
 
[[File:ExclamationTopology.png|Figure 3: Exclamation topology]]
 
In this case, we have 10 processes for the spout, 3 for ExclamationBolt 1, and 2 for ExclamationBolt 2. The run is quite fast and the overhead is small.
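The core logic of the exclamation example can be sketched in a few lines of plain Python (not Storm code; the word list and the three-"!" suffix follow the classic Storm tutorial, the rest is illustrative):

```python
import random

def word_spout(n, words=("nathan", "mike", "jackson")):
    """Emits n random words, like the topology's spout."""
    for _ in range(n):
        yield random.choice(words)

def exclamation_bolt(stream):
    """Appends '!!!' to every incoming word."""
    for word in stream:
        yield word + "!!!"

# Two chained instances of the same bolt, as in the topology diagram,
# so every word ends up with six exclamation marks.
out = list(exclamation_bolt(exclamation_bolt(word_spout(4))))
print(out)  # four words, each ending in '!!!!!!'
```

Because each word is processed independently, the order in which tuples flow through the bolts does not matter, which is what makes the parallel (superscalar) configuration safe.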
The word count topology is another simple topology, used to count words in sentences. For that, a spout randomly emits one of a set of 5 different sentences. Then, a bolt implemented in Python splits each sentence into words. Finally, a bolt counts the word frequencies:
 
[[File:WordCountTopology.png|Figure 4: Word count topology]]
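The logic of the two bolts (split, then count) can be sketched compactly in plain Python, assuming an in-process pipeline rather than Storm's multilang protocol. The sentences below are illustrative samples, not the topology's actual set of 5:

```python
from collections import Counter

SENTENCES = [
    "the cow jumped over the moon",
    "an apple a day keeps the doctor away",
]  # the real spout cycles over 5 sentences; 2 shown here

def split_bolt(sentences):
    """Splits each sentence into a stream of words."""
    for sentence in sentences:
        yield from sentence.split()

# The count bolt keeps a running frequency table of the word stream.
counts = Counter(split_bolt(SENTENCES))
print(counts["the"])  # 3
```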
 
A good improvement to this topology would be adding some persistence to the word count structure, so that another bolt could query the frequency of a given word. I show this in the following sections; for now, I am focusing on the syntax, the results, and the implementation.
This is the most complex example in this section. Here it is necessary to implement methods that perform the join considering that tuples arrive unordered. For that, the communication buffer is used to hold each tuple until its matching counterpart arrives, and timeouts are used to solve the starvation problems inherent to this approach. Because of that, joins in Storm topologies might introduce bottlenecks and should be avoided at all costs.
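The buffering-and-timeout idea behind the join can be sketched as a toy in-memory version. This is not how Storm's real join bolt is implemented (it relies on field groupings and tick tuples); the stream names "gender" and "age" and all other identifiers here are illustrative:

```python
import time

class SingleJoin:
    """Buffers tuples from two streams keyed by id; emits a joined tuple
    once both sides have arrived, and drops entries older than `timeout`
    seconds so an unmatched tuple cannot cause starvation."""
    def __init__(self, timeout=5.0):
        self.timeout = timeout
        self.buffer = {}  # key -> (side, value, arrival_time)

    def receive(self, side, key, value, now=None):
        now = time.monotonic() if now is None else now
        # Expire stale entries whose partner never showed up in time.
        self.buffer = {k: v for k, v in self.buffer.items()
                       if now - v[2] <= self.timeout}
        pending = self.buffer.get(key)
        if pending and pending[0] != side:
            del self.buffer[key]   # partner found: emit the joined tuple
            if side == "age":
                return (key, pending[1], value)
            return (key, value, pending[1])
        self.buffer[key] = (side, value, now)  # wait for the other side
        return None

join = SingleJoin(timeout=5.0)
join.receive("gender", 1, "male", now=0.0)           # buffered, no match yet
joined = join.receive("age", 1, 32, now=1.0)         # partner present: join
print(joined)  # (1, 'male', 32)

join.receive("age", 2, 40, now=0.0)                  # buffered
expired = join.receive("gender", 2, "female", now=10.0)  # partner timed out
print(expired)  # None
```

The timeout trades completeness for bounded memory and latency: a tuple whose partner is too late is simply dropped, which is exactly why joins can become a bottleneck in a topology.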
 
[[File:SingleJoinTopology.png|Figure 5: Single join topology]]
 
===Requirements===